CEEMS Exporter Metrics
The following table lists the metrics exposed by the CEEMS exporter, along with the labels and description of each metric. The first column shows the collector that the metric belongs to.
Collector | Name | Labels | Description |
---|---|---|---|
cpu | ceems_cpu_count | hostname | Number of CPUs in the current host |
cpu | ceems_cpu_per_core_count | hostname | Number of logical CPUs per physical CPU |
cpu | ceems_cpu_seconds_total | hostname, mode | Number of seconds spent in each mode |
meminfo | ceems_meminfo_MemTotal_bytes | hostname | Total memory in the current host. As reported in /proc/meminfo |
meminfo | ceems_meminfo_MemFree_bytes | hostname | Total free memory in the current host. As reported in /proc/meminfo |
meminfo | ceems_meminfo_MemAvailable_bytes | hostname | Total available memory in the current host. As reported in /proc/meminfo |
ipmi | ceems_ipmi_dcmi_current_watts | hostname | Current power consumption reported by IPMI DCMI |
ipmi | ceems_ipmi_dcmi_avg_watts | hostname | Average power consumption reported by IPMI DCMI within sampling period |
ipmi | ceems_ipmi_dcmi_min_watts | hostname | Minimum power consumption reported by IPMI DCMI within sampling period |
ipmi | ceems_ipmi_dcmi_max_watts | hostname | Maximum power consumption reported by IPMI DCMI within sampling period |
rapl | ceems_rapl_package_joules_total | path, index | Current RAPL package energy value. Labels index and path give details about the package. |
rapl | ceems_rapl_dram_joules_total | path, index | Current RAPL DRAM energy value. Labels index and path give details about the package. |
rapl | ceems_rapl_core_joules_total | path, index | Current RAPL core energy value. Labels index and path give details about the package. |
rapl | ceems_rapl_package_power_limit_watts_total | path, index | Current RAPL power limit value. Labels index and path give details about the package. |
slurm, libvirt | ceems_compute_unit_cpus | manager, uuid | Number of CPUs allocated for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_user_seconds_total | manager, uuid | Number of CPU seconds in user space for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_system_seconds_total | manager, uuid | Number of CPU seconds in kernel space for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_total_bytes | manager, uuid | Total memory allocated for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_used_bytes | manager, uuid | Current total memory used by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_rss_bytes | manager, uuid | Current RSS memory used by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_fail_count | manager, uuid | Current number of memory limit hits by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memsw_fail_count | manager, uuid | Current number of memory + swap limit hits by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_cache_bytes | manager, uuid | Current cached memory by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_psi_seconds | manager, uuid | Current number of CPU PSI seconds of compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_psi_seconds | manager, uuid | Current number of memory PSI seconds of compute unit identified by label uuid. |
slurm | ceems_compute_unit_rdma_hca_handles | manager, uuid | Current number of allocated RDMA HCA handles for compute unit identified by label uuid. |
slurm | ceems_compute_unit_rdma_hca_objects | manager, uuid | Current number of allocated RDMA HCA objects for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_gpu_index_flag | manager, gpuuuid, index | GPU identified by label index or gpuuuid is allocated to job identified by label uuid. |
libvirt | ceems_compute_unit_blkio_read_total_bytes | manager, device | Total block IO bytes read by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_write_total_bytes | manager, device | Total block IO bytes written by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_read_total_requests | manager, device | Total block IO read requests by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_write_total_requests | manager, device | Total block IO write requests by instance identified by label uuid. |
perf | ceems_perf_cpucycles_total | manager, uuid | Total number of CPU cycles for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_instructions_total | manager, uuid | Total number of CPU instructions for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_branch_instructions_total | manager, uuid | Total number of CPU branch instructions for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_branch_misses_total | manager, uuid | Total number of CPU branch misses for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_cache_refs_total | manager, uuid | Total number of cache references for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_cache_misses_total | manager, uuid | Total number of cache misses for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_ref_cpucycles_total | manager, uuid | Total number of reference CPU cycles for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_page_faults_total | manager, uuid | Total number of page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_context_switches_total | manager, uuid | Total number of context switches for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_cpu_migrations_total | manager, uuid | Total number of CPU migrations for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_minor_faults_total | manager, uuid | Total number of minor page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_major_faults_total | manager, uuid | Total number of major page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_read_hits_total | manager, uuid | Total number of L1 cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_read_misses_total | manager, uuid | Total number of L1 cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_write_hits_total | manager, uuid | Total number of L1 cache write hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1_instr_read_misses_total | manager, uuid | Total number of L1 instruction read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_tlb_instr_read_hits_total | manager, uuid | Total number of TLB cache instruction read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_tlb_instr_read_misses_total | manager, uuid | Total number of TLB cache instruction read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_read_hits_total | manager, uuid | Total number of LL cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_read_misses_total | manager, uuid | Total number of LL cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_write_hits_total | manager, uuid | Total number of LL cache write hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_write_misses_total | manager, uuid | Total number of LL cache write misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_bpu_read_hits_total | manager, uuid | Total number of BPU cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_bpu_read_misses_total | manager, uuid | Total number of BPU cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
ebpf | ceems_ebpf_write_bytes_total | manager, uuid, mountpoint | Total number of bytes written by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_write_requests_total | manager, uuid, mountpoint | Total number of write requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_write_errors_total | manager, uuid, mountpoint | Total number of write errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_bytes_total | manager, uuid, mountpoint | Total number of bytes read by compute unit identified by label uuid from different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_requests_total | manager, uuid, mountpoint | Total number of read requests by compute unit identified by label uuid from different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_errors_total | manager, uuid, mountpoint | Total number of read errors by compute unit identified by label uuid from different mounts identified by mountpoint. |
ebpf | ceems_ebpf_open_requests_total | manager, uuid | Total number of open requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_open_errors_total | manager, uuid | Total number of open request errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_create_requests_total | manager, uuid | Total number of create requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_create_errors_total | manager, uuid | Total number of create request errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_unlink_requests_total | manager, uuid | Total number of unlink/remove requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_unlink_errors_total | manager, uuid | Total number of unlink/remove request errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_ingress_packets_total | manager, uuid, proto, family | Total number of ingress packets of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_ingress_bytes_total | manager, uuid, proto, family | Total number of ingress bytes of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_egress_packets_total | manager, uuid, proto, family | Total number of egress packets of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_egress_bytes_total | manager, uuid, proto, family | Total number of egress bytes of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_retrans_packets_total | manager, uuid, proto, family | Total number of retransmission packets of protocol proto and family family by compute unit identified by label uuid (only for TCP). |
ebpf | ceems_ebpf_retrans_bytes_total | manager, uuid, proto, family | Total number of retransmission bytes of protocol proto and family family by compute unit identified by label uuid. |
rdma | ceems_rdma_port_constraint_errors_received_total | manager, device, port | Total number of packets received on the switch physical port that are discarded (system-wide metric). |
rdma | ceems_rdma_port_constraint_errors_transmitted_total | manager, device, port | Total number of packets not transmitted from the switch physical port (system-wide metric). |
rdma | ceems_rdma_port_data_received_bytes_total | manager, device, port | Total number of data octets received on all links (system-wide metric). |
rdma | ceems_rdma_port_data_transmitted_bytes_total | manager, device, port | Total number of data octets transmitted on all links (system-wide metric). |
rdma | ceems_rdma_port_discards_received_total | manager, device, port | Total number of inbound packets discarded by the port because the port is down or congested (system-wide metric). |
rdma | ceems_rdma_port_discards_transmitted_total | manager, device, port | Total number of outbound packets discarded by the port because the port is down or congested (system-wide metric). |
rdma | ceems_rdma_port_errors_received_total | manager, device, port | Total number of packets containing an error that were received on this port (system-wide metric). |
rdma | ceems_rdma_port_packets_received_total | manager, device, port | Total number of packets received on all VLs by this port (including errors) (system-wide metric). |
rdma | ceems_rdma_port_packets_transmitted_total | manager, device, port | Total number of packets transmitted on all VLs from this port (including errors). |
rdma | ceems_rdma_state_id | manager, device, port | State of the InfiniBand port (0: no change, 1: down, 2: init, 3: armed, 4: active, 5: act defer). |
rdma | ceems_rdma_rx_write_requests | manager, uuid, device, port | Total number of received write requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_rx_read_requests | manager, uuid, device, port | Total number of received read requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_rx_atomic_requests | manager, uuid, device, port | Total number of received atomic requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_cqe_error | manager, uuid, device, port | Total number of times requester detected CQEs completed with errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_cqe_flush_error | manager, uuid, device, port | Total number of times requester detected CQEs completed with flushed errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_remote_access_errors | manager, uuid, device, port | Total number of times requester detected remote access errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_remote_invalid_request | manager, uuid, device, port | Total number of times requester detected remote invalid request errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_cqe_error | manager, uuid, device, port | Total number of times responder detected CQEs completed with errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_cqe_flush_error | manager, uuid, device, port | Total number of times responder detected CQEs completed with flushed errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_local_length_error | manager, uuid, device, port | Total number of times responder detected local length errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_remote_access_errors | manager, uuid, device, port | Total number of times responder detected remote access errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_qps_active | manager, uuid, device, port | Total number of active QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_cqs_active | manager, uuid, device, port | Total number of active CQs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_mrs_active | manager, uuid, device, port | Total number of active MRs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_cqe_len_active | manager, uuid, device, port | Total length of active CQEs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_mrs_len_active | manager, uuid, device, port | Total length of active MRs for device device and compute unit identified by label uuid. |
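
To illustrate how the per-compute-unit metrics in the table can be consumed, the following is a minimal Python sketch that parses metric samples in the Prometheus text exposition format and sums user and system CPU seconds per `uuid`. The sample text, label values and numbers are made up for illustration, and the helper names `parse_samples` and `cpu_seconds_by_unit` are hypothetical; they are not part of CEEMS. The parser is deliberately simplified: it ignores optional timestamps and assumes one sample per line.

```python
import re
from collections import defaultdict

# Hypothetical exporter output; label values and numbers are invented.
SAMPLE = """\
# TYPE ceems_compute_unit_cpu_user_seconds_total counter
ceems_compute_unit_cpu_user_seconds_total{manager="slurm",uuid="1001"} 120.5
ceems_compute_unit_cpu_system_seconds_total{manager="slurm",uuid="1001"} 30.25
ceems_compute_unit_cpu_user_seconds_total{manager="slurm",uuid="1002"} 10
ceems_compute_unit_cpu_system_seconds_total{manager="slurm",uuid="1002"} 2.5
ceems_cpu_count{hostname="node1"} 64
"""

# metric_name{label="value",...} value   (timestamps are not handled)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def cpu_seconds_by_unit(samples):
    """Sum user and system CPU seconds for each compute unit uuid."""
    wanted = {
        "ceems_compute_unit_cpu_user_seconds_total",
        "ceems_compute_unit_cpu_system_seconds_total",
    }
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name in wanted and "uuid" in labels:
            totals[labels["uuid"]] += value
    return dict(totals)


if __name__ == "__main__":
    for uuid, seconds in sorted(cpu_seconds_by_unit(parse_samples(SAMPLE)).items()):
        print(f"compute unit {uuid}: {seconds} CPU seconds")
```

In a real deployment these series would normally be scraped by Prometheus and aggregated with a query language rather than parsed by hand; the sketch only shows how the metric names and the `uuid` label fit together.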