CEEMS Exporter Metrics

The following table lists the metrics exposed by the CEEMS exporter, along with the labels and description of each metric. The first column shows the collector that the metric belongs to.

| Collector | Name | Labels | Description |
|---|---|---|---|
| cpu | `ceems_cpu_count` | `hostname` | Number of CPUs in the current host. |
| cpu | `ceems_cpu_per_core_count` | `hostname` | Number of logical CPUs per physical CPU. |
| cpu | `ceems_cpu_seconds_total` | `hostname`, `mode` | Number of seconds spent in each mode. |
| meminfo | `ceems_meminfo_MemTotal_bytes` | `hostname` | Total memory in the current host, as reported in `/proc/meminfo`. |
| meminfo | `ceems_meminfo_MemFree_bytes` | `hostname` | Total free memory in the current host, as reported in `/proc/meminfo`. |
| meminfo | `ceems_meminfo_MemAvailable_bytes` | `hostname` | Total available memory in the current host, as reported in `/proc/meminfo`. |
| ipmi | `ceems_ipmi_dcmi_current_watts` | `hostname` | Current power consumption reported by IPMI DCMI. |
| ipmi | `ceems_ipmi_dcmi_avg_watts` | `hostname` | Average power consumption reported by IPMI DCMI within the sampling period. |
| ipmi | `ceems_ipmi_dcmi_min_watts` | `hostname` | Minimum power consumption reported by IPMI DCMI within the sampling period. |
| ipmi | `ceems_ipmi_dcmi_max_watts` | `hostname` | Maximum power consumption reported by IPMI DCMI within the sampling period. |
| rapl | `ceems_rapl_package_joules_total` | `path`, `index` | Current RAPL package energy value. The labels `index` and `path` give details about the package. |
| rapl | `ceems_rapl_dram_joules_total` | `path`, `index` | Current RAPL DRAM energy value. The labels `index` and `path` give details about the package. |
| rapl | `ceems_rapl_core_joules_total` | `path`, `index` | Current RAPL core energy value. The labels `index` and `path` give details about the package. |
| slurm | `ceems_compute_unit_cpus` | `manager`, `uuid` | Number of CPUs allocated to the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_cpu_user_seconds_total` | `manager`, `uuid` | Number of CPU seconds in user space for the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_cpu_system_seconds_total` | `manager`, `uuid` | Number of CPU seconds in kernel space for the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_total_bytes` | `manager`, `uuid` | Total memory allocated to the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_used_bytes` | `manager`, `uuid` | Current total memory used by the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_rss_bytes` | `manager`, `uuid` | Current RSS memory used by the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_fail_count` | `manager`, `uuid` | Current number of memory limit hits by the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memsw_fail_count` | `manager`, `uuid` | Current number of memory + swap limit hits by the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_cache_bytes` | `manager`, `uuid` | Current cached memory of the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_cpu_psi_seconds` | `manager`, `uuid` | Current number of CPU PSI seconds of the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_memory_psi_seconds` | `manager`, `uuid` | Current number of memory PSI seconds of the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_rdma_hca_handles` | `manager`, `uuid` | Current number of allocated RDMA HCA handles for the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_rdma_hca_objects` | `manager`, `uuid` | Current number of allocated RDMA HCA objects for the compute unit identified by label `uuid`. |
| slurm | `ceems_compute_unit_gpu_index_flag` | `gpuuuid`, `index` | The GPU identified by label `index` or `gpuuuid` is allocated to the job identified by label `uuid`. |
| perf | `ceems_perf_cpucycles_total` | `manager`, `uuid` | Total number of CPU cycles for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_instructions_total` | `manager`, `uuid` | Total number of CPU instructions for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_branch_instructions_total` | `manager`, `uuid` | Total number of CPU branch instructions for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_branch_misses_total` | `manager`, `uuid` | Total number of CPU branch misses for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_cache_refs_total` | `manager`, `uuid` | Total number of cache references for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_cache_misses_total` | `manager`, `uuid` | Total number of cache misses for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_ref_cpucycles_total` | `manager`, `uuid` | Total number of reference CPU cycles for the compute unit identified by label `uuid`. Hardware event reported by the perf subsystem. |
| perf | `ceems_perf_page_faults_total` | `manager`, `uuid` | Total number of page faults for the compute unit identified by label `uuid`. Software event reported by the perf subsystem. |
| perf | `ceems_perf_context_switches_total` | `manager`, `uuid` | Total number of context switches for the compute unit identified by label `uuid`. Software event reported by the perf subsystem. |
| perf | `ceems_perf_cpu_migrations_total` | `manager`, `uuid` | Total number of CPU migrations for the compute unit identified by label `uuid`. Software event reported by the perf subsystem. |
| perf | `ceems_perf_minor_faults_total` | `manager`, `uuid` | Total number of minor page faults for the compute unit identified by label `uuid`. Software event reported by the perf subsystem. |
| perf | `ceems_perf_major_faults_total` | `manager`, `uuid` | Total number of major page faults for the compute unit identified by label `uuid`. Software event reported by the perf subsystem. |
| perf | `ceems_perf_cache_l1d_read_hits_total` | `manager`, `uuid` | Total number of L1 cache read hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_l1d_read_misses_total` | `manager`, `uuid` | Total number of L1 cache read misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_l1d_write_hits_total` | `manager`, `uuid` | Total number of L1 cache write hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_l1_instr_read_misses_total` | `manager`, `uuid` | Total number of L1 instruction read misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_tlb_instr_read_hits_total` | `manager`, `uuid` | Total number of TLB cache instruction read hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_tlb_instr_read_misses_total` | `manager`, `uuid` | Total number of TLB cache instruction read misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_ll_read_hits_total` | `manager`, `uuid` | Total number of LL cache read hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_ll_read_misses_total` | `manager`, `uuid` | Total number of LL cache read misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_ll_write_hits_total` | `manager`, `uuid` | Total number of LL cache write hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_ll_write_misses_total` | `manager`, `uuid` | Total number of LL cache write misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_bpu_read_hits_total` | `manager`, `uuid` | Total number of BPU cache read hits for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| perf | `ceems_perf_cache_bpu_read_misses_total` | `manager`, `uuid` | Total number of BPU cache read misses for the compute unit identified by label `uuid`. Hardware cache event reported by the perf subsystem. |
| ebpf | `ceems_ebpf_write_bytes_total` | `manager`, `uuid`, `mountpoint` | Total number of bytes written by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_write_requests_total` | `manager`, `uuid`, `mountpoint` | Total number of write requests by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_write_errors_total` | `manager`, `uuid`, `mountpoint` | Total number of write errors by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_read_bytes_total` | `manager`, `uuid`, `mountpoint` | Total number of bytes read by the compute unit identified by label `uuid` from different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_read_requests_total` | `manager`, `uuid`, `mountpoint` | Total number of read requests by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_read_errors_total` | `manager`, `uuid`, `mountpoint` | Total number of read errors by the compute unit identified by label `uuid` on different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_open_requests_total` | `manager`, `uuid` | Total number of open requests by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_open_errors_total` | `manager`, `uuid` | Total number of open request errors by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_create_requests_total` | `manager`, `uuid` | Total number of create requests by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_create_errors_total` | `manager`, `uuid` | Total number of create request errors by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_unlink_requests_total` | `manager`, `uuid` | Total number of unlink/remove requests by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_unlink_errors_total` | `manager`, `uuid` | Total number of unlink/remove request errors by the compute unit identified by label `uuid` to different mounts identified by `mountpoint`. |
| ebpf | `ceems_ebpf_ingress_packets_total` | `manager`, `uuid`, `proto`, `family` | Total number of ingress packets of protocol `proto` and family `family` for the compute unit identified by label `uuid`. |
| ebpf | `ceems_ebpf_ingress_bytes_total` | `manager`, `uuid`, `proto`, `family` | Total number of ingress bytes of protocol `proto` and family `family` for the compute unit identified by label `uuid`. |
| ebpf | `ceems_ebpf_egress_packets_total` | `manager`, `uuid`, `proto`, `family` | Total number of egress packets of protocol `proto` and family `family` for the compute unit identified by label `uuid`. |
| ebpf | `ceems_ebpf_egress_bytes_total` | `manager`, `uuid`, `proto`, `family` | Total number of egress bytes of protocol `proto` and family `family` for the compute unit identified by label `uuid`. |
| ebpf | `ceems_ebpf_retrans_packets_total` | `manager`, `uuid`, `proto`, `family` | Total number of retransmission packets of protocol `proto` and family `family` for the compute unit identified by label `uuid` (only for TCP). |
| ebpf | `ceems_ebpf_retrans_bytes_total` | `manager`, `uuid`, `proto`, `family` | Total number of retransmission bytes of protocol `proto` and family `family` for the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_port_constraint_errors_received_total` | `manager`, `device`, `port` | Total number of packets received on the switch physical port that are discarded (system-wide metric). |
| rdma | `ceems_rdma_port_constraint_errors_transmitted_total` | `manager`, `device`, `port` | Total number of packets not transmitted from the switch physical port (system-wide metric). |
| rdma | `ceems_rdma_port_data_received_bytes_total` | `manager`, `device`, `port` | Total number of data octets received on all links (system-wide metric). |
| rdma | `ceems_rdma_port_data_transmitted_bytes_total` | `manager`, `device`, `port` | Total number of data octets transmitted on all links (system-wide metric). |
| rdma | `ceems_rdma_port_discards_received_total` | `manager`, `device`, `port` | Total number of inbound packets discarded by the port because the port is down or congested (system-wide metric). |
| rdma | `ceems_rdma_port_discards_transmitted_total` | `manager`, `device`, `port` | Total number of outbound packets discarded by the port because the port is down or congested (system-wide metric). |
| rdma | `ceems_rdma_port_errors_received_total` | `manager`, `device`, `port` | Total number of packets containing an error that were received on this port (system-wide metric). |
| rdma | `ceems_rdma_port_packets_received_total` | `manager`, `device`, `port` | Total number of packets received on all VLs by this port, including errors (system-wide metric). |
| rdma | `ceems_rdma_port_packets_transmitted_total` | `manager`, `device`, `port` | Total number of packets transmitted on all VLs from this port, including errors. |
| rdma | `ceems_rdma_state_id` | `manager`, `device`, `port` | State of the InfiniBand port (0: no change, 1: down, 2: init, 3: armed, 4: active, 5: act defer). |
| rdma | `ceems_rdma_rx_write_requests` | `manager`, `uuid`, `device`, `port` | Total number of received write requests for the associated QPs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_rx_read_requests` | `manager`, `uuid`, `device`, `port` | Total number of received read requests for the associated QPs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_rx_atomic_requests` | `manager`, `uuid`, `device`, `port` | Total number of received atomic requests for the associated QPs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_req_cqe_error` | `manager`, `uuid`, `device`, `port` | Total number of times the requester detected CQEs completed with errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_req_cqe_flush_error` | `manager`, `uuid`, `device`, `port` | Total number of times the requester detected CQEs completed with flushed errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_req_remote_access_errors` | `manager`, `uuid`, `device`, `port` | Total number of times the requester detected remote access errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_req_remote_invalid_request` | `manager`, `uuid`, `device`, `port` | Total number of times the requester detected remote invalid request errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_resp_cqe_error` | `manager`, `uuid`, `device`, `port` | Total number of times the responder detected CQEs completed with errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_resp_cqe_flush_error` | `manager`, `uuid`, `device`, `port` | Total number of times the responder detected CQEs completed with flushed errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_resp_local_length_error` | `manager`, `uuid`, `device`, `port` | Total number of times the responder detected local length errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_resp_remote_access_errors` | `manager`, `uuid`, `device`, `port` | Total number of times the responder detected remote access errors for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_qps_active` | `manager`, `uuid`, `device`, `port` | Total number of active QPs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_cqs_active` | `manager`, `uuid`, `device`, `port` | Total number of active CQs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_mrs_active` | `manager`, `uuid`, `device`, `port` | Total number of active MRs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_cqe_len_active` | `manager`, `uuid`, `device`, `port` | Total length of active CQEs for device `device` and the compute unit identified by label `uuid`. |
| rdma | `ceems_rdma_mrs_len_active` | `manager`, `uuid`, `device`, `port` | Total length of active MRs for device `device` and the compute unit identified by label `uuid`. |
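
Many of the metrics listed above are cumulative counters (their names end in `_total`), so a single scraped value says little on its own; what is usually of interest is the increase between two scrapes. The following is a minimal sketch of that calculation, assuming two samples of `ceems_rapl_package_joules_total` taken at known times. The `Sample` type, the `average_rate` function, and the numeric values are hypothetical and serve only as an illustration; they are not part of the CEEMS exporter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scraped value of a counter metric (hypothetical helper type)."""
    timestamp: float  # time of the scrape, in seconds
    value: float      # counter value at that time


def average_rate(earlier: Sample, later: Sample) -> float:
    """Average per-second increase of a counter between two scrapes."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # The counter went backwards, so assume it was reset between the
        # scrapes and count only the increase observed since the reset.
        delta = later.value
    return delta / elapsed


# Illustrative values only, not real measurements: two scrapes, 30 s apart,
# of a joules counter such as ceems_rapl_package_joules_total.
first = Sample(timestamp=1000.0, value=5000.0)
second = Sample(timestamp=1030.0, value=8000.0)

# Joules per second is watts.
watts = average_rate(first, second)
print(watts)  # 100.0
```

When the metrics are stored in Prometheus, the same idea is normally expressed with the PromQL `rate()` function over a counter rather than computed by hand.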