CEEMS Exporter Metrics
The CEEMS exporter ships multiple collectors, some of which are enabled by default.
Enabled by default
The following collectors are enabled by default:
- cpu
- meminfo
- rapl
Disabled by default
All other collectors and sub-collectors are disabled by default. The collectors disabled by default are:
- ipmi_dcmi
- emissions
- slurm
- libvirt
Sub-collectors disabled by default are:
- ebpf.io-metrics
- ebpf.network-metrics
- perf.hardware-events
- perf.software-events
- perf.hardware-cache-events
- rdma.stats
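One way to check which collectors are actually producing metrics on a host is to group the metric names in the exporter's output by prefix. The sketch below assumes the output is in the Prometheus text exposition format and has already been fetched into a string. The `active_collectors` helper and the `sample` text are hypothetical, and the prefix-to-collector mapping is taken from the metrics table in this document.

```python
import re

# Metric-name prefixes mapped to collectors, as listed in the metrics table.
PREFIX_TO_COLLECTOR = {
    "ceems_cpu_": "cpu",
    "ceems_meminfo_": "meminfo",
    "ceems_rapl_": "rapl",
    "ceems_ipmi_dcmi_": "ipmi_dcmi",
    "ceems_redfish_": "redfish",
    "ceems_cray_pm_counters_": "cray_pm_counters",
    "ceems_compute_unit_": "slurm or libvirt",
    "ceems_perf_": "perf",
    "ceems_ebpf_": "ebpf",
    "ceems_rdma_": "rdma",
}

# A sample line starts with the metric name, e.g. name{label="value"} 1.0
METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def active_collectors(exposition_text):
    """Return the sorted list of collectors that have at least one sample."""
    found = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = METRIC_NAME.match(line)
        if not match:
            continue
        name = match.group(1)
        for prefix, collector in PREFIX_TO_COLLECTOR.items():
            if name.startswith(prefix):
                found.add(collector)
                break
    return sorted(found)


# Hypothetical scrape output, for illustration only.
sample = """\
# HELP ceems_cpu_count Number of CPUs in the current host
# TYPE ceems_cpu_count gauge
ceems_cpu_count{hostname="node-1"} 64
ceems_meminfo_MemTotal_bytes{hostname="node-1"} 2.7e+11
ceems_rapl_package_joules_total{index="0",path="pkg-0"} 12345.6
"""

print(active_collectors(sample))
```

With only the default collectors enabled, a host would be expected to report just the cpu, meminfo and rapl groups, as in this sample. A collector that is enabled but has nothing to report would not show up with this approach.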
Metrics list
The following is a list of metrics exposed by the CEEMS exporter along with the labels for each metric and its description. The first column shows the collector that the metric belongs to.
Collector | Name | Labels | Description |
---|---|---|---|
cpu | ceems_cpu_count | hostname | Number of CPUs in the current host |
cpu | ceems_cpu_per_core_count | hostname | Number of logical CPUs per physical CPU |
cpu | ceems_cpu_seconds_total | hostname, mode | Number of seconds spent in each mode |
meminfo | ceems_meminfo_MemTotal_bytes | hostname | Total memory in the current host, as reported in /proc/meminfo |
meminfo | ceems_meminfo_MemFree_bytes | hostname | Total free memory in the current host, as reported in /proc/meminfo |
meminfo | ceems_meminfo_MemAvailable_bytes | hostname | Total available memory in the current host, as reported in /proc/meminfo |
ipmi_dcmi | ceems_ipmi_dcmi_current_watts | hostname | Current power consumption reported by IPMI DCMI |
ipmi_dcmi | ceems_ipmi_dcmi_avg_watts | hostname | Average power consumption reported by IPMI DCMI within sampling period |
ipmi_dcmi | ceems_ipmi_dcmi_min_watts | hostname | Minimum power consumption reported by IPMI DCMI within sampling period |
ipmi_dcmi | ceems_ipmi_dcmi_max_watts | hostname | Maximum power consumption reported by IPMI DCMI within sampling period |
redfish | ceems_redfish_current_watts | hostname | Current power consumption reported by Redfish within sampling period |
redfish | ceems_redfish_avg_watts | hostname | Average power consumption reported by Redfish within sampling period |
redfish | ceems_redfish_min_watts | hostname | Minimum power consumption reported by Redfish within sampling period |
redfish | ceems_redfish_max_watts | hostname | Maximum power consumption reported by Redfish within sampling period |
cray_pm_counters | ceems_cray_pm_counters_energy_joules | hostname, domain | Current energy value in joules |
cray_pm_counters | ceems_cray_pm_counters_power_watts | hostname, domain | Current power value in watts |
cray_pm_counters | ceems_cray_pm_counters_power_limit_watts | hostname, domain | Current power limit value in watts |
cray_pm_counters | ceems_cray_pm_counters_temp_celsius | hostname, domain | Current temperature value in celsius |
rapl | ceems_rapl_package_joules_total | path, index | Current RAPL package energy value. Labels index and path give information about the package. |
rapl | ceems_rapl_dram_joules_total | path, index | Current RAPL DRAM energy value. Labels index and path give information about the package. |
rapl | ceems_rapl_core_joules_total | path, index | Current RAPL core energy value. Labels index and path give information about the package. |
rapl | ceems_rapl_package_power_limit_watts_total | path, index | Current RAPL power limit value. Labels index and path give information about the package. |
slurm, libvirt | ceems_compute_unit_cpus | manager, uuid | Number of CPUs allocated for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_user_seconds_total | manager, uuid | Number of CPU seconds in user space for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_system_seconds_total | manager, uuid | Number of CPU seconds in kernel space for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_total_bytes | manager, uuid | Total memory allocated for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_used_bytes | manager, uuid | Current total memory used by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_rss_bytes | manager, uuid | Current RSS memory used by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_fail_count | manager, uuid | Current number of memory limit hits by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memsw_fail_count | manager, uuid | Current number of memory + swap limit hits by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_cache_bytes | manager, uuid | Current cached memory by compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_cpu_psi_seconds | manager, uuid | Current number of CPU PSI seconds of compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_memory_psi_seconds | manager, uuid | Current number of memory PSI seconds of compute unit identified by label uuid. |
slurm | ceems_compute_unit_rdma_hca_handles | manager, uuid | Current number of allocated RDMA HCA handles for compute unit identified by label uuid. |
slurm | ceems_compute_unit_rdma_hca_objects | manager, uuid | Current number of allocated RDMA HCA objects for compute unit identified by label uuid. |
slurm, libvirt | ceems_compute_unit_gpu_index_flag | manager, gpuuuid, index | GPU identified by label index or gpuuuid is allocated to job identified by label uuid. |
libvirt | ceems_compute_unit_blkio_read_total_bytes | manager, device | Total block IO bytes read by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_write_total_bytes | manager, device | Total block IO bytes written by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_read_total_requests | manager, device | Total block IO read requests by instance identified by label uuid. |
libvirt | ceems_compute_unit_blkio_write_total_requests | manager, device | Total block IO write requests by instance identified by label uuid. |
perf | ceems_perf_cpucycles_total | manager, uuid | Total number of CPU cycles for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_instructions_total | manager, uuid | Total number of CPU instructions for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_branch_instructions_total | manager, uuid | Total number of CPU branch instructions for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_branch_misses_total | manager, uuid | Total number of CPU branch misses for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_cache_refs_total | manager, uuid | Total number of cache references for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_cache_misses_total | manager, uuid | Total number of cache misses for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_ref_cpucycles_total | manager, uuid | Total number of reference CPU cycles for compute unit identified by label uuid. Hardware event reported by perf subsystem. |
perf | ceems_perf_page_faults_total | manager, uuid | Total number of page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_context_switches_total | manager, uuid | Total number of context switches for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_cpu_migrations_total | manager, uuid | Total number of CPU migrations for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_minor_faults_total | manager, uuid | Total number of minor page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_major_faults_total | manager, uuid | Total number of major page faults for compute unit identified by label uuid. Software event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_read_hits_total | manager, uuid | Total number of L1 data cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_read_misses_total | manager, uuid | Total number of L1 data cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1d_write_hits_total | manager, uuid | Total number of L1 data cache write hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_l1_instr_read_misses_total | manager, uuid | Total number of L1 instruction read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_tlb_instr_read_hits_total | manager, uuid | Total number of TLB cache instruction read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_tlb_instr_read_misses_total | manager, uuid | Total number of TLB cache instruction read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_read_hits_total | manager, uuid | Total number of LL cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_read_misses_total | manager, uuid | Total number of LL cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_write_hits_total | manager, uuid | Total number of LL cache write hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_ll_write_misses_total | manager, uuid | Total number of LL cache write misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_bpu_read_hits_total | manager, uuid | Total number of BPU cache read hits for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
perf | ceems_perf_cache_bpu_read_misses_total | manager, uuid | Total number of BPU cache read misses for compute unit identified by label uuid. Hardware cache event reported by perf subsystem. |
ebpf | ceems_ebpf_write_bytes_total | manager, uuid, mountpoint | Total number of bytes written by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_write_requests_total | manager, uuid, mountpoint | Total number of write requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_write_errors_total | manager, uuid, mountpoint | Total number of write errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_bytes_total | manager, uuid, mountpoint | Total number of bytes read by compute unit identified by label uuid from different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_requests_total | manager, uuid, mountpoint | Total number of read requests by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_read_errors_total | manager, uuid, mountpoint | Total number of read errors by compute unit identified by label uuid to different mounts identified by mountpoint. |
ebpf | ceems_ebpf_open_requests_total | manager, uuid | Total number of open requests by compute unit identified by label uuid. |
ebpf | ceems_ebpf_open_errors_total | manager, uuid | Total number of open request errors by compute unit identified by label uuid. |
ebpf | ceems_ebpf_create_requests_total | manager, uuid | Total number of create requests by compute unit identified by label uuid. |
ebpf | ceems_ebpf_create_errors_total | manager, uuid | Total number of create request errors by compute unit identified by label uuid. |
ebpf | ceems_ebpf_unlink_requests_total | manager, uuid | Total number of unlink/remove requests by compute unit identified by label uuid. |
ebpf | ceems_ebpf_unlink_errors_total | manager, uuid | Total number of unlink/remove request errors by compute unit identified by label uuid. |
ebpf | ceems_ebpf_ingress_packets_total | manager, uuid, proto, family | Total number of ingress packets of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_ingress_bytes_total | manager, uuid, proto, family | Total number of ingress bytes of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_egress_packets_total | manager, uuid, proto, family | Total number of egress packets of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_egress_bytes_total | manager, uuid, proto, family | Total number of egress bytes of protocol proto and family family by compute unit identified by label uuid. |
ebpf | ceems_ebpf_retrans_packets_total | manager, uuid, proto, family | Total number of retransmission packets of protocol proto and family family by compute unit identified by label uuid (only for TCP). |
ebpf | ceems_ebpf_retrans_bytes_total | manager, uuid, proto, family | Total number of retransmission bytes of protocol proto and family family by compute unit identified by label uuid. |
rdma | ceems_rdma_port_constraint_errors_received_total | manager, device, port | Total number of packets received on the switch physical port that are discarded (system-wide metric). |
rdma | ceems_rdma_port_constraint_errors_transmitted_total | manager, device, port | Total number of packets not transmitted from the switch physical port (system-wide metric). |
rdma | ceems_rdma_port_data_received_bytes_total | manager, device, port | Total number of data octets received on all links (system-wide metric). |
rdma | ceems_rdma_port_data_transmitted_bytes_total | manager, device, port | Total number of data octets transmitted on all links (system-wide metric). |
rdma | ceems_rdma_port_discards_received_total | manager, device, port | Total number of inbound packets discarded by the port because the port is down or congested (system-wide metric). |
rdma | ceems_rdma_port_discards_transmitted_total | manager, device, port | Total number of outbound packets discarded by the port because the port is down or congested (system-wide metric). |
rdma | ceems_rdma_port_errors_received_total | manager, device, port | Total number of packets containing an error that were received on this port (system-wide metric). |
rdma | ceems_rdma_port_packets_received_total | manager, device, port | Total number of packets received on all VLs by this port (including errors) (system-wide metric). |
rdma | ceems_rdma_port_packets_transmitted_total | manager, device, port | Total number of packets transmitted on all VLs from this port (including errors). |
rdma | ceems_rdma_state_id | manager, device, port | State of the InfiniBand port (0: no change, 1: down, 2: init, 3: armed, 4: active, 5: act defer). |
rdma | ceems_rdma_rx_write_requests | manager, uuid, device, port | Total number of received write requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_rx_read_requests | manager, uuid, device, port | Total number of received read requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_rx_atomic_requests | manager, uuid, device, port | Total number of received atomic requests for the associated QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_cqe_error | manager, uuid, device, port | Total number of times requester detected CQEs completed with errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_cqe_flush_error | manager, uuid, device, port | Total number of times requester detected CQEs completed with flushed errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_remote_access_errors | manager, uuid, device, port | Total number of times requester detected remote access errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_req_remote_invalid_request | manager, uuid, device, port | Total number of times requester detected remote invalid request errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_cqe_error | manager, uuid, device, port | Total number of times responder detected CQEs completed with errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_cqe_flush_error | manager, uuid, device, port | Total number of times responder detected CQEs completed with flushed errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_local_length_error | manager, uuid, device, port | Total number of times responder detected local length errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_resp_remote_access_errors | manager, uuid, device, port | Total number of times responder detected remote access errors for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_qps_active | manager, uuid, device, port | Total number of active QPs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_cqs_active | manager, uuid, device, port | Total number of active CQs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_mrs_active | manager, uuid, device, port | Total number of active MRs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_cqe_len_active | manager, uuid, device, port | Total length of active CQEs for device device and compute unit identified by label uuid. |
rdma | ceems_rdma_mrs_len_active | manager, uuid, device, port | Total length of active MRs for device device and compute unit identified by label uuid. |
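Many of the metrics above are running totals, so the useful quantity is usually the change between two scrapes. The sketch below shows that arithmetic for two cases: average package power from ceems_rapl_package_joules_total, and instructions per cycle from ceems_perf_instructions_total and ceems_perf_cpucycles_total. It assumes the metrics with a _total suffix behave as monotonically increasing counters and it ignores counter resets. The `Sample` type, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scraped counter value (hypothetical container)."""
    timestamp: float  # seconds
    value: float      # counter value at that time


def average_rate(first: Sample, second: Sample) -> float:
    """Average per-second increase of a counter between two samples.

    For a joules counter this is the average power in watts.
    """
    elapsed = second.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return (second.value - first.value) / elapsed


def instructions_per_cycle(instructions_delta: float, cycles_delta: float) -> float:
    """Ratio of the increases of the instruction and cycle counters."""
    if cycles_delta <= 0:
        return float("nan")  # no cycles observed in the interval
    return instructions_delta / cycles_delta


# Hypothetical values of ceems_rapl_package_joules_total, 30 seconds apart.
joules_start = Sample(timestamp=0.0, value=1000.0)
joules_end = Sample(timestamp=30.0, value=2500.0)
print(average_rate(joules_start, joules_end))  # 1500 J over 30 s

# Hypothetical increases of the perf counters for one compute unit.
print(instructions_per_cycle(3.0e9, 2.0e9))
```

In a Prometheus setup the same calculation would normally be done on the server side with a rate over the counter rather than in client code; the Python version is only meant to make the arithmetic explicit.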