CEEMS Exporter
Background
ceems_exporter is the Prometheus exporter that exposes individual compute unit metrics, RAPL energy, IPMI power consumption, emission factors, GPU-to-compute-unit mappings, performance metrics, and IO and network metrics. In addition, the exporter supports an HTTP discovery component that can provide a list of targets to Grafana Alloy.
ceems_exporter collectors can be categorized as follows:
Resource manager collectors
These collectors export metrics from different resource managers.
- Slurm collector: Exports SLURM job metrics like CPU, memory, and GPU index to job ID mappings
- Libvirt collector: Exports metrics of libvirt-managed VMs like CPU, memory, IO, etc.
Energy-Related Collectors
These collectors export energy-related metrics from different sources on the compute node.
- IPMI collector: Exports power usage reported by ipmi tools
- Redfish collector: Exports power usage reported by the Redfish API
- Cray PM counter collector: Exports power usage reported by Cray's PM counters
- HWMon collector: Exports power and energy values reported by HWMON
- RAPL collector: Exports RAPL energy metrics
Emissions-Related Collectors
This collector exports emissions-related metrics that are used in estimating the carbon footprint:
- Emissions collector: Exports emission factors (g eCO2/kWh)
Node Metrics Collectors
These collectors export node-level metrics:
- CPU collector: Exports CPU time in different modes (at node level)
- Meminfo collector: Exports memory-related statistics (at node level)
Perf related collectors
In addition to the above-stated collectors, there are common "sub-collectors" that can be reused with different collectors. These sub-collectors provide auxiliary metrics like IO, networking, performance, etc. Currently available sub-collectors are:
- Perf sub-collector: Exports hardware, software, and cache performance metrics
- eBPF sub-collector: Exports IO and network-related metrics
- RDMA sub-collector: Exports selected RDMA stats
These sub-collectors are not meant to work alone and can only be enabled when a main collector that monitors the resource manager's compute units is activated.
Sub-collectors
Perf sub-collector
The Perf sub-collector exports performance-related metrics fetched from Linux's perf subsystem. It currently supports hardware, software, and hardware cache events. More advanced details on perf events can be found in Brendan Gregg's blogs. The currently supported events are listed below, followed by a minimal usage sketch:
Hardware Events
- Total cycles
- Retired instructions
- Cache accesses. Usually this indicates Last Level Cache accesses but this may vary depending on your CPU
- Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the PERF_COUNT_HW_CACHE_REFERENCES event to calculate cache miss rates
- Retired branch instructions
- Mis-predicted branch instructions
Software Events
- Number of page faults
- Number of context switches
- Number of CPU migrations
- Number of minor page faults (these do not require disk I/O to handle)
- Number of major page faults (these require disk I/O to handle)
Hardware Cache Events
- Number of L1 data cache read hits
- Number of L1 data cache read misses
- Number of L1 data cache write hits
- Number of L1 instruction cache read misses
- Number of instruction TLB read hits
- Number of instruction TLB read misses
- Number of last level read hits
- Number of last level read misses
- Number of last level write hits
- Number of last level write misses
- Number of Branch Prediction Unit (BPU) read hits
- Number of Branch Prediction Unit (BPU) read misses
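For illustration only (this is not the exporter's implementation), a counter for one of the hardware events above can be opened with the perf_event_open(2) system call. A minimal sketch in Go, assuming the golang.org/x/sys/unix package and a kernel that allows perf access:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// Describe the event: a hardware counter for retired instructions.
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_INSTRUCTIONS,
	}
	attr.Size = uint32(unsafe.Sizeof(attr))

	// Open the counter for the calling process (pid 0) on any CPU (-1).
	fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// Do some work to be measured.
	sum := 0
	for i := 0; i < 1_000_000; i++ {
		sum += i
	}
	_ = sum

	// A plain read returns the 64-bit counter value in native byte order.
	buf := make([]byte, 8)
	if _, err := unix.Read(fd, buf); err != nil {
		panic(err)
	}
	fmt.Println("instructions retired:", binary.LittleEndian.Uint64(buf))
}
```

A real collector opens such counters per cgroup (using the cgroup fd instead of a pid) and keeps them open across scrapes.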
eBPF sub-collector
The eBPF sub-collector uses eBPF to monitor network and IO statistics. More details on eBPF are outside the scope of the current documentation. This sub-collector loads various BPF programs that trace several kernel functions that are relevant to network and IO.
IO Metrics
The core concept for gathering IO metrics is based on the Linux kernel Virtual File System layer. From the documentation, VFS can be defined as:
The Virtual File System (also known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem implementations to coexist.
Thus, all IO activity must go through the VFS layer. By tracing appropriate functions, we can monitor IO metrics. At the same time, these VFS kernel functions have process context readily available, so it is possible to attribute each IO operation to a given cgroup. By leveraging these two ideas, it is possible to gather IO metrics for each cgroup. The following functions are traced in this sub-collector:
vfs_read
vfs_write
vfs_open
vfs_create
vfs_mkdir
vfs_unlink
vfs_rmdir
All the above kernel functions are exported and have a fairly stable API. By tracing these functions, we can monitor:
- Number of read bytes
- Number of write bytes
- Number of read requests
- Number of write requests
- Number of read errors
- Number of write errors
- Number of open requests
- Number of open errors
- Number of create requests
- Number of create errors
- Number of unlink requests
- Number of unlink errors
Read and write statistics are aggregated based on mount points. Most production workloads use high-performance network file systems that are mounted on compute nodes at specific mount points. Different file systems may offer different QoS and IOPS capabilities, and hence it is beneficial to expose the IO stats on a per-mount-point basis instead of aggregating statistics from different types of file systems. It is possible to configure the CEEMS exporter with a list of mount points to monitor at runtime.
The rest of the metrics are aggregated globally due to the complexity of retrieving mount point information from kernel function arguments.
Total aggregate statistics should be very accurate for each cgroup. However, if the underlying file system uses async IO, the IO rate statistics might not reflect the true rate, as the kernel functions return immediately after submitting the IO task to the driver of the underlying file system. In the case of sync IO, the kernel function blocks until the IO operation has finished, and thus we get accurate rate statistics.
The IO data path is highly complex, with a lot of caching involved in several file system drivers. The statistics reported by these BPF programs are the ones "observed" by the user's workloads rather than from the file system's perspective. The advantage of this approach is that we can use these BPF programs to monitor different types of file systems in a unified manner without having to support each file system separately.
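To make the mechanism concrete, the sketch below shows how a kprobe program could be attached to vfs_read and its per-cgroup counters read from user space with the cilium/ebpf Go library. The object file name, program name, and map name are hypothetical; CEEMS ships its own BPF programs and loader:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Hypothetical pre-compiled BPF object containing a kprobe program
	// ("trace_vfs_read") and a hash map ("read_bytes") keyed by cgroup ID.
	spec, err := ebpf.LoadCollectionSpec("vfs_bpfel.o")
	if err != nil {
		log.Fatal(err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach the program to the vfs_read kernel function.
	kp, err := link.Kprobe("vfs_read", coll.Programs["trace_vfs_read"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	// Read the per-cgroup byte counters accumulated by the BPF program.
	// A real collector would do this on every scrape.
	var cgroupID, bytes uint64
	it := coll.Maps["read_bytes"].Iterate()
	for it.Next(&cgroupID, &bytes) {
		fmt.Printf("cgroup %d: %d bytes read\n", cgroupID, bytes)
	}
}
```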
Network Metrics
The eBPF sub-collector traces kernel functions that monitor the following types of network events:
- TCP with IPv4 and IPv6
- UDP with IPv4 and IPv6
Most of the production workloads use TCP/UDP for communication and hence, only these two protocols are supported. This is done by tracing the following kernel functions:
tcp_sendmsg
tcp_sendpage (for kernels < 6.5)
tcp_recvmsg
udp_sendmsg
udp_sendpage (for kernels < 6.5)
udp_recvmsg
udpv6_sendmsg
udpv6_recvmsg
The following metrics are provided by tracing the above functions. All metrics are provided per protocol (TCP/UDP) and per IP family (IPv4/IPv6).
- Number of egress bytes
- Number of egress packets
- Number of ingress bytes
- Number of ingress packets
- Number of retransmission bytes (only for TCP)
- Number of retransmission packets (only for TCP)
RDMA sub-collector
Data transfer in RDMA happens directly between the RDMA NIC and remote machine memory, bypassing the CPU. Thus, it is hard to trace RDMA data transfers at compute unit granularity. However, system-wide data transfer metrics are readily available in the /sys/class/infiniband pseudo-filesystem. This sub-collector therefore exports important system-wide RDMA stats along with a few low-level metrics at the compute unit level.
System wide RDMA stats
- Number of data octets received on all links
- Number of data octets transmitted on all links
- Number of packets received on all VLs by this port (including errors)
- Number of packets transmitted on all VLs from this port (including errors)
- Number of packets received on the switch physical port that are discarded
- Number of packets not transmitted from the switch physical port
- Number of inbound packets discarded by the port because the port is down or congested
- Number of outbound packets discarded by the port because the port is down or congested
- Number of packets containing an error that were received on this port
- State of the InfiniBand port
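For illustration, the system-wide counters listed above are read from files under /sys/class/infiniband/&lt;device&gt;/ports/&lt;port&gt;/counters. A minimal Go sketch (not the exporter's actual code) that reads the data counters:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCounter reads a single numeric counter file, e.g. port_rcv_data.
func readCounter(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Iterate over all InfiniBand devices and ports exposed in sysfs.
	ports, _ := filepath.Glob("/sys/class/infiniband/*/ports/*")
	for _, port := range ports {
		// port_rcv_data and port_xmit_data are reported in units of
		// 4 octets (32-bit words) per the InfiniBand counter definitions.
		rcv, err := readCounter(filepath.Join(port, "counters", "port_rcv_data"))
		if err != nil {
			continue
		}
		xmit, _ := readCounter(filepath.Join(port, "counters", "port_xmit_data"))
		fmt.Printf("%s: rcv=%d octets, xmit=%d octets\n", port, rcv*4, xmit*4)
	}
}
```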
Per compute unit RDMA stats
- Number of active Queue Pairs (QPs)
- Number of active Completion Queues (CQs)
- Number of active Memory Regions (MRs)
- Length of active CQs
- Length of active MRs
In the case of Mellanox devices, the following metrics are available for each compute unit:
- Number of received write requests for the associated QPs
- Number of received read requests for the associated QPs
- Number of received atomic requests for the associated QPs
- Number of times requester detected CQEs completed with errors
- Number of times requester detected CQEs completed with flushed errors
- Number of times requester detected remote access errors
- Number of times requester detected remote invalid request errors
- Number of times responder detected CQEs completed with errors
- Number of times responder detected CQEs completed with flushed errors
- Number of times responder detected local length errors
- Number of times responder detected remote access errors
In order to interpret these metrics, please take a look at this blog post, which explains the internals of RDMA very well.
Collectors
Slurm collector
The Slurm collector exports job-related metrics like usage of CPU, DRAM, RDMA, etc. This is done by walking through the cgroups created by the SLURM daemon on the compute node on every scrape request. As walking through the cgroups pseudo-filesystem is very cheap, this has zero to negligible impact on the actual job. The exporter has been heavily inspired by cgroups_exporter and it supports both cgroups v1 and v2.
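To make "walking the cgroups" concrete, the sketch below reads two cgroups v2 accounting files for a single job cgroup. The path shown is hypothetical; the actual layout depends on the SLURM version and its cgroup configuration:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical cgroups v2 path of a SLURM job; the real layout depends
	// on the SLURM version and its cgroup plugin configuration.
	job := "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345"

	// memory.current holds the cgroup's current memory usage in bytes.
	mem, _ := os.ReadFile(job + "/memory.current")
	fmt.Println("memory bytes:", strings.TrimSpace(string(mem)))

	// cpu.stat contains usage_usec, user_usec and system_usec counters.
	stat, _ := os.ReadFile(job + "/cpu.stat")
	for _, line := range strings.Split(string(stat), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && strings.HasSuffix(fields[0], "_usec") {
			v, _ := strconv.ParseUint(fields[1], 10, 64)
			fmt.Printf("%s: %.3f s\n", fields[0], float64(v)/1e6)
		}
	}
}
```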
For the SLURM collector to work properly, SLURM needs to be configured to use all the available cgroups controllers. At least the cpu and memory controllers must be enabled; otherwise, the cgroups will not contain any accounting information. Without cpu and memory accounting information, it is not possible to estimate the energy consumption of the job.
More details on how to configure SLURM to get accounting information from cgroups can be found in the Configuration section.
For jobs with GPUs, we must have the GPU ordinals allocated to each job so that we can match GPU metrics scraped by either dcgm-exporter or amd-smi-exporter to jobs. Unfortunately, this information is not available after the job has finished, and hence the CEEMS exporter exports a metric that maps each job ID to its GPU ordinals.
Currently, the job-related metrics exported by the Slurm collector are as follows:
- Job current CPU time in user and system mode
- Job CPUs limit (Number of CPUs allocated to the job)
- Job current total memory usage
- Job total memory limit (Memory allocated to the job)
- Job current RSS memory usage
- Job current cache memory usage
- Job current number of memory usage hits limits
- Job current memory and swap usage
- Job current memory and swap usage hits limits
- Job total memory and swap limit
- Job CPU and memory pressures
- Job maximum RDMA HCA handles
- Job maximum RDMA HCA objects
- Job to GPU ordinal mapping (when GPUs found on the compute node)
- Current number of jobs on the compute node
More information on the metrics can be found in kernel documentation of cgroups v1 and cgroups v2.
Slurm collector supports perf and eBPF sub-collectors. Hence, in addition to the above stated metrics, all the metrics available in the sub-collectors can also be reported for each cgroup.
Libvirt collector
Similar to the Slurm collector, the libvirt collector exports metrics of VMs managed by libvirt. This collector is useful for monitoring OpenStack clusters, where Nova uses libvirt to manage the lifecycle of VMs. The exported metrics include usage of CPU, DRAM, and block IO retrieved from cgroups. The collector supports both cgroups v1 and v2.
When GPUs are present on the compute node, as in the case of Slurm, we need information on which GPU is used by which VM. This information can be obtained from libvirt's XML file that keeps the state of the VM.
- NVIDIA's MIG instances use a similar approach to vGPU to expose GPUs inside guests and hence, similar limitations apply.
Thus, it is currently not possible to reliably monitor the energy and usage metrics of libvirt instances with GPUs. In any case, the exporter will always export the GPU UUID to instance UUID mapping to keep track of which instance is using which GPU. If the above-stated limitations are addressed upstream, CEEMS will allow us to track usage metrics of GPU instances as well.
Currently, the metrics exported by the Libvirt collector are as follows:
- Instance current CPU time in user and system mode
- Instance CPUs limit (Number of CPUs allocated to the instance)
- Instance current total memory usage
- Instance total memory limit (Memory allocated to the instance)
- Instance current RSS memory usage
- Instance current cache memory usage
- Instance current number of memory usage hits limits
- Instance current memory and swap usage
- Instance current memory and swap usage hits limits
- Instance total memory and swap limit
- Instance block IO read and write bytes
- Instance block IO read and write requests
- Instance CPU, memory and IO pressures
- Instance to GPU ordinal mapping (when GPUs found on the compute node)
- Current number of instances on the compute node
Similar to Slurm, libvirt exporter supports perf and eBPF sub-collectors.
Libvirt has no information about the processes running inside the guest, and hence it is not possible to profile individual processes inside the guest. Therefore, the metrics exported by perf are for the entire VM, and it is not possible to have fine-grained control over which processes inside the guest are profiled.
IPMI collector
The IPMI collector reports the node's current power usage as reported by the IPMI DCMI command specification. Generally, IPMI DCMI is available on all types of nodes and from all manufacturers, as it is needed for BMC control. There are several IPMI implementations available, like FreeIPMI, OpenIPMI, IPMIUtil, etc. As the IPMI DCMI specification is standardized, different implementations must report the same power usage value for the node.
Currently, the metrics exposed by IPMI collector are:
- Current power consumption
- Minimum power consumption in the sampling period
- Maximum power consumption in the sampling period
- Average power consumption in the sampling period
The exporter is capable of auto-detecting the IPMI implementation and uses the one that is found.
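For illustration, the sketch below shows how the DCMI power reading could be obtained with FreeIPMI's ipmi-dcmi command; the output format in the comment is indicative, and the exporter's own detection and parsing logic differs:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

func main() {
	// FreeIPMI example; other implementations (OpenIPMI, IPMIUtil) expose
	// the same DCMI power reading through their own commands.
	out, err := exec.Command("ipmi-dcmi", "--get-system-power-statistics").Output()
	if err != nil {
		panic(err)
	}

	// Typical output contains lines such as:
	//   Current Power : 248 Watts
	re := regexp.MustCompile(`Current Power\s*:\s*(\d+)\s*[Ww]atts`)
	if m := re.FindSubmatch(out); m != nil {
		fmt.Printf("current power: %s W\n", m[1])
	}
}
```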
Redfish collector
The Redfish collector reports the node's current power usage as reported by the Redfish Chassis Power specification. Redfish is a newer server management protocol that succeeds IPMI. If IPMI DCMI is not available (or the vendor chose to disable it in favor of Redfish), this collector can be used to fetch the total power consumption of the server.
Redfish reports the power consumption stats for each chassis, and the collector exports
power readings for all the different types of chassis using the chassis
label. For each
chassis, the metrics exposed by the Redfish collector are:
- Current power consumption
- Minimum power consumption in the sampling period
- Maximum power consumption in the sampling period
- Average power consumption in the sampling period
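For illustration, a chassis power reading can be fetched from the standard Redfish Power resource. A minimal Go sketch, where the BMC address, credentials, and chassis ID are placeholders (TLS verification is disabled only for brevity):

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

// powerResource is a minimal subset of the Redfish Power resource schema.
type powerResource struct {
	PowerControl []struct {
		PowerConsumedWatts float64 `json:"PowerConsumedWatts"`
		PowerMetrics       struct {
			MinConsumedWatts     float64 `json:"MinConsumedWatts"`
			MaxConsumedWatts     float64 `json:"MaxConsumedWatts"`
			AverageConsumedWatts float64 `json:"AverageConsumedWatts"`
		} `json:"PowerMetrics"`
	} `json:"PowerControl"`
}

func main() {
	// Placeholder BMC endpoint and chassis ID.
	url := "https://bmc.example.com/redfish/v1/Chassis/1/Power"

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // for brevity only
	}}
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	req.SetBasicAuth("user", "password") // placeholder credentials

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var p powerResource
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
		panic(err)
	}
	for _, pc := range p.PowerControl {
		fmt.Printf("current=%.0f W min=%.0f W max=%.0f W avg=%.0f W\n",
			pc.PowerConsumedWatts,
			pc.PowerMetrics.MinConsumedWatts,
			pc.PowerMetrics.MaxConsumedWatts,
			pc.PowerMetrics.AverageConsumedWatts)
	}
}
```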
HWMon collector
The HWMon collector reports the power and energy consumption of hardware components
when available. Metrics are read from the /sys/class/hwmon
directory. Each metric has
a chip name to indicate what component is being monitored and a sensor name if there are
multiple sensors monitoring the component. The metrics exported by the HWMon collector are:
- Current power consumption
- Minimum power consumption in the sampling period
- Maximum power consumption in the sampling period
- Average power consumption in the sampling period
- Current energy usage
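For illustration, these values are exposed under /sys/class/hwmon, where each hwmon directory has a name file identifying the chip and sensor files such as power1_input (microwatts) and energy1_input (microjoules). A minimal sketch (not the exporter's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readValue reads a single numeric sysfs attribute, returning ok=false
// if the file does not exist or cannot be parsed.
func readValue(path string) (uint64, bool) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, false
	}
	v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	return v, err == nil
}

func main() {
	chips, _ := filepath.Glob("/sys/class/hwmon/hwmon*")
	for _, chip := range chips {
		name, _ := os.ReadFile(filepath.Join(chip, "name"))

		// power*_input is in microwatts, energy*_input in microjoules.
		if p, ok := readValue(filepath.Join(chip, "power1_input")); ok {
			fmt.Printf("%s: power %.2f W\n", strings.TrimSpace(string(name)), float64(p)/1e6)
		}
		if e, ok := readValue(filepath.Join(chip, "energy1_input")); ok {
			fmt.Printf("%s: energy %.2f J\n", strings.TrimSpace(string(name)), float64(e)/1e6)
		}
	}
}
```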
Cray's PM Counters Collector
Cray's PM counters collector reports the power consumption of CPU, DRAM, and accelerators like GPUs (when available) using Cray's internal in-band measurements.
The list of metrics exported by Cray's PM counters is:
- Node-level energy, power, and power limit measurements
- CPU and memory energy, power, and power limit measurements
- Accelerators' energy, power, and power limit measurements (when available)
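On Cray systems these counters are typically exposed as plain-text files under /sys/cray/pm_counters (for example power, energy, cpu_power, memory_power, accel_power). A rough sketch of reading them, assuming that layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Typical pm_counters files; availability varies with the node type.
	for _, name := range []string{"power", "energy", "cpu_power", "memory_power", "accel_power"} {
		b, err := os.ReadFile(filepath.Join("/sys/cray/pm_counters", name))
		if err != nil {
			continue // counter not present on this node
		}
		// Each file usually holds a value followed by its unit, e.g. "245 W".
		fmt.Printf("%s: %s\n", name, strings.TrimSpace(string(b)))
	}
}
```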
RAPL collector
The RAPL collector reports the power consumption of CPU and DRAM (when available) using the Running Average Power Limit (RAPL) framework. The exporter uses powercap to fetch the energy counters.
The metrics exported by the RAPL collector are:
- RAPL package counters
- RAPL DRAM counters (when available)
- RAPL package power limits (when available)
If the CPU architecture supports more RAPL domains other than CPU and DRAM, they will be exported as well.
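For illustration, the powercap interface exposes one directory per RAPL domain under /sys/class/powercap, each containing a name file and a cumulative energy_uj counter in microjoules. A minimal sketch (not the exporter's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	// Match top-level packages and their sub-domains (e.g. dram, core).
	domains, _ := filepath.Glob("/sys/class/powercap/intel-rapl*")
	for _, d := range domains {
		name, err := os.ReadFile(filepath.Join(d, "name"))
		if err != nil {
			continue
		}
		// energy_uj is a cumulative counter in microjoules that wraps
		// around at max_energy_range_uj.
		raw, err := os.ReadFile(filepath.Join(d, "energy_uj"))
		if err != nil {
			continue // may require elevated privileges to read
		}
		uj, _ := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		fmt.Printf("%s (%s): %.3f J\n", filepath.Base(d),
			strings.TrimSpace(string(name)), float64(uj)/1e6)
	}
}
```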
Emissions collector
The Emissions collector exports emission factors from different sources. Depending on the source, these factors can be static or dynamic, i.e., varying in time. Currently, the different sources supported by the exporter are:
- Electricity Maps, which is capable of providing real-time emission factors for different countries.
- Watt Time, which provides real-time emission factors for different regions.
- RTE eCO2 Mix, which provides real-time emission factors for France only.
- OWID, which provides static emission factors for different countries based on historical data.
- A world average value based on the available data for world countries.
The exporter will provide emission factors of all available countries from different sources.
CPU, meminfo, netdev and infiniband collectors
These collectors export node-level metrics. The CPU collector exports CPU time in different modes by parsing the /proc/stat file. Similarly, the meminfo collector exports memory usage statistics by parsing the /proc/meminfo file. The netdev collector exports network metrics from different network devices. Finally, the infiniband collector exports InfiniBand metrics from different IB devices. These collectors are heavily inspired by node_exporter.
These metrics are mainly used to estimate the proportion of CPU and memory usage by individual compute units and to estimate the energy consumption of compute units based on these proportions.
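For illustration, the sketch below parses the aggregate cpu line of /proc/stat; the values are cumulative CPU times per mode in USER_HZ ticks (typically 100 per second):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		// The first line aggregates all CPUs:
		//   cpu  user nice system idle iowait irq softirq steal guest guest_nice
		if strings.HasPrefix(line, "cpu ") {
			fields := strings.Fields(line)
			modes := []string{"user", "nice", "system", "idle", "iowait",
				"irq", "softirq", "steal", "guest", "guest_nice"}
			for i, mode := range modes {
				if i+1 < len(fields) {
					fmt.Printf("%s=%s ticks\n", mode, fields[i+1])
				}
			}
			break
		}
	}
}
```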
Grafana Alloy target discovery
Grafana Alloy provides an eBPF-based continuous profiling component. It needs a list of targets (processes, in the current case), and those targets must be labeled with a unique identifier of the compute unit. For instance, for a given compute unit (like a batch job for SLURM), there can be multiple processes in the job, and we need to provide a list of all these processes' PIDs labeled with the ID of that compute unit to Grafana Alloy. The CEEMS exporter can provide a list of these processes correctly labeled with the compute unit identifier, and eventually these profiles will be aggregated by compute unit identifier on the Pyroscope server.
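For illustration, such a discovery endpoint returns targets in the Prometheus HTTP service discovery format: a JSON array of target groups, each with a targets list and a labels map. The sketch below only shows the shape of the payload; the label names and values are illustrative and not the exporter's exact label set:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors the Prometheus HTTP SD payload: a list of targets
// sharing a common set of labels.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

func main() {
	// Hypothetical example: two processes belonging to SLURM job 12345,
	// each labelled with its PID and the compute unit (job) identifier.
	groups := []targetGroup{
		{
			Targets: []string{"localhost:9010"},
			Labels: map[string]string{
				"__process_pid__": "4321",  // illustrative label name
				"service_name":    "12345", // compute unit identifier
			},
		},
		{
			Targets: []string{"localhost:9010"},
			Labels: map[string]string{
				"__process_pid__": "4322",
				"service_name":    "12345",
			},
		},
	}
	out, _ := json.MarshalIndent(groups, "", "  ")
	fmt.Println(string(out))
}
```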
Metrics
Please look at the Metrics page, which lists all the metrics exposed by the CEEMS exporter.