CEEMS Exporter
Background
ceems_exporter is the Prometheus exporter that exposes individual compute unit metrics, RAPL energy, IPMI power consumption, emission factor, GPU to compute unit mapping, performance metrics, and IO and network metrics. The exporter also supports an HTTP discovery component that can provide a list of targets to Grafana Alloy.
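As a rough illustration of what a list of targets can look like, the sketch below parses a payload in the Prometheus HTTP service discovery JSON format (a list of groups, each with `targets` and optional `labels`), which is the format Grafana Alloy's HTTP discovery consumes. The payload contents, host names, port and label names are made-up assumptions for illustration and do not come from the exporter itself.

```python
import json

# Hypothetical payload in the Prometheus HTTP service discovery format.
# Host names, port and label values are invented for this example.
SAMPLE_RESPONSE = """
[
  {"targets": ["compute-0:9010"], "labels": {"cluster": "demo"}},
  {"targets": ["compute-1:9010", "compute-2:9010"], "labels": {"cluster": "demo"}}
]
"""


def parse_target_groups(payload: str) -> list[tuple[str, dict[str, str]]]:
    """Flatten target groups into (address, labels) pairs."""
    groups = json.loads(payload)
    flat = []
    for group in groups:
        # "labels" is optional in the format, so default to an empty mapping.
        labels = group.get("labels", {})
        for address in group["targets"]:
            flat.append((address, dict(labels)))
    return flat


targets = parse_target_groups(SAMPLE_RESPONSE)
for address, labels in targets:
    print(address, labels)
```

Each flattened pair corresponds to one scrape target together with the labels attached to its group.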
ceems_exporter collectors can be categorized as follows:
Resource manager collectors
These collectors export metrics from different resource managers.
- Slurm collector: Exports SLURM job metrics like CPU, memory and GPU indices to job ID maps
- Libvirt collector: Exports libvirt managed VMs metrics like CPU, memory, IO, etc.
Energy related collectors
These collectors export energy related metrics from different sources on the compute node.
- IPMI collector: Exports power usage reported by ipmi tools
- RAPL collector: Exports RAPL energy metrics
Emissions related collectors
This collector exports emissions related metrics that are used to estimate the carbon footprint.
- Emissions collector: Exports emission factor (g eCO2/kWh)
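To show how an emission factor feeds into a carbon footprint estimate, here is a minimal sketch of the arithmetic: energy in kWh multiplied by the emission factor in g/kWh. The function names, the counter readings and the factor of 50 g/kWh are hypothetical values chosen for illustration, and the sketch assumes an energy counter in joules that does not wrap around between the two readings.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def energy_kwh(counter_start_j: float, counter_end_j: float) -> float:
    """Energy consumed between two readings of a joule counter, in kWh.

    Assumes the counter did not wrap around between the readings.
    """
    return (counter_end_j - counter_start_j) / JOULES_PER_KWH


def emissions_g(energy_kwh_value: float, emission_factor_g_per_kwh: float) -> float:
    """Estimated emissions in grams: energy (kWh) times factor (g/kWh)."""
    return energy_kwh_value * emission_factor_g_per_kwh


# Hypothetical readings: 7.2 MJ consumed between the two samples.
used = energy_kwh(1_000_000.0, 8_200_000.0)
# Hypothetical emission factor of 50 g/kWh.
footprint = emissions_g(used, 50.0)
print(f"{used:.2f} kWh, {footprint:.1f} g")
```

In practice the emission factor varies over time, so an estimate over a long interval would apply the factor valid for each sub-interval rather than a single constant.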
Node metrics collectors
These collectors export node level metrics.
- CPU collector: Exports CPU time in different modes (at node level)
- Meminfo collector: Exports memory related statistics (at node level)
Perf related collectors
In addition to the collectors listed above, there are common "sub-collectors" that can be reused with different collectors. These sub-collectors provide auxiliary metrics such as IO, networking and performance. The currently available sub-collectors are:
- Perf sub-collector: Exports hardware, software and cache performance metrics
- eBPF sub-collector: Exports IO and network related metrics
- RDMA sub-collector: Exports selected RDMA stats
These sub-collectors are not meant to work alone: they can be enabled only when a main collector that monitors a resource manager's compute units is activated.
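The dependency between sub-collectors and main collectors can be sketched as a small validation check. The collector labels and the `validate_enabled` helper below are hypothetical names used only to illustrate the rule. They are not the exporter's actual flags or API.

```python
# Hypothetical labels for the collectors described above.
MAIN_COLLECTORS = {"slurm", "libvirt"}     # monitor resource manager compute units
SUB_COLLECTORS = {"perf", "ebpf", "rdma"}  # auxiliary, cannot run alone


def validate_enabled(enabled: set[str]) -> list[str]:
    """Return the sub-collectors that cannot run with this selection.

    An empty list means the selection is consistent with the rule that
    sub-collectors need at least one main collector to be active.
    """
    if enabled & MAIN_COLLECTORS:
        return []
    return sorted(enabled & SUB_COLLECTORS)


print(validate_enabled({"slurm", "perf"}))        # consistent selection
print(validate_enabled({"cpu", "ebpf", "rdma"}))  # sub-collectors without a main collector
```

A selection with only node level collectors and sub-collectors is reported as invalid, because nothing is tracking the compute units that the sub-collector metrics would be attached to.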