Compute Energy & Emissions Monitoring Stack (CEEMS)


WARNING

CEEMS is in an early development phase and is therefore subject to breaking changes, with no guarantee of backward compatibility.

Features

Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, OpenStack, k8s)
Supports different energy sources such as RAPL, HWMON, Cray's PM Counters and BMC via IPMI or Redfish
Supports NVIDIA GPUs (MIG, time sharing, MPS and vGPU) and AMD GPUs (partitions such as CPX, QPX, TPX and DPX)
Supports zero-instrumentation, eBPF-based continuous profiling using Grafana Pyroscope as the backend
Provides real-time access to metrics via Grafana dashboards or a simple CLI tool
Provides access control to Prometheus and Pyroscope datasources in Grafana
Stores aggregated metrics in a separate DB that can be retained for a long time
CEEMS apps are capability aware
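
As an illustration of what consuming an energy source such as RAPL involves, the sketch below converts two readings of a cumulative energy counter into average power. On Linux, RAPL is commonly exposed through the powercap sysfs interface as a counter in microjoules (`energy_uj`) that wraps around at `max_energy_range_uj`. The function name and parameters here are hypothetical and are not taken from the CEEMS code base; this is only a minimal sketch of the arithmetic.

```python
def average_power_watts(prev_uj, curr_uj, max_range_uj, interval_s):
    """Average power in watts between two readings of a wrapping energy counter.

    prev_uj, curr_uj: counter readings in microjoules (assumed cumulative)
    max_range_uj: value at which the counter wraps back to zero
    interval_s: seconds elapsed between the two readings
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_uj >= prev_uj:
        delta_uj = curr_uj - prev_uj
    else:
        # The counter wrapped around between the two readings
        # (assumes at most one wrap per interval).
        delta_uj = (max_range_uj - prev_uj) + curr_uj
    # Microjoules -> joules, then joules per second = watts.
    return (delta_uj / 1_000_000) / interval_s
```

In a Prometheus-based setup, the same conversion is usually left to the query layer by applying a rate over the exported counter, so a helper like this is mainly useful for quick checks on a node.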

Components

CEEMS provides a set of components that enable operators and end users to monitor the resource consumption of compute units managed by different resource managers such as SLURM, OpenStack and Kubernetes.

The CEEMS Prometheus exporter exports compute unit metrics, including energy consumption, performance, IO and network metrics, from different resource managers in a unified manner. In addition, the CEEMS exporter is capable of continuous profiling of compute units using eBPF.
The CEEMS API server stores the aggregate metrics and metadata of each compute unit originating from different resource managers.
The CEEMS load balancer provides basic access control on the TSDB and Pyroscope so that compute unit metrics from different projects/tenants/namespaces are isolated.

A "compute unit" in the current context has a wide scope: it can be a batch job in HPC, a VM in the cloud, a pod in k8s, etc. The main objective of the stack is to quantify the energy consumed by each compute unit and to estimate the corresponding emissions. The repository itself does not provide any frontend apps to show dashboards; it is meant to be used along with Grafana and Prometheus to show statistics to users.
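
The relation between energy and emissions can be sketched as energy (in kWh) multiplied by an emission factor (in gCO2e per kWh). The snippet below is a minimal illustration of that arithmetic only. The function names, the unit identifiers and the emission factor value are assumptions made for the example and do not reflect the CEEMS implementation or any real grid data.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def estimate_emissions_g(energy_joules, emission_factor_g_per_kwh):
    """Estimate emissions in gCO2e from energy in joules and a grid emission factor."""
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * emission_factor_g_per_kwh


def emissions_per_unit(energy_by_unit, emission_factor_g_per_kwh):
    """Apply the same emission factor to every compute unit (job, VM, pod, ...)."""
    return {
        unit: estimate_emissions_g(joules, emission_factor_g_per_kwh)
        for unit, joules in energy_by_unit.items()
    }


if __name__ == "__main__":
    # Hypothetical per-unit energy totals in joules and an illustrative factor.
    energy = {"job-1": 7_200_000, "vm-a": 1_800_000}
    print(emissions_per_unit(energy, emission_factor_g_per_kwh=100.0))
```

A real deployment would typically use an emission factor that varies with time and location, so a single constant as used here is only a simplification.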
