
Compute Energy & Emissions Monitoring Stack (CEEMS)

WARNING

CEEMS is in an early development phase and is subject to breaking changes, with no guarantee of backward compatibility.

Features

  • Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, OpenStack, k8s)
  • Supports different energy sources like RAPL, HWMON, Cray's PM Counters and BMC via IPMI or Redfish
  • Supports NVIDIA (MIG and vGPU) and AMD GPUs
  • Provides targets to Grafana Alloy through its HTTP discovery component so that compute units can be profiled continuously
  • Real-time access to metrics via Grafana dashboards
  • Access control for Prometheus and Pyroscope data sources in Grafana
  • Stores aggregated metrics in a separate DB that can be retained for a long time
  • CEEMS apps are capability-aware
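Energy sources such as RAPL typically expose a cumulative energy counter rather than instantaneous power. On Linux, the powercap interface reports it in files named energy_uj and max_energy_range_uj. The following is a minimal sketch of turning two counter readings into an energy delta and an average power. The function names and sample values are hypothetical, and the code is not taken from CEEMS; it assumes the counter is reported modulo the maximum range and wraps at most once between readings.

```go
package main

import "fmt"

// energyDelta returns the microjoules consumed between two counter readings.
// Assumption: the counter increases monotonically and is reported modulo
// maxRange, so a smaller current value means it wrapped around exactly once.
func energyDelta(prev, curr, maxRange uint64) uint64 {
	if curr >= prev {
		return curr - prev
	}
	// Counter wrapped: energy up to the wrap point plus energy since the wrap.
	return (maxRange - prev) + curr
}

// averagePowerWatts converts an energy delta in microjoules, measured over an
// interval in seconds, into average power in watts.
func averagePowerWatts(deltaMicrojoules uint64, intervalSeconds float64) float64 {
	return float64(deltaMicrojoules) / 1e6 / intervalSeconds
}

func main() {
	// Hypothetical readings taken 2 seconds apart, with a wrap in between.
	delta := energyDelta(90_000_000, 10_000_000, 100_000_000)
	fmt.Printf("delta=%d uJ, power=%.1f W\n", delta, averagePowerWatts(delta, 2))
}
```

Handling the wraparound explicitly matters because a naive subtraction on unsigned values would produce a huge, meaningless delta whenever the counter resets.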

Components

CEEMS provides a set of components that enable operators and end users to monitor the resource consumption of compute units managed by different resource managers such as SLURM, OpenStack and Kubernetes.

  • The CEEMS Prometheus exporter exports compute unit metrics, including energy consumption, performance, IO and network metrics, from different resource managers in a unified manner. In addition, the exporter can provide targets to Grafana Alloy for continuously profiling compute units using eBPF.

  • The CEEMS API server stores the aggregate metrics and metadata of each compute unit originating from different resource managers.

  • The CEEMS load balancer provides basic access control on TSDB and Pyroscope so that compute unit metrics from different projects/tenants/namespaces are isolated.
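To illustrate the exporter's role as a target provider for Grafana Alloy, here is a minimal sketch of serializing a target list in the JSON layout used by Prometheus-style HTTP service discovery: an array of groups, each with `targets` and `labels`. The type name, the address, and the `uuid` label are hypothetical; the actual fields emitted by CEEMS are not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors one entry of an HTTP service discovery response:
// a list of addresses plus labels shared by all of them.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// encodeTargets renders the groups as the JSON array a discovery endpoint
// would return in its response body.
func encodeTargets(groups []targetGroup) (string, error) {
	b, err := json.Marshal(groups)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Hypothetical compute unit identified by a "uuid" label.
	groups := []targetGroup{
		{Targets: []string{"10.0.0.1:9010"}, Labels: map[string]string{"uuid": "1234"}},
	}
	out, err := encodeTargets(groups)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Attaching the compute unit identifier as a label lets the consumer associate each collected profile with the job, VM or pod it came from.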

"Compute unit" has a broad meaning in this context: it can be a batch job in HPC, a VM in a cloud, a pod in k8s, etc. The main objective of the stack is to quantify the energy consumed by each "compute unit" and to estimate the resulting emissions. The repository itself does not provide any frontend apps to show dashboards; it is meant to be used alongside Grafana and Prometheus to present statistics to users.
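As a rough illustration of the emissions estimate, the energy of a compute unit can be converted to kWh and multiplied by an emission factor expressed in grams of CO2 equivalent per kWh. The sketch below assumes a single constant factor; the function name and the sample values are hypothetical and do not describe how CEEMS itself obtains or applies its factors.

```go
package main

import "fmt"

// joulesPerKWh is the number of joules in one kilowatt-hour.
const joulesPerKWh = 3.6e6

// emissionsGrams estimates emissions in grams of CO2 equivalent from the
// energy consumed (joules) and an emission factor (gCO2e per kWh).
func emissionsGrams(energyJoules, factorGramsPerKWh float64) float64 {
	return energyJoules / joulesPerKWh * factorGramsPerKWh
}

func main() {
	// Hypothetical compute unit: 7.2 MJ (2 kWh) at an assumed 100 gCO2e/kWh.
	fmt.Printf("%.1f gCO2e\n", emissionsGrams(7.2e6, 100))
}
```

In practice the emission factor varies by region and over time, so a constant value only gives a first-order estimate.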

Note

Currently, only SLURM and OpenStack are supported as resource managers. Support for Kubernetes will be added in the future.