Compute Energy & Emissions Monitoring Stack (CEEMS)
CEEMS is in an early phase of development and is therefore subject to breaking changes, with no guarantee of backward compatibility.
Features
- Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, OpenStack, k8s)
- Supports NVIDIA (MIG and vGPU) and AMD GPUs
- Provides targets to Grafana Alloy through an HTTP discovery component so that compute units can be profiled continuously
- Realtime access to metrics via Grafana dashboards
- Access control to Prometheus datasource in Grafana
- Stores aggregated metrics in a separate DB that can be retained for a long time
- CEEMS apps are capability aware
Components
CEEMS provides a set of components that enable operators to monitor the resource consumption of compute units managed by different resource managers such as SLURM, OpenStack and Kubernetes.
- CEEMS Prometheus exporter is capable of exporting compute unit metrics including energy consumption, performance, IO and network metrics from different resource managers in a unified manner.
- CEEMS API server can store the aggregate metrics and metadata of each compute unit originating from different resource managers.
- CEEMS load balancer provides basic access control on TSDB so that compute unit metrics from different projects/tenants/namespaces are isolated.
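The kind of isolation the load balancer is described as providing can be illustrated with a minimal sketch. This is not the CEEMS implementation: the function name, the mappings, and the identifiers below are all hypothetical, and the rule shown (a query is allowed only when every requested compute unit belongs to one of the user's projects) is an assumption made for illustration.

```python
from typing import Dict, Iterable, Set


def is_query_allowed(
    user: str,
    requested_units: Iterable[str],
    unit_project: Dict[str, str],
    user_projects: Dict[str, Set[str]],
) -> bool:
    """Hypothetical check: allow a query only if every requested compute
    unit belongs to a project that the user is a member of."""
    allowed_projects = user_projects.get(user, set())
    units = list(requested_units)
    if not units:
        # Assumption: a query that names no compute unit is rejected.
        return False
    for unit in units:
        project = unit_project.get(unit)
        # Unknown units and units owned by other projects are both denied.
        if project is None or project not in allowed_projects:
            return False
    return True


# Hypothetical ownership data, for illustration only.
unit_project = {"job-1001": "proj-a", "job-1002": "proj-b"}
user_projects = {"alice": {"proj-a"}, "bob": {"proj-a", "proj-b"}}

print(is_query_allowed("alice", ["job-1001"], unit_project, user_projects))
print(is_query_allowed("alice", ["job-1001", "job-1002"], unit_project, user_projects))
```

In this sketch the decision is deny-by-default: an unknown user, an unknown compute unit, or an empty request all result in a refusal.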
"Compute Unit" has a broad meaning in this context: it can be a batch job in HPC, a VM in the cloud, a pod in k8s, etc. The main objective of the stack is to quantify the energy consumed by each compute unit and to estimate the corresponding emissions. The repository itself does not provide any frontend apps to show dashboards; it is meant to be used alongside Grafana and Prometheus to present statistics to users.
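As a rough illustration of that objective, emissions are commonly estimated by multiplying the energy consumed by an emission factor for the electricity used. The sketch below assumes this simple model; the function name, the compute unit identifiers, the energy readings, and the emission factor are all hypothetical and do not come from CEEMS.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def estimate_emissions_g(energy_joules: float, factor_g_per_kwh: float) -> float:
    """Estimate emissions in grams of CO2e from energy in joules and an
    emission factor expressed in g CO2e per kWh."""
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * factor_g_per_kwh


# Hypothetical energy readings per compute unit, in joules.
energy_by_unit = {
    "slurm-job-1001": 7.2e6,   # a batch job
    "openstack-vm-42": 1.8e6,  # a VM
}
factor = 50.0  # hypothetical emission factor, g CO2e per kWh

for unit, joules in energy_by_unit.items():
    grams = estimate_emissions_g(joules, factor)
    print(f"{unit}: {joules / JOULES_PER_KWH:.2f} kWh, {grams:.1f} g CO2e")
```

In practice the emission factor varies by region and over time, so a real estimate would need a factor that matches when and where the compute unit ran.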
Currently, only SLURM and OpenStack are supported as resource managers. Support for Kubernetes will be added in the future.