Prometheus

To use the dashboards provided in the repository, a small metric_relabel_configs section must be added to every target group whose nodes have NVIDIA GPUs and where dcgm-exporter exports the GPU metrics to Prometheus.

The following shows an example scrape config where the target nodes contain NVIDIA GPUs:

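scrape_configs:
  - job_name: "gpu-node-group"
    metric_relabel_configs:
      # Join UUID and GPU_I_ID with '/' and store the result in gpuuuid
      - source_labels: [UUID, GPU_I_ID]
        separator: '/'
        target_label: gpuuuid
      # Drop labels that are no longer needed
      - regex: UUID
        action: labeldrop
      - regex: modelName
        action: labeldrop
    static_configs:
      # Targets are host:port pairs; Prometheus rejects a URL scheme here
      - targets: ["gpu-0:9400", "gpu-1:9400", ...]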

The metric_relabel_configs merges the labels UUID and GPU_I_ID, which are the UUID and the MIG instance ID of the GPU, respectively, joins them with '/' and writes the result to the label gpuuuid, which is compatible with CEEMS exporter. The config also drops the unused UUID and modelName labels to reduce storage and cardinality.