Prometheus

To use the dashboards provided in the repository, a small metric_relabel_configs section must be added to every target group whose nodes have NVIDIA GPUs and use dcgm-exporter to export GPU metrics to Prometheus.

The following example scrape config is for target nodes that contain NVIDIA GPUs:

scrape_configs:
  - job_name: "gpu-node-group"
    metric_relabel_configs:
      # Copy the dcgm-exporter labels to the names expected by CEEMS
      - source_labels: [UUID]
        regex: (.*)
        target_label: gpuuuid
        replacement: $1
        action: replace
      - source_labels: [GPU_I_ID]
        regex: (.*)
        target_label: gpuiid
        replacement: $1
        action: replace
      # Drop the original and unused labels
      - regex: UUID
        action: labeldrop
      - regex: GPU_I_ID
        action: labeldrop
      - regex: modelName
        action: labeldrop
    static_configs:
      # Targets are host:port only; Prometheus rejects a URL scheme here
      - targets: ["gpu-0:9400", "gpu-1:9400", ...]

The metric_relabel_configs copies the UUID and GPU_I_ID labels, which hold the UUID and the MIG instance ID of the GPU respectively, into gpuuuid and gpuiid, the label names that are compatible with the CEEMS exporter. The config then drops the original UUID and GPU_I_ID labels, along with the unused modelName label, to reduce storage and cardinality.
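
To make the effect of these rules concrete, the following Python sketch mimics the two actions used above (replace and labeldrop) on a single label set. It is only an illustration, not Prometheus code: the function apply_rules, the SAMPLE labels and their values are hypothetical, and only the `$1` form of replacement is handled. It assumes, as Prometheus does, that regexes are fully anchored and that a replace producing an empty value leaves the target label unset.

```python
import re

# The relabel rules from the scrape config above, written as plain dicts.
RULES = [
    {"action": "replace", "source_labels": ["UUID"], "regex": "(.*)",
     "target_label": "gpuuuid", "replacement": "$1"},
    {"action": "replace", "source_labels": ["GPU_I_ID"], "regex": "(.*)",
     "target_label": "gpuiid", "replacement": "$1"},
    {"action": "labeldrop", "regex": "UUID"},
    {"action": "labeldrop", "regex": "GPU_I_ID"},
    {"action": "labeldrop", "regex": "modelName"},
]


def apply_rules(labels, rules):
    """Return a copy of `labels` after applying `rules` in order (hypothetical helper)."""
    labels = dict(labels)
    for rule in rules:
        pattern = re.compile(rule["regex"])
        if rule["action"] == "replace":
            # Source label values are joined with ';'; missing labels count as empty.
            value = ";".join(labels.get(name, "") for name in rule["source_labels"])
            match = pattern.fullmatch(value)
            if match is None:
                continue
            new_value = rule["replacement"].replace("$1", match.group(1))
            if new_value:
                labels[rule["target_label"]] = new_value
            else:
                # An empty result removes the target label instead of setting it.
                labels.pop(rule["target_label"], None)
        elif rule["action"] == "labeldrop":
            # Drop every label whose *name* fully matches the regex.
            labels = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
    return labels


# Hypothetical label set for one dcgm-exporter sample (values are made up).
SAMPLE = {
    "__name__": "DCGM_FI_DEV_GPU_UTIL",
    "gpu": "0",
    "UUID": "GPU-aaaa-bbbb",
    "GPU_I_ID": "3",
    "modelName": "example-model",
}

if __name__ == "__main__":
    print(apply_rules(SAMPLE, RULES))
```

In this sketch the sample keeps its other labels, gains gpuuuid and gpuiid, and loses UUID, GPU_I_ID and modelName. For a GPU without MIG, where GPU_I_ID is empty or absent, no gpuiid label is added.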