Skip to main content

Prometheus

In order to use the dashboards provided in the repository, minor metric_relabel_configs configuration must be provided for all target groups that have NVIDIA GPUs where the dcgm-exporter exports GPU metrics to Prometheus.

The following example shows scrape configurations where the target nodes contain NVIDIA GPUs:

scrape_configs:
# Scrape job containing NVIDIA DCGM exporter targets
- job_name: <job-name>
metric_relabel_configs:
- source_labels:
- modelName
- UUID
target_label: gpuuuid
regex: NVIDIA(.*);(.*)
replacement: $2
action: replace
- source_labels:
- modelName
- GPU_I_ID
target_label: gpuiid
regex: NVIDIA(.*);(.*)
replacement: $2
action: replace
- regex: UUID
action: labeldrop
- regex: GPU_I_ID
action: labeldrop

# Scrape job containing AMD SMI exporter targets
- job_name: <job-name>
metric_relabel_configs:
- source_labels:
- gpu_power
target_label: index
regex: (.*)
replacement: $1
action: replace
- source_labels:
- index
- gpu_use_percent
target_label: index
regex: ;(.+)
replacement: $1
action: replace
- source_labels:
- index
- gpu_memory_use_percent
target_label: index
regex: ;(.+)
replacement: $1
action: replace
- regex: gpu_power
action: labeldrop
- regex: gpu_use_percent
action: labeldrop
- regex: gpu_memory_use_percent
action: labeldrop

The metric_relabel_configs section renames the UUID and GPU_I_ID labels (which represent the UUID and MIG instance ID of the NVIDIA GPU, respectively) to gpuuuid and gpuiid, making them compatible with the CEEMS exporter. Moreover, the configuration also drops the unused UUID and GPU_I_ID labels to reduce storage usage.

Similarly, for AMD SMI exporter targets, the metric_relabel_configs section extracts the GPU index from the gpu_power, gpu_use_percent, and gpu_memory_use_percent labels and maps it to the index label, which is compatible with the CEEMS exporter.