Prometheus
In order to use the dashboards provided in the repository, a minor
metric_relabel_configs
must be provided for all the target groups that have NVIDIA GPUs where
dcgm-exporter
is exporting metrics of the GPUs to Prometheus.
The following shows an example scrape configs where the target nodes contains NVIDIA GPUs:
scrape_configs:
# Scrape job containing NVIDIA DCGM exporter targets
- job: <job-name>
metric_relabel_configs:
- source_labels:
- modelName
- UUID
target_label: gpuuuid
regex: NVIDIA(.*);(.*)
replacement: $2
action: replace
- source_labels:
- modelName
- GPU_I_ID
target_label: gpuiid
regex: NVIDIA(.*);(.*)
replacement: $2
action: replace
- regex: UUID
action: labeldrop
- regex: GPU_I_ID
action: labeldrop
# Scrape job containing AMD SMI exporter targets
- job: <job-name>
metric_relabel_configs:
- source_labels:
- gpu_power
target_label: index
regex: (.*)
replacement: $1
action: replace
- source_labels:
- index
- gpu_use_percent
target_label: index
regex: ;(.+)
replacement: $1
action: replace
- source_labels:
- index
- gpu_memory_use_percent
target_label: index
regex: ;(.+)
replacement: $1
action: replace
- regex: gpu_power
action: labeldrop
- regex: gpu_use_percent
action: labeldrop
- regex: gpu_memory_use_percent
action: labeldrop
The metric_relabel_configs
renames UUID
and GPU_I_ID
which are
the UUID and MIG instance ID of NVIDIA GPU, respectively and sets it to gpuuuid
and
gpuiid
which are compatible with CEEMS exporter. Moreover the config also drops unused
UUID
and GPU_I_ID
labels to reduce storage.
Similarly, for AMD SMI exporter targets, metric_relabel_configs
gpu_power
,
gpu_use_percent
and gpu_memory_use_percent
labels, which provides GPU index,
to index
that is compatible with CEEMS exporter.