Energy consumption estimation

CPU

TODO

GPU

CEEMS leverages dcgm-exporter and amd-smi-exporter to get power consumption of GPUs. When the resource manager uses full physical GPU, estimating the power consumption of each compute unit is straightforward as CEEMS exporter already exports a metric that maps the compute unit ID to GPU ordinal. However, NVIDIA GPUs support sharing of one physical GPU among different compute units using Multi Instance GPU (MIG) and GRID vGPU strategies. Currently, dcgm-exporter does not estimate power consumption of each MIG instance or vGPU. Thus, CEEMS uses the following approximation to estimate power consumption of shared GPU instances.

MIG

CEEMS exporter uses an approximation based on number of Streaming Multiprocessors (SM) for each MIG instance profile. For instance, in a typical A100 40GB card, a full GPU can be split into following profiles:

$ nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19       13          6:1     |
+----------------------------------------------------+
|   0  MIG 2g.10gb      14        5          4:2     |
+----------------------------------------------------+
|   0  MIG 4g.20gb       5        1          0:4     |
+----------------------------------------------------+

It means MIG instance 4g.20gb has 4/7 of SMs, 2g.10gb has 2/7 of SMs and 1g.5gb has 1/7 of SMs. Consequently, the power consumed by entire GPU is divided amongst the different MIG instances in the ratio of their SMs and their usage respectively. For example, if the physical GPU's power consumption is P Watts, and if there are two other compute units using instance 1g.5gb and 4g.20gb, the power consumption of each MIG profile will be estimated as follows:

1g.5gb: P * (1 * SM_USAGE(1g.5gb) * SM_OCC(1g.5gb))/((1 * SM_USAGE(1g.5gb) * SM_OCC(1g.5gb) + (7 * SM_USAGE(4g.20gb) * SM_OCC(4g.20gb))))
4g.20gb: P * (7 * SM_USAGE(4g.20gb) * SM_OCC(4g.20gb))/((1 * SM_USAGE(1g.5gb) * SM_OCC(1g.5gb) + (7 * SM_USAGE(4g.20gb) * SM_OCC(4g.20gb))))

where SM_USAGE and SM_OCC are SMs usage and occupancy, respectively. The above formula saying that the total power usage of each instance is the effective usage of SMs by one instance relative to all the instances that are being used at a given time.

The exporter will export the coefficient for each MIG instance which can be used along with power consumption metric of dcgm-exporter to estimate power consumption of individual MIG instances.

vGPU

In the case of Libvirt, besides MIG it supports GRID vGPU time sharing. The following scenarios are possible when GPUs are present on the compute node:

PCI pass through of NVIDIA and AMD GPUs to the guest VMs
Virtualization of full GPUs using NVIDIA Grid vGPU
Virtualization of MIG GPUs using NVIDIA Grid vGPU

If a GPU is added to VM using PCI pass through, this GPU will not be available for the hypervisor and hence, it cannot be queried or monitored. This is due to the fact that the GPU will be unbound from the hypervisor and bound to guest. Thus, energy consumption and GPU metrics for GPUs using PCI passthrough will only be available in the guest.

NVIDIA's vGPU uses mediated devices to expose GPUs in the guest and thus, GPUs can be queried and monitored from both hypervisor and guest. However, CEEMS rely on dcgm-exporter to export GPU energy consumption and usage metrics and it does not support usage and energy consumption metrics for vGPUs. Thus, CEEMS exporter uses the following approximation method to estimate energy consumption of each vGPU which in-turn gives energy consumption of each guest VM.

NVIDIA Grid vGPU time slicing divides the GPU resources equally among all the active vGPUs at any given time and schedule the work on the given physical GPU. Thus, if there are 4 vGPUs active on a given physical GPU, each vGPU will get 25% of the full GPU compute power. Thus, a reasonable approximation would be to split the current physical GPU power consumption equally among all vGPUs. The same applies when using vGPU on the top of MIG partition. MIG already divides the physical GPU into different profiles by assigning a given number of SMs for each profile as discussed in SLURM collector above. When multiple vGPUs are running on the top of MIG instance, this coefficient is further divided by number of active vGPUs. For instance, if there are 4 vGPUs scheduled on a MIG profile 4g.20gb, the power estimated by the above formula will be further divided by 4 to attribute the power usage by each VM.

CPU​

GPU​

MIG​

vGPU​

CPU

GPU

MIG

vGPU