cacct
cacct is a CLI client that can be used instead of Grafana when operators cannot or do not wish to maintain a Grafana instance. It communicates with both the CEEMS API server and the TSDB server to fetch energy, usage, and performance metrics for a given compute unit, project, and/or user. It is largely inspired by SLURM's sacct tool, and its interface resembles that of sacct.
cacct identifies the current username from the invoking user's Linux UID. Thus, for cacct to work correctly, the user's UID must be the same on the machine where cacct is executed and in the CEEMS API server database.
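Operators who want to confirm this mapping on a login node can inspect what the local name service reports for the invoking user. The snippet below is only a sketch: the helper name current_identity is hypothetical, and it relies on the standard id utility rather than on anything specific to cacct.

```shell
#!/bin/sh
# Hypothetical helper (not part of CEEMS): print "<uid> <username>" for the
# user who runs this script.
current_identity() {
    # Numeric UID of the invoking user.
    uid="$(id -u)"
    # id -un resolves that UID through the local name service
    # (for example /etc/passwd, LDAP, or SSSD).
    name="$(id -un)"
    printf '%s %s\n' "$uid" "$name"
}

current_identity
```

The username printed here should be the one under which the user's compute units are recorded on the CEEMS API server side.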
This tool has been specifically designed for HPC platforms where there is a common login node that users can access via SSH. The tool must be installed on such login nodes along with its configuration file. The cacct configuration file contains the HTTP client configuration details needed to connect to the CEEMS API and TSDB servers. Consequently, this configuration file might contain secrets for communicating with these servers, making it crucial to protect this file on a multi-tenant system like HPC login nodes. This will be discussed further in the following sections. First, let's examine the available configuration sections for cacct:
# cacct configuration skeleton
ceems_api_server: <CEEMS API SERVER CONFIG>
tsdb: <TSDB CONFIG>
cacct always looks for its configuration file at /etc/ceems/config.yml or /etc/ceems/config.yaml. Therefore, the configuration file must be installed in one of these locations.
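To check which of the two locations is populated on a login node, a small lookup like the following can help. This is a sketch under assumptions: the helper name find_cacct_config is hypothetical, and the order in which the two paths are tried here is a guess, not documented cacct behavior.

```shell
#!/bin/sh
# Hypothetical helper (not part of CEEMS): print the first existing regular
# file among the candidate paths given as arguments.
find_cacct_config() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # None of the candidates exists.
    return 1
}

# The two locations named above; the lookup order is an assumption.
find_cacct_config /etc/ceems/config.yml /etc/ceems/config.yaml \
    || echo "no cacct configuration file found"
```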
A sample configuration file with only the CEEMS API Server configuration is presented below:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
The above configuration assumes that the target cluster has slurm-0 as its cluster ID, as configured in the CEEMS API server configuration. By default, the CEEMS API server expects the username in the X-Grafana-User header, so cacct sets this header to the username making the request. Finally, the web section contains the HTTP client configuration for the CEEMS API server. In this example, the CEEMS API server is reachable at host ceems-api-server on port 9020, and basic authentication is configured.
cacct can pull time series data from the TSDB server for the requested compute units. This is possible only when the tsdb section is configured. A sample configuration file including both CEEMS API server and TSDB server configurations is shown below:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
tsdb:
  web:
    url: http://tsdb:9090
    basic_auth:
      username: prometheus
      password: anothersupersecretpassword
  queries:
    # CPU utilization
    cpu_usage: uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}
    # CPU memory utilization
    cpu_mem_usage: uuid:ceems_cpu_memory_usage:ratio{uuid=~"%s"}
    # Host power usage in Watts
    host_power_usage: uuid:ceems_host_power_watts:pue{uuid=~"%s"}
    # Host emissions in g/s
    host_emissions: uuid:ceems_host_emissions_g_s:pue{uuid=~"%s"}
    # GPU utilization
    avg_gpu_usage: uuid:ceems_gpu_usage:ratio{uuid=~"%s"}
    # GPU memory utilization
    avg_gpu_mem_usage: uuid:ceems_gpu_memory_usage:ratio{uuid=~"%s"}
    # GPU power usage in Watts
    gpu_power_usage: uuid:ceems_gpu_power_watts:pue{uuid=~"%s"}
    # GPU emissions in g/s
    gpu_emissions: uuid:ceems_gpu_emissions_g_s:pue{uuid=~"%s"}
    # Read IO bytes/s
    io_read_bytes: irate(ceems_ebpf_read_bytes_total{uuid=~"%s"}[1m])
    # Write IO bytes/s
    io_write_bytes: irate(ceems_ebpf_write_bytes_total{uuid=~"%s"}[1m])
Similar to the CEEMS API server configuration, this example assumes the TSDB server is reachable at tsdb:9090 and basic authentication is configured on the HTTP server. The tsdb.queries section is where operators configure the queries to pull time series data for each metric. If operators used ceems_tool to generate recording rules for the TSDB, the queries in the sample configuration above will work out-of-the-box. The keys in the queries object can be chosen freely; they are provided for configuration file maintainability. The placeholder %s will be replaced by the compute unit UUIDs at runtime before executing the queries on the TSDB server.
There is no risk of injection here, as the UUID values provided by the end-user are first sanitized and then verified with the CEEMS API server to check if the user is the owner of the compute unit before passing them to the TSDB server.
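The placeholder substitution can be illustrated with a short sketch. The helper render_query is hypothetical, and joining several UUIDs with "|" is an assumption based on the regex matcher =~ used in the sample queries; how cacct actually assembles the value is not specified here. The sanitization and ownership checks described above are not reproduced.

```shell
#!/bin/sh
# Hypothetical helper (not part of CEEMS): fill the %s placeholder of a query
# template with one or more compute unit UUIDs.
render_query() {
    template="$1"
    shift
    # Join the remaining arguments with "|" so they form a regex alternation
    # (assumed joining rule).
    uuids="$(IFS='|'; printf '%s' "$*")"
    # The template is deliberately used as the printf format string so that
    # its %s placeholder receives the joined UUIDs.
    printf "$template\n" "$uuids"
}

render_query 'uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}' 1234 1235
# prints: uuid:ceems_cpu_usage:ratio_irate{uuid=~"1234|1235"}
```

Because the template acts as a format string, a query written for this sketch should contain exactly one %s and no other % sequences.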
A complete reference can be found in the Reference section. A valid sample configuration file can be found in the repository.
Securing configuration file
As evident from the previous section, the cacct configuration file contains secrets that should not be accessible to end-users. At the same time, the cacct executable must be accessible to end-users so they can fetch their usage statistics. This means cacct must be able to read the configuration file at runtime, but the user executing it should not. This can be achieved with the SETUID or SETGID permission bits (not to be confused with the sticky bit, which only restricts file deletion in shared directories).
When the SETUID or SETGID bit is set on an executable, the binary runs with the effective user or group ID of the file's owner or group rather than that of the user who invokes it. For instance, imagine a case where a system user/group ceems is created on an HPC login node. The SETGID bit can be set on cacct as follows:
chown ceems:ceems /usr/local/bin/cacct
chmod g+s /usr/local/bin/cacct
# Ensure others can execute cacct
chmod o+x /usr/local/bin/cacct
# Use the same user/group as owner:group for the cacct configuration file
chown ceems:ceems /etc/ceems/config.yml
# Revoke all permissions for others
chmod o-rwx /etc/ceems/config.yml
Now, every time cacct is invoked, it runs with ceems as its effective group instead of the group of the user who invoked it. Since the same group owns /etc/ceems/config.yml and, provided the group has read permission on it, cacct can read the file. At the same time, the user who invoked the cacct binary cannot access /etc/ceems/config.yml directly because all permissions for others have been revoked.
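The resulting layout can be sanity-checked from an ordinary user account with the standard test operators. The function check_cacct_install below is a hypothetical helper written for illustration, not something shipped with CEEMS. It should be run as an unprivileged user, since root can read any file.

```shell
#!/bin/sh
# Hypothetical helper (not part of CEEMS): verify the permission layout
# described above. Arguments: path to the cacct binary, path to its config.
check_cacct_install() {
    bin="$1"
    conf="$2"
    status=0

    # -g is true when the file exists and has the set-group-ID bit.
    if [ ! -g "$bin" ]; then
        echo "FAIL: setgid bit is not set on $bin"
        status=1
    fi

    # Ordinary users must still be able to execute the binary.
    if [ ! -x "$bin" ]; then
        echo "FAIL: $bin is not executable by the current user"
        status=1
    fi

    # The invoking user must NOT be able to read the secrets directly.
    if [ -r "$conf" ]; then
        echo "FAIL: $conf is readable by the current user"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "OK: permissions look as expected"
    return "$status"
}

# Example invocation with the paths used in this section:
#   check_cacct_install /usr/local/bin/cacct /etc/ceems/config.yml
```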
When cacct is installed using the RPM/DEB file provided by the CEEMS Releases, cacct is already installed with the necessary SETUID/SETGID permissions set. Operators only need to populate the configuration file at /etc/ceems/config.yml.