cacct
cacct is a CLI client that can be used in place of Grafana when the
operators cannot or do not wish to maintain a Grafana instance. It talks
to both the CEEMS API server and the TSDB server to fetch the energy, usage and
performance metrics of a given compute unit, project and/or user. It is
largely inspired by SLURM's sacct tool, and its interface resembles that
of sacct.
cacct determines the current username from the user's Linux UID. Thus, for
cacct to work correctly, the user's UID must be the same on the machine where
cacct is executed and in the CEEMS API server DB.
This tool has been specifically designed for HPC platforms where there is a common
login node that users can access via SSH. The tool must be installed on such
login nodes along with its configuration file. The cacct configuration file
contains the HTTP client configuration details needed to connect to the CEEMS API
and TSDB servers. This configuration file may therefore contain secrets used to
talk to these servers, and it is very important to protect it on a multi-tenant
system like an HPC login node. This is discussed further in the following sections. First,
let's take a look at the available configuration sections for cacct:
# cacct configuration skeleton
ceems_api_server: <CEEMS API SERVER CONFIG>
tsdb: <TSDB CONFIG>
cacct always looks for the configuration file at /etc/ceems/config.yml or
/etc/ceems/config.yaml. Thus, the configuration file must be installed in one
of these locations.
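As a quick check that a configuration file is in place, an operator could run something like the following shell sketch. The helper name find_cacct_config is hypothetical, and the order in which the two file names are tried is an assumption; the text above only states that both locations are supported.

```shell
# Hypothetical helper: print the first config file found in a directory.
# Assumption: config.yml is tried before config.yaml.
find_cacct_config() {
  dir="${1:-/etc/ceems}"
  for name in config.yml config.yaml; do
    if [ -f "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Report the file that would be picked up, or warn if none exists
find_cacct_config /etc/ceems || echo "no cacct config found in /etc/ceems" >&2
```

The directory argument exists only so that the helper can be tried against a scratch directory before touching /etc/ceems.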
A sample configuration file with only CEEMS API Server config is presented below:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
The above configuration assumes that the target cluster has slurm-0 as the
cluster ID configured in the CEEMS API server configuration. By default, the
CEEMS API server expects the username in the header X-Grafana-User, so cacct
sets this header to the name of the user making the request. Finally, the web
section contains the HTTP client configuration for the CEEMS API server. In the
above example, the CEEMS API server is reachable at host ceems-api-server on
port 9020, and basic auth is configured on the CEEMS API server.
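To make the effect of these settings concrete, the following shell sketch builds the two HTTP headers implied by the sample configuration. It is only an illustration: how cacct assembles its requests internally is an assumption, and the variable names are made up. The Basic scheme itself (base64 of username:password) is standard HTTP.

```shell
# Values taken from the sample configuration above
header_name='X-Grafana-User'
api_user='ceems'
api_pass='supersecretpassword'

# Name of the invoking user, resolved from the UID
current_user=$(id -un)

# HTTP Basic auth: base64 of "username:password"
token=$(printf '%s:%s' "$api_user" "$api_pass" | base64)

user_header="$header_name: $current_user"
auth_header="Authorization: Basic $token"
echo "$user_header"
echo "$auth_header"
```

Because the base64 token is trivially reversible, anyone who can read the configuration file can recover the password, which is why the file permissions matter.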
cacct can also pull the time series data of the requested compute units from
the TSDB server, but only when the tsdb section has been configured. A sample
configuration file with both CEEMS API server and TSDB server configs is:
ceems_api_server:
  cluster_id: slurm-0
  user_header_name: X-Grafana-User
  web:
    url: http://ceems-api-server:9020
    basic_auth:
      username: ceems
      password: supersecretpassword
tsdb:
  web:
    url: http://tsdb:9090
    basic_auth:
      username: prometheus
      password: anothersupersecretpassword
  queries:
    # CPU utilisation
    cpu_usage: uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}
    # CPU Memory utilisation
    cpu_mem_usage: uuid:ceems_cpu_memory_usage:ratio{uuid=~"%s"}
    # Host power usage in Watts
    host_power_usage: uuid:ceems_host_power_watts:pue{uuid=~"%s"}
    # Host emissions in g/s
    host_emissions: uuid:ceems_host_emissions_g_s:pue{uuid=~"%s"}
    # GPU utilization
    avg_gpu_usage: uuid:ceems_gpu_usage:ratio{uuid=~"%s"}
    # GPU memory utilization
    avg_gpu_mem_usage: uuid:ceems_gpu_memory_usage:ratio{uuid=~"%s"}
    # GPU power usage in Watts
    gpu_power_usage: uuid:ceems_gpu_power_watts:pue{uuid=~"%s"}
    # GPU emissions in g/s
    gpu_emissions: uuid:ceems_gpu_emissions_g_s:pue{uuid=~"%s"}
    # Read IO bytes
    io_read_bytes: irate(ceems_ebpf_read_bytes_total{uuid=~"%s"}[1m])
    # Write IO bytes
    io_write_bytes: irate(ceems_ebpf_write_bytes_total{uuid=~"%s"}[1m])
Just as for the CEEMS API server, the above configuration assumes
that the TSDB server is reachable at tsdb:9090 and that basic auth has been
configured on its HTTP server. The tsdb.queries section is where the operators
configure the queries used to pull the time series data of each metric. If
the operators have used ceems_tool to generate recording rules for TSDB, the
queries in the above sample configuration will work out-of-the-box. The keys
of the queries object can be chosen freely; they exist for the maintainability
of the configuration file. The placeholder %s will be replaced by the compute
unit UUIDs at runtime before the queries are executed on the TSDB server.
There is no risk of injection here: the UUID values provided by the end user are first sanitized and then verified against the CEEMS API server, which checks that the user owns the compute units, before they are passed to the TSDB server.
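The placeholder substitution can be illustrated with a minimal shell sketch. The UUID values are made up, and joining several UUIDs with | is an assumption based on the =~ regex matcher used in the sample queries; the exact string cacct builds internally is not specified here.

```shell
# Query template copied from the sample configuration
template='uuid:ceems_cpu_usage:ratio_irate{uuid=~"%s"}'

# Join the (made-up) UUIDs with "|" and strip the trailing separator: 1234|5678
joined=$(printf '%s|' 1234 5678)
joined=${joined%|}

# printf fills the %s placeholder with the joined UUIDs
query=$(printf "$template" "$joined")
echo "$query"
# prints: uuid:ceems_cpu_usage:ratio_irate{uuid=~"1234|5678"}
```

The resulting string is an ordinary query that selects the series of either compute unit, which is why each configured query needs exactly one %s placeholder in its uuid matcher.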
A complete reference can be found in the Reference section. A valid sample configuration file can be found in the repo.
Securing configuration file
As evident from the previous section, the cacct configuration file will contain
secrets that should not be accessible to the end users. At the same time, the
cacct executable must be accessible to the end users so that they can fetch
their usage statistics. This means that cacct must be able to read the
configuration file at runtime, but the user who is executing it must not. This
can be achieved using the SETUID or SETGID bit (often loosely called the sticky
bit, although the sticky bit proper is a different permission bit).
With the SETUID or SETGID bit set on the executable, the binary executes with the
privileges of the user or group that owns the file rather than those of the user
who invokes it. For instance, imagine a case where a system user/group ceems is
created on an HPC login node. The SETGID bit can be set on cacct as follows:
chown ceems:ceems /usr/local/bin/cacct
chmod g+s /usr/local/bin/cacct
# Ensure others can execute cacct
chmod o+x /usr/local/bin/cacct
# Use the same user/group as owner:group to cacct configuration file
chown ceems:ceems /etc/ceems/config.yml
# Revoke all the permissions for others
chmod o-rwx /etc/ceems/config.yml
Now every time cacct is invoked, it runs with the effective group ceems instead
of the group of the user who invoked it. As this group also owns the file
/etc/ceems/config.yml, cacct will be able to read the file, provided the group
has read permission on it. At the same time, the user who invoked the cacct
binary will not be able to access /etc/ceems/config.yml, as the permissions for
others have been revoked.
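The resulting permission layout can be rehearsed on scratch files before touching the real installation. The sketch below is an assumption-laden illustration: the temporary paths stand in for /usr/local/bin/cacct and /etc/ceems/config.yml, the 640 mode for the configuration file is one reasonable choice rather than a documented requirement, and stat -c is the GNU coreutils form.

```shell
# Scratch stand-ins for the real binary and configuration file
workdir=$(mktemp -d)
bin="$workdir/cacct"
cfg="$workdir/config.yml"
touch "$bin" "$cfg"

chmod 755 "$bin"   # rwxr-xr-x: everyone may execute
chmod g+s "$bin"   # add the SETGID bit, giving mode 2755
chmod 640 "$cfg"   # owner read/write, group read, nothing for others

# Read back the octal modes (GNU stat)
bin_mode=$(stat -c '%a' "$bin")
cfg_mode=$(stat -c '%a' "$cfg")
echo "binary mode: $bin_mode"
echo "config mode: $cfg_mode"

rm -rf "$workdir"
```

On a real installation, ls -l /usr/local/bin/cacct should show an s in the group execute position (such as rwxr-sr-x) once the SETGID bit is set.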
When cacct has been installed using the RPM/DEB packages provided in the
CEEMS Releases, cacct will already be installed with the SETUID/SETGID bit set,
and the operators only need to populate the configuration file at
/etc/ceems/config.yml.