Expose command stats (rusage) metrics via prometheus (!4464)

Igor requested to merge command-stats-metrics into master Apr 05, 2022

Background

We currently track rusage metrics for shelled-out commands in logs on a per-RPC basis. This allows us to get a very fine-grained view into resource attribution (though aggregated by RPC).

However, logs often do not lend themselves to coarse-grained and long-term analysis. For this reason, it is useful to also expose these metrics via Prometheus.

Problem

By aggregating that data as metrics, we aim to partially close an observability gap that exists for short-lived processes. The existing process-exporter metrics severely under-report the utilization of short-lived processes, of which gitaly spawns many.

In an extreme case, process-exporter reported git archive processes as using 1 CPU, when they were in fact using almost 20 CPUs (source).

This impacts our ability to diagnose performance issues on these hosts, as well as our ability to capacity plan and decide where to spend engineering resources.

Solution

We introduce metrics which are not based on sampling, and instead make use of rusage data reported by the OS on process exit.

This captures short-lived processes much better, but long-lived processes are not included until they exit (see Pitfalls below).
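To illustrate the idea of reading rusage on process exit, here is a minimal sketch using only the Go standard library. It is not the code from this patch: the commandStats type and the runAndMeasure helper are hypothetical names, and the *syscall.Rusage assertion assumes a Unix-like platform.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// commandStats holds the subset of rusage fields this sketch extracts.
// The type and its field names are hypothetical, not taken from the patch.
type commandStats struct {
	CPUSeconds      float64
	RealSeconds     float64
	MinorPageFaults int64
	MajorPageFaults int64
}

// runAndMeasure runs a command to completion and reads the resource usage
// that the OS reported when the child process was reaped.
func runAndMeasure(name string, args ...string) (commandStats, error) {
	cmd := exec.Command(name, args...)
	start := time.Now()
	if err := cmd.Run(); err != nil {
		return commandStats{}, err
	}
	elapsed := time.Since(start)

	// On Unix-like systems SysUsage returns a *syscall.Rusage.
	ru, ok := cmd.ProcessState.SysUsage().(*syscall.Rusage)
	if !ok {
		return commandStats{}, fmt.Errorf("rusage is not available on this platform")
	}

	// CPU time is user time plus system time of the exited child.
	cpu := cmd.ProcessState.UserTime() + cmd.ProcessState.SystemTime()
	return commandStats{
		CPUSeconds:      cpu.Seconds(),
		RealSeconds:     elapsed.Seconds(),
		MinorPageFaults: int64(ru.Minflt),
		MajorPageFaults: int64(ru.Majflt),
	}, nil
}

func main() {
	stats, err := runAndMeasure("true")
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Printf("cpu=%.3fs real=%.3fs minflt=%d majflt=%d\n",
		stats.CPUSeconds, stats.RealSeconds, stats.MinorPageFaults, stats.MajorPageFaults)
}
```

Because the numbers come from the kernel's accounting at exit rather than from periodic sampling, even a process that lives for a few milliseconds is fully accounted for.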

Implementation

This patch introduces a set of gitaly_command_* metrics which provide aggregated resource attribution along the following dimensions:

cmd - the basename of the command being executed
subcmd - an optional subcommand, e.g. archive for git archive
grpc_service - the gRPC service of the calling RPC
grpc_method - the gRPC method of the calling RPC

The newly introduced metrics are:

gitaly_command_cpu_seconds_total Sum of CPU time spent by shelling out
gitaly_command_real_seconds_total Sum of real time spent by shelling out
gitaly_command_minor_page_faults_total Sum of minor page faults ...
gitaly_command_major_page_faults_total Sum of major page faults ...
gitaly_command_signals_received_total Sum of signals received ...
gitaly_command_context_switches_total Sum of context switches performed ...

All of the metrics are counters.
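As a rough sketch of how counters aggregate along these dimensions, the following stdlib-only example keeps one monotonically increasing sum per label combination. The counterVec type is a hypothetical stand-in for a labelled Prometheus counter, not the implementation in this patch, and the label values are purely illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// labels mirrors the four dimensions listed above.
type labels struct {
	Cmd, Subcmd, GRPCService, GRPCMethod string
}

// counterVec is a hypothetical stand-in for a labelled counter:
// one running sum per distinct label combination.
type counterVec struct {
	mu     sync.Mutex
	values map[labels]float64
}

func newCounterVec() *counterVec {
	return &counterVec{values: make(map[labels]float64)}
}

// Add increases the series identified by l. Negative deltas are rejected
// because counters must never decrease.
func (c *counterVec) Add(l labels, delta float64) error {
	if delta < 0 {
		return fmt.Errorf("counter cannot decrease: delta %v", delta)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[l] += delta
	return nil
}

// Value returns the current sum for the series identified by l.
func (c *counterVec) Value(l labels) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[l]
}

func main() {
	cpuSeconds := newCounterVec() // plays the role of gitaly_command_cpu_seconds_total

	// Illustrative label values only.
	archive := labels{Cmd: "git", Subcmd: "archive", GRPCService: "RepositoryService", GRPCMethod: "GetArchive"}

	// Two finished git archive processes report their CPU time on exit.
	_ = cpuSeconds.Add(archive, 0.5)
	_ = cpuSeconds.Add(archive, 1.25)

	fmt.Printf("%.2f\n", cpuSeconds.Value(archive)) // prints 1.75
}
```

Since every series is a running sum, consumers would typically look at its rate of change over time rather than at the absolute value.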

Pitfalls

There are a few gotchas with these metrics:

Scope: Gitaly can only report what it controls, so we only include processes spawned by gitaly itself. Other short-lived processes on the box are not tracked.
Late reporting: Because we collect the cumulative resource usage only once a child process exits, long-lived processes can cause large jumps in the metrics. Their usage also shows up late: a long-lived process that has not yet exited does not appear in the metrics at all.
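The late-reporting effect can be made concrete with a small simulation. This is only a sketch under simplified assumptions: a single process that exits at the end of the last scrape interval, with the hypothetical helper scrapeDeltas returning the counter increase visible in each interval.

```go
package main

import "fmt"

// scrapeDeltas models a counter that only learns about a process's CPU time
// when the process exits. cpuPerInterval holds the CPU seconds the process
// actually burned in each scrape interval; the process is assumed to exit at
// the end of the last interval. The result is the counter increase observed
// per interval.
func scrapeDeltas(cpuPerInterval []float64) []float64 {
	deltas := make([]float64, len(cpuPerInterval))
	var total float64
	for _, c := range cpuPerInterval {
		total += c
	}
	if len(deltas) > 0 {
		// Everything is attributed to the interval in which the process exits.
		deltas[len(deltas)-1] = total
	}
	return deltas
}

func main() {
	// A process that uses 15 CPU seconds in each of four scrape intervals.
	fmt.Println(scrapeDeltas([]float64{15, 15, 15, 15})) // prints [0 0 0 60]
}
```

The total is correct, but a rate computed over the last interval alone would overstate the actual usage in that interval, while the earlier intervals show nothing.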

Rollout considerations

This feature is being introduced behind a feature flag. However, metrics are sticky: once a metric has been defined, the process keeps returning it until the next restart, even if the flag is disabled afterwards.

The cardinality of the metrics should be relatively well-bounded in any case.

Acknowledgements

Thanks to @msmiley for discovering the difference in reporting, and for providing early input on this patch.

Edited Apr 05, 2022 by Igor
