Connect metrics collected with Prometheus on the Runner's machine with the job
Recently I was asked whether it's possible to check how much memory (or CPU) was used during a job's execution. The case was that a user's script, executed on GitLab.com's Shared Runners, was failing at some point and everything suggested that the failure was caused by hitting the available memory limit.
The fact is that with our Shared Runners we are able to check this. We have a fleet of Prometheus servers that track the node_exporter metrics of the machines created by our autoscalers to handle the jobs. The servers are distributed across the GCP regions that we use, and each tracks the machines existing in its own region. Knowing which Runner creates machines in which region, how to connect a job with the specific autoscaled machine, and when the job was executed, one is able to get some useful metrics.
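For illustration, here is a minimal sketch (in Go) of what such a manual lookup could look like today, using Prometheus' range-query HTTP API. The server URL, the instance label value, the time window and the metric name are placeholders and assumptions for the example (the metric name assumes node_exporter >= 0.16), not the actual values used on GitLab.com's infrastructure.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// Sketch: ask the regional Prometheus server for the memory available on one
// autoscaled machine during the job's time window.
func main() {
	// Hypothetical instance label of the autoscaled machine; the real label
	// depends on how node_exporter targets are configured.
	query := `node_memory_MemAvailable_bytes{instance="runner-autoscale-abc123:9100"}`

	// Roughly the job's duration (example values).
	end := time.Now()
	start := end.Add(-30 * time.Minute)

	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprintf("%d", start.Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", "15")

	resp, err := http.Get("http://prometheus.example.internal:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON matrix of (timestamp, value) pairs
}
```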
When talking about this, I thought it should be quite easy to turn this into a feature provided by GitLab:
- In the Runner configuration on the GitLab side, we allow the user to define which Prometheus server tracks the metrics for the machines that this Runner is using. Like the other options, we allow this to be passed through registration, so it can be fully automated (e.g. look at our plan for the K8S deployment of Runner Managers - no place for manual creation of managers!).
- When the job is started, we get the hostname of the machine that handles the job. We can get this either from the trace or try to discover it with Runner (e.g. when using the `docker+machine` executor, we get the hostname from the machine metadata). Next, we send the hostname to GitLab together with the other metadata when updating the job/patching the trace.
- If a Prometheus server is set for the Runner that handles the job, we start querying it for a few metrics, like: used RAM, used CPU, load, network throughput, used disk space.
- We create a "system metrics" tab (but only when metrics for the job are available), and at any moment the user can switch there and check how the utilization of system resources changes during the job.
- When the job is finished, we query Prometheus for the metrics in the time scope of the job and save them locally in some format usable for graphing in GitLab's UI (see the sketch after this list). Thanks to this, we can show historical data on the job page and we are not affected by the data retention configured for the Prometheus server. Such a resource usage profile could also be downloaded by the user and analyzed with their own tools, or used by GitLab administrators to analyze overall usage trends, which may be useful for infrastructure capacity planning.
- In the future, when we finally have usable sections in the job's trace UI, we could add a connection between them and the metrics tab. E.g. the user is looking at the trace at a command that - in their opinion - took too much time. They click on the "section bar" for that command, choose a "system metrics" button on that bar, and are moved to the metrics tab, where the specified time frame (the time frame of the command section) is highlighted.
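To make the collection steps above a bit more concrete, below is a rough sketch of how the collection side could query the Runner's Prometheus server for the metrics mentioned in the list, limited to the job's time scope, and store the raw responses for later graphing. Everything specific here is an assumption for illustration only: the server URL, the instance label value, the file naming, and the exact PromQL expressions (which also depend on the node_exporter version).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"time"
)

// Candidate PromQL expressions for the metrics mentioned above (used RAM,
// used CPU, load, network throughput, used disk space). The %s placeholder
// is the instance label of the autoscaled machine; metric names assume
// node_exporter >= 0.16 and may need adjusting.
var jobMetrics = map[string]string{
	"memory_available": `node_memory_MemAvailable_bytes{instance="%s"}`,
	"cpu_usage":        `sum(rate(node_cpu_seconds_total{mode!="idle",instance="%s"}[1m]))`,
	"load_1m":          `node_load1{instance="%s"}`,
	"network_rx":       `sum(rate(node_network_receive_bytes_total{instance="%s"}[1m]))`,
	"disk_available":   `node_filesystem_avail_bytes{instance="%s",mountpoint="/"}`,
}

// collectJobMetrics runs a range query for every metric, scoped to the time
// frame of the finished job, and stores each raw JSON response on disk so it
// can be graphed later, independently of Prometheus' data retention.
func collectJobMetrics(promURL, instance string, start, end time.Time) error {
	for name, tmpl := range jobMetrics {
		params := url.Values{}
		params.Set("query", fmt.Sprintf(tmpl, instance))
		params.Set("start", strconv.FormatInt(start.Unix(), 10))
		params.Set("end", strconv.FormatInt(end.Unix(), 10))
		params.Set("step", "15")

		resp, err := http.Get(promURL + "/api/v1/query_range?" + params.Encode())
		if err != nil {
			return err
		}

		out, err := os.Create(fmt.Sprintf("job-metrics-%s.json", name))
		if err != nil {
			resp.Body.Close()
			return err
		}
		_, err = io.Copy(out, resp.Body)
		resp.Body.Close()
		out.Close()
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Example invocation with placeholder values for the Prometheus server,
	// the machine's instance label and the job's time scope.
	end := time.Now()
	start := end.Add(-20 * time.Minute)
	if err := collectJobMetrics("http://prometheus.example.internal:9090",
		"runner-autoscale-abc123:9100", start, end); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Storing the raw range-query responses is just one possible format, but it illustrates the point of the collection step: once the data is saved with the job, the profile shown on the job page no longer depends on the retention configured for the Prometheus server.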
Such a resource usage profile gives users very useful information. With it, one can answer questions like:
- Is the environment used for building/testing the software powerful enough? Or maybe it's too powerful and some cost savings could be made?
- Is the change adding performance improvements or making things worse? I remember that recently we were also talking about tracking such information over time - this could be a good source of that data.
- Is the job working properly? Does it hang because of a bug in the code or because of a lack of resources?