GitLab Runner Metrics granularity improvements

Description

I am writing this from the perspective of a GitLab admin who is running a build farm.

Currently, the Prometheus metrics exposed by the GitLab Runner process are process-wide. This means that if several runners are registered, all metrics are aggregated. Furthermore, there are no metrics that allow me to distinguish between projects or fork trees of builds (tagging by project ID alone is insufficient, since each fork appears as a separate project ID).

This has several consequences:

If I have several runners registered to cater to different kinds of build machines, I only see the overall statistics of the entire build farm. This is a problem if the build farm is a heterogeneous set of machines.
If a large project with many forks is performing many builds in parallel, I cannot determine whether that fork tree of projects is creating a disproportionate number of parallel jobs.
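
To illustrate the first point, here is a minimal Python sketch (standard library only) of how the same job events look as a single process-wide count versus a count split by a `runner` label. The runner names, the sample events, and the metric name `example_jobs_total` are all made up for illustration; they are not actual GitLab Runner metric names.

```python
from collections import Counter

# Hypothetical job events: (runner name, job duration in seconds).
events = [
    ("gpu-runner", 1200),
    ("gpu-runner", 900),
    ("docker-runner", 60),
    ("docker-runner", 45),
    ("docker-runner", 75),
]

# Process-wide view: one number for the whole build farm.
process_wide_jobs = len(events)

# Per-runner view: the same events, keyed by a "runner" label.
jobs_by_runner = Counter(runner for runner, _ in events)


def render(name, samples):
    """Render samples in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to a number.
    An empty tuple means the sample has no labels.
    """
    lines = []
    for labels, value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)


# Aggregated: a single series, no way to tell the machine types apart.
print(render("example_jobs_total", {(): process_wide_jobs}))

# Labelled: one series per runner.
print(render(
    "example_jobs_total",
    {(("runner", r),): n for r, n in jobs_by_runner.items()},
))
```

With the labelled form, a heterogeneous farm shows up as separate series, so a slow GPU pool is no longer averaged together with fast Docker machines.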

With this in mind, the following questions may come up, and there is currently no easy way to answer them:

If a project requires a runner with a specific tag, and that part of the build farm cannot be scaled out (think jobs requiring physical hardware), can I figure out which user or project is using up the entire build farm?
Can I determine the job queueing times per runner, since demand will generally differ from one runner to another?
Can I determine if a user is using too many instances at a time, or if a user's MR accidentally causes a build to take much longer than it ought to?
How many jobs are in the queue for a given project/fork tree?

Proposal

The simplest approach currently is to annotate the Prometheus metrics with the project ID and the base project ID (the root of the fork tree). This will likely cause an explosion in the cardinality of the metrics, so alternatives are welcome.
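
The cardinality concern can be made concrete. In Prometheus, each distinct combination of label values is its own time series, so the number of series for one metric is bounded above by the product of the number of distinct values per label. The sketch below uses made-up label names and counts (10 runners, 4 job states, 5000 projects, 200 fork-tree roots) purely to show the arithmetic; none of these figures come from a real installation.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric.

    `label_values` maps a label name to its number of distinct values.
    The bound is the product of those counts.
    """
    return prod(label_values.values())


# Hypothetical label sizes for a single metric.
per_runner = {"runner": 10, "state": 4}
per_project = {"runner": 10, "state": 4, "project_id": 5000}
per_fork_tree = {"runner": 10, "state": 4, "root_project_id": 200}

print(max_series(per_runner))     # 40
print(max_series(per_project))    # 200000
print(max_series(per_fork_tree))  # 8000
```

This is only an upper bound, since not every project runs on every runner. Under these assumed numbers, labelling by the root of the fork tree alone stays far below labelling by every individual project ID, which may be one way to limit the growth.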

Edited Aug 14, 2020 by 🤖 GitLab Bot 🤖