GitLab Runner Metrics granularity improvements
Description
I am writing this from the perspective of a GitLab admin, who is running a build farm.
Currently, the Prometheus metrics coming from the GitLab Runner process are process-wide. This means that if several runners are registered, all of their metrics are aggregated. Furthermore, there are no metrics that allow me to distinguish the projects or fork trees that builds belong to (tagging by project ID alone is insufficient, since a fork appears as a separate project ID).
This has several consequences:
- If I have several runners registered to cater to different kinds of build machines, I only see the overall statistics of the entire build farm. This is a problem if the build farm is a heterogeneous set of machines.
- If a large project with many forks is running many builds in parallel, I cannot determine whether that fork tree of projects is creating a disproportionate number of parallel jobs.
With this in mind, the following scenarios may occur, and it is not possible to easily determine what's going on:
- If a project requires a runner with a specific tag, and that part of the build farm cannot be scaled out (think jobs requiring physical hardware), can I figure out which user or project is using up all of it?
- Can I determine the job queueing times per runner, since demand among them will not be the same in the general case?
- Can I determine if a user is using too many instances at a time, or if a user's MR accidentally causes a build to take much longer than it ought to?
- How many jobs are in the queue for a given project/fork tree?
Proposal
The simplest approach at the moment is to annotate the Prometheus metrics with the project ID or base project ID. This will likely cause an explosion in the cardinality of the metrics, so alternatives are welcome.
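
To make the cardinality concern concrete, here is a rough sketch of one possible mitigation: label by runner and by the root of the fork tree rather than by every project ID, and fold any fork roots beyond a fixed cap into a shared "other" bucket. It is plain Python for illustration only and is not how GitLab Runner or any Prometheus client library works. The names `BoundedLabelCounter`, `fork_root`, `max_roots`, and the `forked_from` mapping are all hypothetical, and the sketch assumes the fork relationships are available from somewhere.

```python
from collections import Counter


def fork_root(project_id, forked_from):
    """Follow forked_from links up to the root of the fork tree.

    forked_from is a hypothetical mapping: fork project ID -> parent project ID.
    """
    seen = set()
    # The seen set guards against a malformed mapping that contains a cycle.
    while project_id in forked_from and project_id not in seen:
        seen.add(project_id)
        project_id = forked_from[project_id]
    return project_id


class BoundedLabelCounter:
    """Job counter keyed by (runner, fork_root) with a cap on distinct fork roots."""

    def __init__(self, max_roots):
        self.max_roots = max_roots
        self.known_roots = set()
        self.counts = Counter()

    def inc(self, runner, root):
        # Once the cap is reached, previously unseen fork roots share "other",
        # so the number of label combinations stays bounded.
        if root not in self.known_roots:
            if len(self.known_roots) < self.max_roots:
                self.known_roots.add(root)
            else:
                root = "other"
        self.counts[(runner, root)] += 1


# Hypothetical data: projects 101 and 102 are forks in the tree rooted at 100.
forked_from = {101: 100, 102: 101, 201: 200}
jobs = BoundedLabelCounter(max_roots=2)
for runner, project in [
    ("hw-runner", 101),
    ("hw-runner", 102),
    ("docker-runner", 201),
    ("hw-runner", 300),
]:
    jobs.inc(runner, str(fork_root(project, forked_from)))

for (runner, root), count in sorted(jobs.counts.items()):
    print(f"jobs_total{{runner={runner!r}, fork_root={root!r}}} {count}")
```

The trade-off in this sketch is that whichever fork trees are seen first get their own label values, which may not be the ones I care about most. An allowlist of fork roots chosen by the admin would be another way to bound the label set.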