Add queue duration histogram metric
One of the important metrics from the GitLab Runner and GitLab CI/CD maintenance perspective is how long a job stays queued in the `pending` state.
Currently we export a histogram metric from GitLab, with a general summary partitioned only by runner type: instance, group, or project. This gives a general view but doesn't allow more detailed analysis on a per-runner basis. And because the metrics are exported from GitLab, they are not available for group/project runners owned by GitLab users who are not GitLab instance administrators.
To fix that, I propose two changes:
- In GitLab - update the job scheduling code to use the same data that feeds the GitLab metrics and send it with the job payload to the Runner. This is being done in Expose queue duration related metrics in job pa... (gitlab!90653 - merged) • Tomasz Maczukin • 16.4.
- In GitLab Runner - consume this new information from the job payload and expose a histogram metric based on it. This is being done in Expose queueing duration histogram metric (!3499 - merged) • Tomasz Maczukin • 16.4.
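The Runner-side half of this proposal can be illustrated with a minimal, standard-library-only sketch. It is not the code from either merge request. The payload field `queue_duration_seconds` under `job_info`, the bucket bounds, and the class and function names are all hypothetical, chosen only to show the idea of reading a queue duration from the job payload and counting it into histogram buckets.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical bucket upper bounds, in seconds (not taken from the real implementation).
DEFAULT_BUCKETS = (1.0, 3.0, 10.0, 30.0, 60.0, 300.0, 900.0, 1800.0, 3600.0)


@dataclass
class QueueDurationHistogram:
    """Tiny histogram with 'less than or equal' bucket semantics."""

    buckets: tuple = DEFAULT_BUCKETS
    counts: list = field(default_factory=list)
    total: float = 0.0
    observations: int = 0

    def __post_init__(self):
        # One slot per finite bucket, plus a final slot for the +Inf bucket.
        self.counts = [0] * (len(self.buckets) + 1)

    def observe(self, seconds):
        # bisect_left returns the first bucket whose upper bound is >= seconds;
        # values above every bound land in the last (+Inf) slot.
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds
        self.observations += 1

    def cumulative(self):
        # Return (upper_bound, cumulative_count) pairs, ending with +Inf.
        running = 0
        result = []
        for bound, count in zip(self.buckets + (float("inf"),), self.counts):
            running += count
            result.append((bound, running))
        return result


def record_job(histogram, job_payload):
    """Observe the queue duration carried by a job payload, if present.

    The 'job_info' / 'queue_duration_seconds' keys are assumptions made
    for this sketch, not the actual payload schema.
    """
    value = job_payload.get("job_info", {}).get("queue_duration_seconds")
    if value is None:
        # A GitLab version that does not send the field yet: skip the job.
        return False
    histogram.observe(float(value))
    return True
```

Because each runner would keep its own histogram, the resulting metric is scoped per runner by construction, which is what makes the per-runner analysis described in this issue possible.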
Once both changes are merged and released, anyone who already tracks runner metrics will get a histogram of how long it took for jobs that finally landed on that runner to transition from `pending` to `running`.
This will allow self-hosted runner owners (in the case of GitLab.com: users who manage group and/or project runners) to finally get these metrics and adjust their configuration to fit their expectations. In the case of instance-owned runners (like the ones we operate for SaaS), it will allow us to analyze queuing performance for each runner individually, instead of mixing all instance, group, and project runners together globally.