
Expose queue duration related metrics in job payload sent to the runner

Tomasz Maczukin requested to merge add-queued-for-to-job-payload into master

What does this MR do and why?

This MR adds the queued_for and project_jobs_running_on_instance_runners_count fields to the job_info section of the job payload sent to GitLab Runner.
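As a rough illustration of how a consumer might read the two new fields, here is a minimal Go sketch that decodes them from a job payload. The struct names, the helper decodeJobInfo, the sample payload, and the Go types (a float for queued_for, an integer for the count) are all assumptions made for this example; the MR text does not specify the wire types, and the real job_info section carries more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JobInfo mirrors only the two fields discussed in this MR.
// ASSUMPTION: the Go types below are guesses for illustration;
// the actual payload types are not stated in the MR description.
type JobInfo struct {
	QueuedFor                                float64 `json:"queued_for"`
	ProjectJobsRunningOnInstanceRunnersCount int     `json:"project_jobs_running_on_instance_runners_count"`
}

// JobResponse is a hypothetical wrapper holding just the job_info section.
type JobResponse struct {
	JobInfo JobInfo `json:"job_info"`
}

// decodeJobInfo extracts the job_info section from a raw JSON payload.
func decodeJobInfo(payload []byte) (JobInfo, error) {
	var resp JobResponse
	if err := json.Unmarshal(payload, &resp); err != nil {
		return JobInfo{}, err
	}
	return resp.JobInfo, nil
}

func main() {
	// Made-up payload, not captured from a real GitLab instance.
	payload := []byte(`{"job_info": {"queued_for": 12.5, "project_jobs_running_on_instance_runners_count": 3}}`)

	info, err := decodeJobInfo(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("queued for %.1fs, %d project jobs on instance runners\n",
		info.QueuedFor, info.ProjectJobsRunningOnInstanceRunnersCount)
}
```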

queued_for is the difference between the time the response is generated and the time the job was scheduled for queueing (stored in the queued_at field of the database record). With it, Runner will know how long the job has been in the queue.
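The calculation described above is a timestamp subtraction. The sketch below shows it in Go for illustration only: the actual change lives in the GitLab codebase, and the function names here are hypothetical. Clamping negative values to zero (to guard against clock skew) is an assumption of this example, not something the MR states.

```go
package main

import (
	"fmt"
	"time"
)

// queuedFor returns the time elapsed between queuedAt and now.
// ASSUMPTION: negative results (for example from clock skew) are
// clamped to zero; the MR does not describe this case.
func queuedFor(queuedAt, now time.Time) time.Duration {
	d := now.Sub(queuedAt)
	if d < 0 {
		return 0
	}
	return d
}

// queuedForSeconds is a hypothetical convenience wrapper that takes
// Unix timestamps (whole seconds) and returns the wait in seconds.
func queuedForSeconds(queuedAtUnix, nowUnix int64) float64 {
	return queuedFor(time.Unix(queuedAtUnix, 0), time.Unix(nowUnix, 0)).Seconds()
}

func main() {
	// Pretend the job was queued 90 seconds before the response is built.
	now := time.Now()
	queuedAt := now.Add(-90 * time.Second)
	fmt.Println("queued_for:", queuedFor(queuedAt, now))
}
```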

This value will then be used on the GitLab Runner side to generate a histogram metric representing queueing times for each specific runner and [[runners]] worker that asks for jobs. This will allow runner administrators to track queueing times of jobs targeting their runners.
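To make the idea of a per-runner histogram concrete, here is a small standard-library-only Go sketch that keeps cumulative bucket counts per runner label, in the style of the Prometheus histogram data model. It is not the GitLab Runner implementation (which would more plausibly use a Prometheus client library); the type names, bucket bounds, and runner labels are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram keeps cumulative bucket counts: counts[i] is the number of
// observations <= bounds[i], and the final entry is the +Inf bucket.
type Histogram struct {
	bounds []float64
	counts []uint64
}

// NewHistogram copies and sorts the upper bounds, adding a +Inf bucket.
func NewHistogram(bounds []float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records one value in every bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	first := sort.SearchFloat64s(h.bounds, v) // first index with bounds[i] >= v
	for i := first; i < len(h.counts); i++ {
		h.counts[i]++
	}
}

// QueueMetrics holds one histogram per runner label.
type QueueMetrics struct {
	bounds   []float64
	byRunner map[string]*Histogram
}

func NewQueueMetrics(bounds []float64) *QueueMetrics {
	return &QueueMetrics{bounds: bounds, byRunner: make(map[string]*Histogram)}
}

// Observe records a queue duration (in seconds) for the given runner.
func (m *QueueMetrics) Observe(runner string, seconds float64) {
	h, ok := m.byRunner[runner]
	if !ok {
		h = NewHistogram(m.bounds)
		m.byRunner[runner] = h
	}
	h.Observe(seconds)
}

// Counts returns a copy of the cumulative bucket counts for a runner,
// or nil if nothing has been observed for it.
func (m *QueueMetrics) Counts(runner string) []uint64 {
	h, ok := m.byRunner[runner]
	if !ok {
		return nil
	}
	return append([]uint64(nil), h.counts...)
}

func main() {
	// Hypothetical bucket bounds in seconds.
	m := NewQueueMetrics([]float64{1, 10, 60})
	m.Observe("runner-a", 0.5)
	m.Observe("runner-a", 30)
	m.Observe("runner-b", 120)
	fmt.Println("runner-a:", m.Counts("runner-a"))
	fmt.Println("runner-b:", m.Counts("runner-b"))
}
```

Because each runner process would export only its own series, the label set stays small on any one Prometheus server.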

Similarly, project_jobs_running_on_instance_runners_count, sent for jobs targeting instance runners (and only instance type runners!), will allow runner administrators to see how scheduled jobs fit into the fair scheduling algorithm.

We will use this information, for example, to improve our SLI definitions for SaaS runners on GitLab.com. Currently our apdex is the same for all the runner types that we manage, because it's calculated from a general metric exposed by GitLab, which by design doesn't partition such information per runner.

With this change and the planned GitLab Runner change, each runner owner will be able to track such data alongside other runner metrics. For us, this means that we will be able to define different SLIs for each SaaS runner shard that we maintain.

The biggest value of this change, however, is that this metric becomes usable for self-hosted GitLab Runner instances. On GitLab installations like GitLab.com, individual users who self-host their runners and would like to track queuing performance are unable to do that, as GitLab internal metrics are... well... internal 😉 They are available only to GitLab instance administrators (so us in the case of GitLab.com). And a global metric like job_queue_duration_seconds exposed by GitLab can't be partitioned by individual Runner ID, as the cardinality of such data would quickly kill any Prometheus server.

By passing this data to Runner and exposing it there, each Runner owner can track the queuing times of their own runner instances, without GitLab administrators needing to expose GitLab's metrics, and with the data partitioned by each individual runner.

GitLab Runner update at gitlab-runner!3499 (merged).

Screenshots or screen recordings

N/A

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Grzegorz Bizon
