Use Prometheus to Query Runner Metrics Linked to Each Job (!1545) · Merge requests · GitLab.org / gitlab-runner

Alex Groleau requested to merge collect-prometheus into master Aug 20, 2019

What does this MR do?

This MR makes gitlab-runner, after each job/build, pull a time range of metrics from an available Prometheus server that is configured to scrape metrics from runner instances. The metrics over this time range are JSON-ified and sent to GitLab as a raw artifact associated with the job.
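
As a rough illustration of pulling a time range of metrics, the sketch below builds a range-query URL for the Prometheus HTTP API (`/api/v1/query_range`). It is not the code in this MR: the MR relies on the Prometheus client API, whereas this sketch uses only the Go standard library. The function name `rangeQueryURL`, the server address, and the example query are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// rangeQueryURL returns the URL of a Prometheus HTTP API range query
// (GET /api/v1/query_range) covering startUnix..endUnix, both given in
// Unix seconds, sampled every stepSeconds seconds.
// Hypothetical helper, not part of gitlab-runner.
func rangeQueryURL(baseURL, promQL string, startUnix, endUnix int64, stepSeconds int) string {
	params := url.Values{}
	params.Set("query", promQL)
	params.Set("start", strconv.FormatInt(startUnix, 10))
	params.Set("end", strconv.FormatInt(endUnix, 10))
	params.Set("step", strconv.Itoa(stepSeconds))
	// Encode escapes the PromQL expression and sorts parameters by key.
	return baseURL + "/api/v1/query_range?" + params.Encode()
}

func main() {
	// Assumed values: server address, query, and a one-minute job window.
	u := rangeQueryURL(
		"http://prometheus.example:9090",
		`up{instance="shared-runner-1234"}`,
		1566302400, 1566302460, 15,
	)
	fmt.Println(u)
}
```

In a runner, the start and end values would presumably come from the job's start and finish times (for example via `time.Time.Unix()`), and the JSON body of the response would be what gets uploaded as the artifact.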

Why was this MR needed?

Metrics are a distilled version of logs. Much like traces, metrics play an essential role in determining how a particular CI/CD job performed. This MR is needed to link a job to the metrics generated by runner nodes, using Prometheus servers in the same production environment. These saved metrics can be used to display performance graphs to end users, as covered in https://gitlab.com/gitlab-org/gitlab-ce/issues/58921. They can also be used by the GitLab security team to detect various forms of abuse, as covered in https://gitlab.com/gitlab-com/gl-security/abuse/issues/83.

Why was this design chosen?

Currently, gitlab-runner is responsible for running jobs, updating job status, and collecting traces from each run. This MR adds metrics collection alongside log collection to provide a complete picture of what happened on the runner instance. Implementing this feature in the gitlab-runner Go codebase, with queries to Prometheus infrastructure, made sense for the following reasons:

We already collect and ship logs in gitlab-runner, and metrics are a very similar concept; it makes more sense for job artifacts to come from gitlab-runner alone than from multiple sources.
Prometheus is well understood at GitLab, and we already have production Prometheus instances scraping metrics from our shared runners.
Go provides goroutines, making it easy to query these metrics and ship them to GitLab as artifacts without slowing down running jobs.
The artifact upload API was already implemented in the network code of gitlab-runner; this API is standard to GitLab CE and works well with metrics JSON data.
The Prometheus client API dependency was already present in gitlab-runner, as it is bundled with Prometheus's DIY exporter libraries.
gitlab-runner knows which job is running on which node over which time range, making it easy to query exactly this information from Prometheus.
What is a goroutine with access to all the necessary query data in gitlab-runner would become a Ruby Sidekiq job that needs to be populated with runner data if this were implemented in GitLab CE.
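
The goroutine-based approach described in the reasons above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MR's implementation: the `sample` type, the `collectAsync` function, and the `query` and `upload` callbacks are hypothetical stand-ins for the Prometheus query and the artifact upload API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is a hypothetical (timestamp, value) pair returned by a metrics query.
type sample struct {
	Timestamp int64   `json:"timestamp"`
	Value     float64 `json:"value"`
}

// collectAsync runs query in a goroutine, JSON-encodes the samples, and hands
// the payload to upload. The returned channel receives exactly one value:
// nil on success, or the first error encountered.
func collectAsync(query func() ([]sample, error), upload func([]byte) error) <-chan error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		samples, err := query()
		if err != nil {
			done <- err
			return
		}
		payload, err := json.Marshal(samples)
		if err != nil {
			done <- err
			return
		}
		done <- upload(payload)
	}()
	return done
}

func main() {
	// Stand-ins for the real Prometheus query and artifact upload.
	query := func() ([]sample, error) {
		return []sample{{Timestamp: 1566302400, Value: 0.42}}, nil
	}
	upload := func(payload []byte) error {
		fmt.Printf("uploading %s\n", payload)
		return nil
	}

	done := collectAsync(query, upload)
	// The caller is free to do other work here before waiting.
	if err := <-done; err != nil {
		fmt.Println("metrics collection failed:", err)
	}
}
```

Because the query and upload run in their own goroutine, the caller only blocks when it chooses to read from the returned channel.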

What executors does this support?

This MR currently supports metrics querying for docker-machine only, with an easy path forward to support other executors. To add support for another executor, add the GetMetricsLabelName() and GetMetricsLabelValue() functions to it.

GetMetricsLabelName() returns the Prometheus label name (node parameter name) for the PromQL queries (e.g. "instance" for docker-machine).
GetMetricsLabelValue() returns the Prometheus label value (node instance identifier) for the PromQL queries (e.g. "shared-runner-1234" for docker-machine).
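
A minimal sketch of how these two functions might feed a PromQL selector is shown below. Only the method names GetMetricsLabelName() and GetMetricsLabelValue() come from this MR; their exact signatures, the `metricsLabeler` interface, the `dockerMachineExecutor` type, the `selectorFor` helper, and the `node_load1` metric name are assumptions made for illustration.

```go
package main

import "fmt"

// metricsLabeler is a hypothetical interface grouping the two functions an
// executor would need in order to support metrics querying.
type metricsLabeler interface {
	GetMetricsLabelName() string
	GetMetricsLabelValue() string
}

// dockerMachineExecutor is a stand-in type, not the real executor.
type dockerMachineExecutor struct {
	instanceName string
}

// GetMetricsLabelName returns the Prometheus label used to identify the node.
func (e *dockerMachineExecutor) GetMetricsLabelName() string { return "instance" }

// GetMetricsLabelValue returns the identifier of the node running the job.
func (e *dockerMachineExecutor) GetMetricsLabelValue() string { return e.instanceName }

// selectorFor restricts a metric to the executor's node by building a
// PromQL selector of the form metric{label="value"}.
func selectorFor(metric string, l metricsLabeler) string {
	return fmt.Sprintf("%s{%s=%q}", metric, l.GetMetricsLabelName(), l.GetMetricsLabelValue())
}

func main() {
	e := &dockerMachineExecutor{instanceName: "shared-runner-1234"}
	fmt.Println(selectorFor("node_load1", e))
}
```

Under these assumptions, supporting a new executor would only require it to satisfy the two-method interface; the query-building code would stay unchanged.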

Are there points in the code the reviewer needs to double check?

All of the committed changes.

Does this MR meet the acceptance criteria?

Documentation created/updated
Added tests for this feature/bug
In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

Edited Nov 21, 2019 by Alex Groleau
