Use Prometheus to Query Runner Metrics Linked to Each Job
What does this MR do?
This MR causes gitlab-runner, after each job/build, to pull a time range of metrics from an available Prometheus server that is set to scrape metrics from runner instances. Metrics over this time range are JSON-ified and sent to GitLab as a raw artifact associated with the job.
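As a rough illustration of the "time range" part, the sketch below builds a request URL for Prometheus's standard `/api/v1/query_range` HTTP endpoint from a job's start and end times. The names `jobWindow`, `windowFromUnix`, and `rangeQueryURL`, the server address, and the example metric are all assumptions made for illustration; this is not necessarily how the MR issues its queries.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// jobWindow is a hypothetical holder for the time range a job ran in.
type jobWindow struct {
	start, end time.Time
}

// windowFromUnix builds a jobWindow from Unix timestamps in seconds.
func windowFromUnix(startSec, endSec int64) jobWindow {
	return jobWindow{start: time.Unix(startSec, 0), end: time.Unix(endSec, 0)}
}

// rangeQueryURL returns a URL for Prometheus's /api/v1/query_range endpoint
// that evaluates promQL over the job window at the given step.
// Any path already present in the server address is overwritten.
func rangeQueryURL(server, promQL string, w jobWindow, stepSeconds int) (string, error) {
	u, err := url.Parse(server)
	if err != nil {
		return "", fmt.Errorf("parsing Prometheus address: %w", err)
	}
	u.Path = "/api/v1/query_range"

	q := url.Values{}
	q.Set("query", promQL)
	q.Set("start", strconv.FormatInt(w.start.Unix(), 10))
	q.Set("end", strconv.FormatInt(w.end.Unix(), 10))
	q.Set("step", strconv.Itoa(stepSeconds))
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	return u.String(), nil
}

func main() {
	// Assumed values: a 60-second job sampled every 15 seconds.
	w := windowFromUnix(1000, 1060)
	u, err := rangeQueryURL(
		"http://prometheus.example:9090",
		`node_load1{instance="shared-runner-1234"}`,
		w, 15,
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

The JSON body returned by that endpoint would then be reduced to whatever shape the artifact needs before upload.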
Why was this MR needed?
Metrics are a distilled version of logs. Much like traces, metrics play an essential role in determining how a particular CI/CD job performed. This MR is needed to link a job to the metrics generated by runner nodes using Prometheus servers in the same production environment. These saved metrics can be used to display performance graphs to end users, as covered in https://gitlab.com/gitlab-org/gitlab-ce/issues/58921. They can also be used by the GitLab security team to detect various forms of abuse, as covered in https://gitlab.com/gitlab-com/gl-security/abuse/issues/83.
Why was this design chosen?
Gitlab-runner is currently responsible for running jobs, updating job status, and collecting traces from each run. This MR adds metrics collection alongside log collection to provide a complete picture of what happened on the runner instance. Implementing this feature in the gitlab-runner golang codebase with queries to Prometheus infrastructure made sense for the following reasons:
- We already collect and ship logs in gitlab-runner, and metrics are a very similar concept; it makes sense for job artifacts to come from gitlab-runner rather than from multiple sources
- Prometheus is well understood at GitLab and we already have production Prometheus instances that are scraping metrics from our shared runners
- Go provides goroutines, making it easy to query these metrics and ship them to GitLab as artifacts without slowing down running jobs
- The artifact upload API was already implemented in the network code of gitlab-runner; this API is standard to GitLab CE and works well with metrics JSON data
- The Prometheus client API dependency was already present in gitlab-runner as it is bundled with their DIY exporter libraries
- Gitlab-runner knows which job runs on which node during which time range, making it easy to query this information precisely from Prometheus
- What is a goroutine with access to all the necessary query data in gitlab-runner would become a Ruby Sidekiq job that would need to be populated with runner data if this were implemented in GitLab CE
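To make the goroutine argument concrete, here is a minimal sketch of querying and uploading in the background so that the caller is not blocked. The function name `shipMetricsAsync` and the `query`/`upload` callbacks are hypothetical stand-ins for the Prometheus query and the artifact upload call; the MR's actual code may be organized differently.

```go
package main

import "fmt"

// shipMetricsAsync runs query and then upload in a separate goroutine and
// returns immediately. The returned channel receives exactly one value: nil
// on success, or the first error encountered.
func shipMetricsAsync(
	jobID int64,
	query func() ([]byte, error),
	upload func(jobID int64, payload []byte) error,
) <-chan error {
	// Buffered so the goroutine can finish even if nobody reads the result.
	done := make(chan error, 1)
	go func() {
		payload, err := query()
		if err != nil {
			done <- fmt.Errorf("querying metrics for job %d: %w", jobID, err)
			return
		}
		done <- upload(jobID, payload)
	}()
	return done
}

func main() {
	// Stub callbacks standing in for Prometheus and the GitLab artifact API.
	query := func() ([]byte, error) { return []byte(`{"node_load1":[0.5]}`), nil }
	upload := func(jobID int64, payload []byte) error {
		fmt.Printf("uploading %d bytes for job %d\n", len(payload), jobID)
		return nil
	}

	done := shipMetricsAsync(42, query, upload)
	// The runner would be free to pick up other work here.
	if err := <-done; err != nil {
		fmt.Println("metrics upload failed:", err)
	}
}
```

Returning a channel rather than blocking keeps the decision about whether to wait for the upload with the caller.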
What executors does this support?
This MR currently supports metrics querying for docker-machine only, with an easy path forward to support other executors. To add support for another executor, add the GetMetricsLabelName() and GetMetricsLabelValue() functions to it.
- GetMetricsLabelName() returns the Prometheus label name (node parameter name) for the PromQL queries (e.g. "instance" for docker-machine)
- GetMetricsLabelValue() returns the Prometheus label value (node instance identifier) for the PromQL queries (e.g. "shared-runner-1234" for docker-machine)
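The two functions above could be used roughly as follows to scope a PromQL query to one node. Only the method names GetMetricsLabelName() and GetMetricsLabelValue() come from this MR; the `metricsLabeler` interface, the `dockerMachineExecutor` type, the `selectorFor` helper, and the example metric name are assumptions for illustration.

```go
package main

import "fmt"

// metricsLabeler is a hypothetical interface grouping the two functions an
// executor adds to support metrics querying.
type metricsLabeler interface {
	GetMetricsLabelName() string
	GetMetricsLabelValue() string
}

// dockerMachineExecutor is a stand-in type, not the runner's real executor.
type dockerMachineExecutor struct {
	host string
}

func (e dockerMachineExecutor) GetMetricsLabelName() string  { return "instance" }
func (e dockerMachineExecutor) GetMetricsLabelValue() string { return e.host }

// selectorFor renders metric{label="value"} so the query only matches
// series scraped from the node the job ran on.
func selectorFor(metric string, l metricsLabeler) string {
	// %q produces a double-quoted string with Go-style escaping.
	return fmt.Sprintf("%s{%s=%q}", metric, l.GetMetricsLabelName(), l.GetMetricsLabelValue())
}

func main() {
	e := dockerMachineExecutor{host: "shared-runner-1234"}
	fmt.Println(selectorFor("node_load1", e))
	// prints: node_load1{instance="shared-runner-1234"}
}
```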
Are there points in the code the reviewer needs to double check?
All of the committed changes.
Does this MR meet the acceptance criteria?
- Documentation created/updated
- Added tests for this feature/bug
- In case of conflicts with master, branch was rebased
What are the relevant issue numbers?
Activity
Hi @agroleau,
Please add labels to your merge request, this helps triage community merge requests.
Thanks for your help!
You are welcome to help improve this comment.
added auto updated label
added devopsverify grouprunner labels
added 66 commits
- 1c964fb6...600f92c6 - 65 commits from branch master
- 2a0e757e - merged in master
assigned to @ayufan
- Resolved by Alex Groleau
Hello @tmaczukin and @gitlab-rmitchell :)
I made another round of quick changes before considering a move of metrics querying to the GitLab CE codebase (using sidekiq jobs) that may alleviate some of the concerns of this process occurring in the runner managers:
- All references to "metrics collection" have been renamed to "metrics querying" to reduce confusion in the codebase about what's happening
- All Prometheus metrics queries have been moved to the YAML config and are defined by the user; this removes the need to upgrade the version of gitlab-runner in production; instead, the runner configuration can be modified whenever new metrics are desired
- Metrics configured in the runner YAML now support full PromQL and functional queries, making it easy to calculate metric rates and sums
- Interface functions for Prometheus label names and values have been added to Executor and ExecutorProvider to gate querying for executors that do not yet support metrics querying (which is everything but docker-machine for now)
- The Prometheus server address is now defined at the runner level in the config file to support our production shared-runner architecture
Thanks, Alex
mentioned in merge request !1455 (closed)
unassigned @ayufan
- Resolved by Alex Groleau
@ayufan Apologies for the lack of context. This MR seeks to solve two issues: https://gitlab.com/gitlab-org/gitlab-ce/issues/58921 and https://gitlab.com/gitlab-com/gl-security/abuse/issues/83. By saving metrics during job runs, we can ultimately display them to the user (alongside job traces) and use them for security-related purposes. These metrics are stored as a simple metric => values[] JSON blob and sent to GitLab as job artifacts. They could be compressed with gzip, but that might not be compatible with a yet-to-be-built customer-facing frontend that graphs metrics for each job.
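For reference, a metric => values[] blob of that kind could be produced along these lines. The `jobMetrics` type, the `encodeMetrics` helper, and the metric names are hypothetical, and the real artifact may carry more than bare values (timestamps, for example).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobMetrics maps a metric name to the values sampled during the job.
// This flat shape is an assumption based on the description above.
type jobMetrics map[string][]float64

// encodeMetrics serializes the metrics as a single JSON object.
// encoding/json writes map keys in sorted order, so output is stable.
func encodeMetrics(m jobMetrics) ([]byte, error) {
	return json.Marshal(m)
}

func main() {
	blob, err := encodeMetrics(jobMetrics{
		"memory_bytes": {1024, 2048},
		"cpu_seconds":  {0.5, 1.25},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(blob))
	// prints: {"cpu_seconds":[0.5,1.25],"memory_bytes":[1024,2048]}
}
```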
There are other security-related issues as well, such as the use of this data in various machine learning models (https://gitlab.com/gitlab-com/gl-security/engineering/issues/499), which are currently limited to project attributes. Additionally, we would love to create a Suricata exporter for Prometheus to collect metrics specific to network traffic (https://gitlab.com/gitlab-com/gl-security/engineering/issues/605). These runner changes would support these future enhancements, since any additional metrics can be added through configuration changes.
Edited by Alex Groleau