
Expose a duration histogram for the runner prepare stage

Status update: 2024-11-13

  • Development work on this feature is now slated for the 17.9 (Feb 2025) release.

Overview

This feature adds a new Prometheus histogram metric that records the duration of the prepare stage, the stage in which the CI/CD job environment is prepared.

Problem(s) to solve

  • A customer that uses Kubernetes to host the CI/CD build environment (runners) and runs ~200k CI/CD jobs per day has found that the pod provisioning step (prepare environment) can take more than 3 minutes. The estimate is that this affects ~10% of the daily CI/CD jobs. This customer therefore needs visibility into the duration trends for the prepare stage to determine how to adjust the compute resources allocated to the Kubernetes cluster(s).

Proposal

  • Add a histogram metric that records the duration of preparing the CI/CD job environment.
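
To illustrate what such a metric would let an operator answer, here is a minimal sketch of a bucketed duration histogram. It is a simplified stand-in written with the standard library only, not the Prometheus client and not the runner's implementation. The bucket boundaries, the class name `DurationHistogram`, and the sample durations are all assumptions made up for the example.

```python
import bisect
import math

# Hypothetical bucket upper bounds in seconds; the issue does not specify any.
PREPARE_BUCKETS = [5, 15, 30, 60, 120, 180, 300]


class DurationHistogram:
    """Simplified stand-in for a Prometheus-style histogram (not the real client)."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches every observation above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)  # per-bucket, non-cumulative
        self.total = 0.0  # sum of all observed durations
        self.count = 0    # number of observations

    def observe(self, seconds):
        # Pick the first bucket whose upper bound is >= the observed value.
        idx = bisect.bisect_left(self.upper_bounds, seconds)
        self.counts[idx] += 1
        self.total += seconds
        self.count += 1

    def cumulative(self):
        # Prometheus-style buckets are cumulative: each one includes all smaller ones.
        out, running = [], 0
        for bound, n in zip(self.upper_bounds, self.counts):
            running += n
            out.append((bound, running))
        return out

    def fraction_over(self, threshold):
        # Share of observations above a bucket boundary, e.g. slower than 180 s.
        if self.count == 0:
            return 0.0
        for bound, running in self.cumulative():
            if bound == threshold:
                return 1 - running / self.count
        raise ValueError("threshold must be a bucket boundary")


# Made-up prepare durations in seconds, for illustration only.
hist = DurationHistogram(PREPARE_BUCKETS)
for duration in [4, 12, 20, 45, 50, 70, 100, 150, 170, 200]:
    hist.observe(duration)

print(hist.fraction_over(180))  # share of sample jobs that took longer than 3 minutes
```

Because only bucket counts are stored, a question such as "what share of jobs spent more than 3 minutes in prepare" can be answered exactly only when the threshold is a bucket boundary, which is one reason the choice of buckets matters.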

Technical implementation considerations

  • Should we expand this to cover all pre-defined stages?
  • When implementing this feature, we will need to track the histogram of job durations and move that to a different object in the code.
  • We can export this as a single histogram metric, with a label identifying the step.
  • PROBLEM: This new metric will create a large number of new time series. Today we have a job duration histogram with hard-coded buckets (6 or 7), so there are already 7 predefined time series for each runner manager. If we add a duration histogram for all the pre-defined steps (run, finish, cleanup), then we will export ~70 different time series. Using GitLab SaaS as an example, we are creating 1000s of time series. So how many runner instances will generate the histogram? How many buckets? 100s of time series will become too heavy to handle.
  • Based on the problem outlined above, the current proposal is to add this metric only for the prepare stage to start, and to extend it to other stages on a case-by-case basis.
  • Make this opt-in so that the metric is not exported by default.
  • Long term (in a separate issue), we should work on making the metric buckets configurable. For example, the buckets of the current histogram for overall job duration are hard-coded.
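
The cardinality concern and the opt-in default in the list above can be sketched as a small back-of-the-envelope calculation. Every name here (`StageHistogramConfig`, `enabled`, `stages`, `buckets`) is hypothetical and is not an existing GitLab Runner setting, and the bucket values and fleet size are made up. The per-stage count assumes the usual Prometheus histogram exposition: one series per explicit bucket, one `+Inf` bucket, plus `_sum` and `_count`. It is therefore slightly higher than a bucket-only count.

```python
from dataclasses import dataclass


@dataclass
class StageHistogramConfig:
    # All field names are hypothetical, not real GitLab Runner settings.
    enabled: bool = False                            # opt-in: off by default
    stages: tuple = ("prepare",)                     # stages that get a histogram
    buckets: tuple = (5, 15, 30, 60, 120, 180, 300)  # made-up bounds in seconds


def series_per_manager(cfg):
    """Time series one runner manager would export for the stage histogram."""
    if not cfg.enabled:
        return 0
    # Explicit buckets + the +Inf bucket + _sum + _count, for each stage label.
    per_stage = len(cfg.buckets) + 3
    return per_stage * len(cfg.stages)


def fleet_series(cfg, runner_managers):
    """Total series across a fleet of identical runner managers."""
    return series_per_manager(cfg) * runner_managers


default_cfg = StageHistogramConfig()
prepare_only = StageHistogramConfig(enabled=True)
all_stages = StageHistogramConfig(
    enabled=True, stages=("prepare", "run", "finish", "cleanup")
)

print(series_per_manager(default_cfg))   # 0: nothing exported unless opted in
print(series_per_manager(prepare_only))  # 10 per runner manager
print(fleet_series(all_stages, 100))     # 4000 for a hypothetical 100 managers
```

The total grows multiplicatively with stages, buckets, and runner managers, which is why limiting the metric to the prepare stage and keeping it off by default bounds the cost.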

Note:

  • We already partition the number of jobs by execution step and executor stage, and that can, to some extent, be extrapolated to the time differences between the pre-defined steps.
Edited by Darren Eastman