
Track 60s and 5m buckets on job timing for GitLab.com

Problem Statement

There are times when, because of abuse or other long-running operations, GitLab.com shared runners are delayed in starting jobs by a significant amount of time: a minute, 5 minutes, even 10 minutes. However, the metrics we collect today only have buckets up to 30 seconds.

Because of this, it can be hard to distinguish legitimate system slowdowns from small disruptions. It would be better to have finer-grained information for understanding the current state of the .com shared runners.

Existing Data

The existing CI Grafana dashboard contains a histogram of job queue durations.

However, given the bucket limits mentioned above, this data is not very valuable: the 99th percentile is often maxed out at 30s.

3 hour period

[image: job queue duration histogram over a 3 hour period]

7 day period

[image: job queue duration histogram over a 7 day period]

Solution

Add buckets for 1 minute and 5 minutes to the existing Prometheus metrics framework here: https://gitlab.com/gitlab-org/gitlab-ce/blob/d6022e9deac732df62c2907fd43ae8646ffa43f7/app/services/ci/register_job_service.rb#L9.
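To illustrate why the extra buckets help, here is a minimal Ruby sketch (no Prometheus dependency; bucket values and sample durations are illustrative, not taken from the linked code) of how cumulative histogram buckets count queue delays. With buckets capped at 30s, every longer delay collapses into the `+Inf` bucket; adding 60s and 300s buckets makes one-minute and five-minute delays distinguishable.

```ruby
# Hypothetical bucket lists: the current set topping out at 30s,
# and the proposed set extended with 60s (1m) and 300s (5m).
OLD_BUCKETS = [1, 3, 10, 30].freeze
NEW_BUCKETS = [1, 3, 10, 30, 60, 300].freeze

# Prometheus histograms are cumulative: each bucket counts observations
# less than or equal to its upper bound (`le`), plus an implicit +Inf bucket.
def cumulative_counts(durations, buckets)
  counts = buckets.map { |le| [le, durations.count { |d| d <= le }] }.to_h
  counts[:inf] = durations.size
  counts
end

# Illustrative queue delays in seconds.
durations = [2, 25, 45, 90, 240, 600]

cumulative_counts(durations, OLD_BUCKETS)
# => {1=>0, 3=>1, 10=>1, 30=>2, :inf=>6}
# Everything over 30s (45, 90, 240, 600) is indistinguishable.

cumulative_counts(durations, NEW_BUCKETS)
# => {1=>0, 3=>1, 10=>1, 30=>2, 60=>3, 300=>5, :inf=>6}
# The 60s and 300s buckets now separate minute-scale from multi-minute delays.
```

In practice this would just mean appending `60` and `300` to the `buckets:` array passed when the `job_queue_duration_seconds` histogram is registered in `register_job_service.rb`.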

(from https://gitlab.slack.com/archives/CB3LSMEJV/p1558464202238800)
