[Experimental] Define monitoring threshold for job queue duration

What does this MR do?

This is an experimental feature that may be removed without deprecation!

How quickly jobs are taken from the pending queue is one of the most popular indicators of whether the Runner works as expected.

For a while now we've been exporting a histogram metric gitlab_runner_job_queue_duration_seconds_* that provides information about the queuing durations of jobs landing on the runner.

This is a nice metric that allows analysing the behavior and provides data for things like capacity and configuration change planning.

For basic monitoring and especially for defining an SLI (Service Level Indicator) based on the job queuing duration - if that is the important factor for the user - we could leverage the same data received from GitLab, but export it in a much simpler form.

And this is what this commit brings. Apart from updating the histogram (which we still do!), we're now able to define a threshold for the acceptable queuing duration. If the queuing duration of the received job is longer than the configured threshold, a dedicated counter metric gitlab_runner_acceptable_job_queuing_duration_exceeded is increased.
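
Conceptually the addition boils down to a counter that is bumped after a simple comparison. A minimal sketch of that idea in Go, using the prometheus/client_golang library (the helper name and the exact registration and labelling in the runner's code base are assumptions of this sketch, not the real implementation):

package monitoring

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// Counter increased when a job's queuing duration exceeds the configured
// threshold. The name matches the metric described in this MR; the runner
// additionally labels it (runner, system_id), which is omitted here.
var queuingDurationExceeded = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "gitlab_runner_acceptable_job_queuing_duration_exceeded",
    Help: "Increased each time when the queuing duration was longer than the configured threshold",
})

func init() {
    prometheus.MustRegister(queuingDurationExceeded)
}

// trackQueuingDuration is a hypothetical helper: the histogram mentioned above
// is still updated elsewhere, here we only check the configured threshold.
func trackQueuingDuration(queued, threshold time.Duration) {
    if queued > threshold {
        queuingDurationExceeded.Inc()
    }
}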

In the monitoring layer that counter can then be analysed with the rate() function and used, for example, to alert when the rate of exceeding the threshold is higher than an acceptable value.

Such configuration could look as follows:

[[runners]]
  name = "example-runner"

    [runners.monitoring]
      [[runners.monitoring.job_queuing_durations]]
        periods = ["* * * * * * *"]
        timezone = "UTC"
        threshold = "1m30s"

In this case jobs queued for 1 minute and 30 seconds or less will be counted as acceptable. But a job that was queued for 1 minute and 31 seconds will exceed the threshold and therefore increase the counter.
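
To make the boundary explicit, a tiny Go check (assuming the threshold string uses Go's duration syntax, which the example value above suggests):

package main

import (
    "fmt"
    "time"
)

func main() {
    // The example threshold from the configuration above.
    threshold, err := time.ParseDuration("1m30s")
    if err != nil {
        panic(err)
    }

    fmt.Println(90*time.Second > threshold) // false - exactly 1m30s is still acceptable
    fmt.Println(91*time.Second > threshold) // true  - 1m31s would increase the counter
}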

Optionally, a second value sent from GitLab can be used - the ProjectJobsRunningOnInstanceRunnerCount. This one is usable only for instance runners, as for other types (group or project runners) it's always set to Inf+.

But for instance runners it will inform how many jobs a particular project is already executing on the instance runners at the moment of job scheduling. That number is tracked up to a limit which is hardcoded in GitLab. If a project runs from 0 to INSTANCE_RUNNER_RUNNING_JOBS_MAX_BUCKET jobs, the value will be set to the specific number. If it exceeds the limit, it will be set to INSTANCE_RUNNER_RUNNING_JOBS_MAX_BUCKET+ (where the limit value is placed instead of the constant name). This number allows analysing the Fair Scheduling Algorithm that is built into GitLab CI/CD and used for instance runners.
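
For illustration only, the bucketing described above behaves roughly like the following function (the real value is computed on the GitLab side; the limit of 5 used in main is an arbitrary number, not the actual value of the constant):

package main

import (
    "fmt"
    "strconv"
)

// jobsRunningBucket mimics the described bucketing: the exact count up to the
// limit, and "<limit>+" once a project runs more jobs than the limit allows.
func jobsRunningBucket(runningJobs, limit int) string {
    if runningJobs > limit {
        return strconv.Itoa(limit) + "+"
    }
    return strconv.Itoa(runningJobs)
}

func main() {
    fmt.Println(jobsRunningBucket(3, 5)) // "3"
    fmt.Println(jobsRunningBucket(9, 5)) // "5+"
}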

To include that field in the threshold exceeding calculation, the jobs_running_for_project entry should be configured with a regexp to match against the value sent by GitLab. That could look as follows:

[[runners]]
  name = "example-runner"

    [runners.monitoring]
      [[runners.monitoring.job_queuing_durations]]
        periods = ["* * * * * * *"]
        timezone = "UTC"
        threshold = "1m30s"
        jobs_running_for_project = "^[0-3]$"

In this case the metric will be increased when the job queuing duration exceeds 1 minute and 30 seconds, but only when GitLab reported that the project, at the moment of job scheduling, was already running from 0 to 3 jobs on any existing instance runner. If that project had been running 4 or more jobs on the instance runners, the threshold is ignored and the expectation is counted as met.
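
A minimal sketch of how the two conditions combine (the function and variable names are made up for this example; only the regexp and the threshold come from the configuration above):

package main

import (
    "fmt"
    "regexp"
    "time"
)

// exceedsExpectation returns true only when the queuing duration is above the
// threshold AND the jobs-running-for-project value reported by GitLab matches
// the configured regexp.
func exceedsExpectation(queued, threshold time.Duration, reported, pattern string) bool {
    if queued <= threshold {
        return false
    }
    matched, err := regexp.MatchString(pattern, reported)
    if err != nil {
        return false
    }
    return matched
}

func main() {
    threshold := 90 * time.Second // "1m30s"

    // Queued too long and the project was running 2 jobs on instance runners: counted.
    fmt.Println(exceedsExpectation(2*time.Minute, threshold, "2", "^[0-3]$")) // true

    // Queued too long, but the project was already running 4 jobs: counted as met.
    fmt.Println(exceedsExpectation(2*time.Minute, threshold, "4", "^[0-3]$")) // false
}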

For a single runner configuration we can define multiple job_queuing_durations entries, matched by different time periods. This allows defining different thresholds for dedicated times. The periods field is evaluated using a cron syntax within the configured time zone. If the timezone field is not defined, Local is assumed, which should use the time zone set for the runner process in the OS.

Entries are evaluated in the order of definition, and the last matching configuration is applied for a given time.

An example of the periods usage could look like this:

[[runners]]
  name = "example-runner"

    [runners.monitoring]
      [[runners.monitoring.job_queuing_durations]]
        periods = ["* * * * * * *"]
        timezone = "UTC"
        threshold = "1m"
      [[runners.monitoring.job_queuing_durations]]
        periods = ["* * * * * sat,sun *"]
        timezone = "UTC"
        threshold = "5m"

With this configuration, a 1-minute threshold would be used as the default most of the time, but during the weekend (sat,sun) that threshold would be increased to 5 minutes.
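
To make the "last matching entry wins" rule concrete for the example above, a short Go sketch (jobQueuingDuration and periodMatches are hypothetical stand-ins for the runner's internal types and cron evaluation, not its actual API):

package main

import (
    "fmt"
    "time"
)

type jobQueuingDuration struct {
    periods   []string
    timezone  string
    threshold time.Duration
}

// activeThreshold walks the entries in the order of definition and keeps the
// threshold of the last entry whose period covers the given time.
func activeThreshold(entries []jobQueuingDuration, now time.Time,
    periodMatches func(jobQueuingDuration, time.Time) bool) time.Duration {

    var active time.Duration
    for _, e := range entries {
        if periodMatches(e, now) {
            active = e.threshold // a later match overrides an earlier one
        }
    }
    return active
}

func main() {
    entries := []jobQueuingDuration{
        {periods: []string{"* * * * * * *"}, timezone: "UTC", threshold: time.Minute},
        {periods: []string{"* * * * * sat,sun *"}, timezone: "UTC", threshold: 5 * time.Minute},
    }

    // Simulate a Saturday, when both periods match: the last entry (5m) wins.
    saturday := func(_ jobQueuingDuration, _ time.Time) bool { return true }
    fmt.Println(activeThreshold(entries, time.Now(), saturday)) // 5m0s
}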

When merged, this change will add a metric exported as:

# HELP gitlab_runner_acceptable_job_queuing_duration_exceeded Increased each time when the queuing duration was longer than the configured threshold
# TYPE gitlab_runner_acceptable_job_queuing_duration_exceeded counter
gitlab_runner_acceptable_job_queuing_duration_exceeded_total{runner="9_F4bzrV3",system_id="s_b5a2f9de542e"} 0

Why was this MR needed?

To provide a generalized way of defining an SLI in GitLab Runner based on the job queuing duration (which is one of the most popular factors for deciding whether the Runner setup is healthy and works as expected).

The idea was taken from the discussion at gitlab-com/runbooks!6225 (comment 1642599654).

What's the best way to test this MR?

What are the relevant issue numbers?

https://gitlab.com/gitlab-org/ci-cd/shared-runners/infrastructure/-/issues/194
