Concurrency Tuning / Runner Metrics
This idea came about after a conversation with @gdoyle. Through auto-tuning we help remove some of the complexity of configuring runner concurrency.
Summary
Runner limits the number of jobs running on a machine with the concurrency parameter (or capacity_per_instance for autoscaled runners). This prevents the runner from being overloaded with work. Overloading a runner with CPU and IO work, which are compressible resources, merely slows down job execution. But overloading a runner with memory work can cause outright job failure. On the other hand, no one wants to overpay for CI/CD infrastructure. So there is pressure to maximize utilization by tuning concurrency as high as possible while staying below the failure threshold. Runner is in an ideal position to support concurrency tuning because it has access to the job execution environment and a mechanism for providing metrics and backpressure to GitLab.
Design
Metric Collection
Runner will collect CPU, memory and IO metrics from job execution environments. In autoscaled scenarios these will be metrics from the ephemeral (worker) machines, returned through the fleeting interface. Each fleeting plugin will determine how to collect those metrics from the underlying cloud provider. For non-autoscaled scenarios, the runner will collect these metrics directly from its operating system.
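As a rough sketch of what this could look like in Go, the fleeting plugin interface could be extended with a metrics call. The names, types and fields below are illustrative assumptions, not part of the current fleeting API:

```go
package fleetingmetrics

import "context"

// InstanceMetrics is a hypothetical per-worker resource snapshot that a
// plugin could assemble from its cloud provider's monitoring API.
type InstanceMetrics struct {
	InstanceID       string
	CPUUtilization   float64 // fraction of available CPU in use, 0.0-1.0
	MemoryUsedBytes  uint64
	MemoryTotalBytes uint64
	IOUtilization    float64 // device busy fraction, 0.0-1.0
}

// MetricsCollector is a sketch of an optional interface a fleeting plugin
// could implement to report metrics for a set of worker instances.
type MetricsCollector interface {
	CollectInstanceMetrics(ctx context.Context, instanceIDs []string) ([]InstanceMetrics, error)
}
```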
Runner will then aggregate the metrics and surface a utilization metric for each resource (CPU, memory and IO) as well as a concurrency utilization metric. Concurrency utilization is the average number of jobs running at a given moment expressed as a percentage of the configured maximum concurrency. Concurrency is related to resource utilization but not directly tied to it, so concurrency must be tuned in order to represent how large a slice of the machine's resources each job requires on average.
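A minimal sketch of the concurrency utilization calculation, assuming the runner samples the number of running jobs over a window (names are illustrative, not taken from the runner codebase):

```go
// concurrencyUtilization returns the average number of concurrently running
// jobs across the sampled window as a fraction of the configured maximum.
func concurrencyUtilization(runningJobSamples []int, maxConcurrency int) float64 {
	if maxConcurrency <= 0 || len(runningJobSamples) == 0 {
		return 0
	}
	total := 0
	for _, running := range runningJobSamples {
		total += running
	}
	average := float64(total) / float64(len(runningJobSamples))
	return average / float64(maxConcurrency)
}
```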
Concurrency Tuning UI
When resource utilization and concurrency utilization are placed on the same graph, tuning decisions become obvious. If peak concurrency is reached without reaching high resource utilization, there is an opportunity to tune concurrency upward. If peak resource utilization is reached without reaching high concurrency, concurrency may need to be tuned down to prevent job failure. The ideal concurrency setting depends on the distribution of job resource footprints. If job sizes vary wildly, there must always be enough headroom to absorb the several large jobs that will arrive from time to time. If job sizes are very similar, concurrency utilization can be pushed very high because concurrency is a good proxy for resource consumption.
These metrics can be surfaced in the GitLab UI via Prometheus query. Then the user can make runner configuration changes to adjust concurrency accordingly.
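The tuning decision described above could be encoded roughly as follows; the thresholds are placeholder assumptions, not values from this design:

```go
// tuningHint compares peak concurrency utilization with peak resource
// utilization over an observation window and suggests a tuning direction.
func tuningHint(peakConcurrencyUtil, peakResourceUtil float64) string {
	switch {
	case peakConcurrencyUtil >= 0.95 && peakResourceUtil < 0.70:
		return "consider increasing concurrency: job slots fill up while resources stay idle"
	case peakResourceUtil >= 0.90 && peakConcurrencyUtil < 0.70:
		return "consider decreasing concurrency: resources saturate before the job limit is reached"
	default:
		return "no change recommended"
	}
}
```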
Concurrency Auto-tuning
Because tuning decisions are obvious when all the data is laid out properly, they are a good candidate for automation. A concurrency auto-tuner will maintain a histogram of resource usage during each job execution. It will then select the 90th percentile of resource usage to represent the resources required for each “job”. Prior art in the Kubernetes Vertical Pod Autoscaler (its target-cpu-percentile setting) gives us a reasonable starting value. The available resources of the machine are then divided by the “job” resource requirement (rounding down) to yield the ideal concurrency setting.
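A sketch of that calculation for a single resource (memory), assuming the runner records per-job peak usage. The 0.90 percentile mirrors the VPA default mentioned above, and all names are illustrative:

```go
package autotune

import (
	"math"
	"sort"
)

// idealConcurrency estimates how many "typical" jobs fit on a machine by
// taking the 90th percentile of observed per-job peak memory usage and
// dividing the machine's memory by it, rounding down.
func idealConcurrency(perJobPeakMemoryBytes []uint64, machineMemoryBytes uint64) int {
	if len(perJobPeakMemoryBytes) == 0 {
		return 1
	}
	sorted := append([]uint64(nil), perJobPeakMemoryBytes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Index of the 90th percentile sample (nearest-rank method).
	rank := int(math.Ceil(0.90*float64(len(sorted)))) - 1
	p90 := sorted[rank]
	if p90 == 0 {
		return 1
	}
	return int(machineMemoryBytes / p90)
}
```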
This calculation can actually be performed in the runner itself where all the data is available. The runner is also where the actuation of the concurrency setting takes place so the runner can auto-tune its own concurrency. However the user will likely want to observe the utilization metrics and recommendations in the UI before selectively applying them.
Concurrency Auto-clamp
There are situations where the runner knows something about resource consumption and should take immediate action on concurrency, regardless of the configured concurrency setting. If an additional job is likely to overload a machine and cause job failure, then the runner should refuse to accept more jobs by artificially reducing the effective concurrency setting (clamping). For example, if 3 jobs are currently executing and memory utilization is at 95%, then runner should not accept any more jobs until a job completes and resource utilization drops, regardless of the configured concurrency setting. Because runner doesn’t have an internal queue, this “backpressure” is immediately applied to GitLab, which can route the job to another runner (if one is available).
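A minimal auto-clamp sketch, assuming a memory threshold of 90%; the threshold and names are assumptions for illustration:

```go
// memoryClampThreshold is an assumed cutoff; in practice it would likely be
// configurable per runner.
const memoryClampThreshold = 0.90

// effectiveConcurrency clamps the configured concurrency down to the number
// of jobs already running whenever memory utilization crosses the threshold,
// so no new job is accepted until utilization drops again.
func effectiveConcurrency(configured, runningJobs int, memoryUtilization float64) int {
	if memoryUtilization >= memoryClampThreshold {
		return runningJobs
	}
	return configured
}
```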
Concurrency auto-clamping acts as a guardrail for concurrency tuning and allows users to be more aggressive with their configuration. Through backpressure it also enables more balanced job distribution. If job sizes vary wildly, then concurrency can be automatically adjusted on the fly (via clamping) by each runner. This allows an unlucky runner that picked up a rare, huge job to push additional work onto other runners. Fairer work distribution results in higher average resource utilization and cost savings.
Kubernetes Usage
Given a pod’s resource requests and limits, concurrency can be tuned to maximize resource utilization. However the pod/node abstraction of Kubernetes also allows the inverse: tuning pod resource requests to a fixed concurrency. Vertical Pod Autoscaler does support Kubernetes Jobs so it could in principle be used to accomplish the inverse of concurrency tuning. However because runner creates pods directly, tuning resource requests in runner would allow bypassing the VPA’s admission-controller-based actuation.
Whether tuning concurrency or resources, the auto-tuning approach extends to Kubernetes as well as other execution environments.
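For the Kubernetes case, the inverse approach could look roughly like this: derive the job pod’s resource requests from the observed p90 usage instead of tuning concurrency. The Kubernetes types are real; the surrounding function is an illustrative assumption:

```go
package kubetune

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// jobPodResources builds resource requests for a job pod from the 90th
// percentile of observed per-job CPU (in millicores) and memory (in bytes).
func jobPodResources(p90CPUMillicores, p90MemoryBytes int64) corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    *resource.NewMilliQuantity(p90CPUMillicores, resource.DecimalSI),
			corev1.ResourceMemory: *resource.NewQuantity(p90MemoryBytes, resource.BinarySI),
		},
	}
}
```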
Tuning Runner Manager Concurrency
In autoscaling scenarios the actual work is done on workers. We have started supporting multiple jobs per worker with the capacity_per_instance parameter, which we can auto-tune as described above. However we can also auto-tune runner manager concurrency at the same time, using the same mechanism. The resource consumption in this case is the overhead of routing jobs to workers and streaming logs back to GitLab.
Implementation
Auto-clamp Spike
A good way to explore this idea would be to implement a simple auto-clamp. This would put in place the metric collection, which is the common core of all the features above, and it would allow experimentation with various target percentiles and job distributions.
It would not require plumbing metrics into the UI or the decision back to the runner, since the auto-clamp is entirely local. This makes it a good candidate for a short sprint of experimentation (a spike).
