Improve GitLab CI queueing mechanism

Problem to solve

GitLab CI uses a greedy approach for CI queueing:

  1. This is implemented by https://gitlab.com/gitlab-org/gitlab/blob/a76b3bd591527fff9e8cef8cdc393ed3c3b790e3/app/services/ci/register_job_service.rb#L9
  2. On each request to /api/v4/jobs/request it executes an expensive SQL query to find ALL matching builds that could be picked by the runner
  3. Only a single job is picked from that list
  4. We perform Fair Scheduling as part of this query: we prefer projects that do not have any CI jobs running
  5. This is very inefficient, as we have to recalculate the whole state and perform all DB calculations on every request. This is especially visible when there are many pending jobs in the queue, resulting in performance degradation
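To make the problem concrete, here is a much-simplified sketch of the greedy approach described above. This is NOT the actual RegisterJobService code; the structs, fields, and method names are illustrative assumptions only:

```ruby
# Illustrative stand-ins for the real ActiveRecord models (assumptions).
Build = Struct.new(:id, :project_id, :tags, :protected)
Runner = Struct.new(:tags, :protected)

# On EVERY /api/v4/jobs/request we scan ALL pending builds, apply fair
# scheduling (projects with no running jobs first), and pick just one.
def pick_build(runner, pending_builds, running_counts)
  matching = pending_builds.select do |b|
    (b.tags - runner.tags).empty? && (!b.protected || runner.protected)
  end
  # Fair scheduling: prefer projects without running jobs, then FIFO by id.
  matching.min_by { |b| [running_counts.fetch(b.project_id, 0), b.id] }
end
```

The whole `matching` set is recomputed on every request even though only one element of it is used, which is the inefficiency this proposal targets.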

Intended users

  • Sidney (Systems Administrator)

Proposal

Change the current queueing from being greedy to being cached.

The flow

  1. We would use Redis to provide a dynamic queue for a given Runner
  2. A request hitting /api/v4/jobs/request would check the Redis queue and get the next build from it
  3. If the queue does not exist, it would simply schedule a RunnerQueueWorker that generates this dynamic queue for the given Runner
  4. /api/v4/jobs/request would always only do an LPOP from Redis, and would not perform the expensive DB query for matching
  5. The RunnerQueueWorker would execute our previous /api/v4/jobs/request SQL query to perform matching and put the results in the dynamic queue. We would not try to enqueue all builds; rather, we would push at most 10 to 100 builds to the dynamic queue
  6. The queue key could be built from the Runner's matching criteria (tags, protected, etc.), allowing all runners with the same matching configuration to share the same dynamic queue
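The flow above can be sketched as follows. This is a hedged sketch, not a proposed implementation: a Hash of Arrays stands in for Redis lists (RPUSH/LPOP), and the class and method names are hypothetical:

```ruby
# Minimal sketch of the proposed dynamic queue (names are assumptions).
class DynamicQueue
  QUEUE_LIMIT = 100 # push at most 10-100 builds per refill, per the proposal

  def initialize(redis = Hash.new { |h, k| h[k] = [] })
    @redis = redis # stand-in for a real Redis connection
  end

  # /api/v4/jobs/request path: only an LPOP, no SQL matching.
  def request_job(queue_key)
    job = @redis[queue_key].shift # LPOP
    schedule_refill(queue_key) if @redis[queue_key].empty?
    job
  end

  # RunnerQueueWorker path: run the expensive matching query once and
  # prefill the queue shared by every runner with this key.
  def refill(queue_key, matched_builds)
    matched_builds.first(QUEUE_LIMIT).each { |b| @redis[queue_key] << b } # RPUSH
  end

  private

  def schedule_refill(queue_key)
    # In GitLab this would enqueue a Sidekiq RunnerQueueWorker; no-op here.
  end
end
```

The key property is that the request path never touches the database: a miss only schedules a background refill, and the next request picks up the prefilled queue.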

Performance

  1. We would introduce a more stateful mechanism
  2. We would not recalculate the whole world on each /api/v4/jobs/request
  3. Multiple runners would use the same dynamically provisioned queue
  4. The dynamic queue would retain Fair Scheduling, but be limited to a sensible number of CI jobs that can be picked
  5. /api/v4/jobs/request would perform a more exhaustive verification of the build, like checking quota limits, to decide whether it should be skipped
  6. We would effectively multiplex multiple Runners and multiple calls to /api/v4/jobs/request into a single short-lived Worker that prefills the dynamic queue
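For point 3 to work, identically configured runners need to resolve to the same queue. One way to sketch this is to derive the queue key deterministically from the runner's matching criteria; the key format and field names here are assumptions:

```ruby
require 'digest'

# Derive a shared Redis key from a runner's matching criteria (tags,
# protected flag, untagged handling). Sorting the tags makes the key
# order-independent, so runners with equivalent configs share one queue.
def runner_queue_key(tags:, protected_flag:, run_untagged:)
  criteria = "tags=#{tags.sort.join(',')};" \
             "protected=#{protected_flag};untagged=#{run_untagged}"
  "runner_queue:#{Digest::SHA256.hexdigest(criteria)[0, 16]}"
end
```

Hashing the criteria keeps keys short and avoids leaking tag names into Redis key space, though a plain concatenated key would work just as well.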

Outcome

  1. I expect the impact to be at least a 10x speed improvement, without any impact on CI processing
  2. Very minimal impact on Redis, as we would store only a limited number of CI jobs and also group many runners based on matching criteria
  3. Easy to implement

Steps

I consider the following steps to achieve the above improvement:

  1. Currently /api/v4/jobs/request uses DB replicas to reduce load on the primary. I would expect to retain that aspect, which means we would have to extend Sidekiq to support read-only replicas: #215647 (moved)
  2. We would split RegisterJobService into two services: one filtering builds (RunnerJobsQueue, via the expensive SQL query) and an actual AssignJobToRunnerService that handles acceptance of a given job by a given runner
  3. We would implement an additional service, DynamicRunnerJobsQueue, that uses Redis and uses RunnerJobsQueue to fill the queue and fetch the latest job
  4. We would change /api/v4/jobs/request to have a switchable job-fetching strategy: either RunnerJobsQueue or DynamicRunnerJobsQueue
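Step 4 can be sketched as a simple strategy switch. The service names mirror the proposal (RunnerJobsQueue, DynamicRunnerJobsQueue), but their interfaces here are stubs and assumptions:

```ruby
class RunnerJobsQueue
  # Existing path: expensive SQL matching on every request (stubbed here).
  def fetch_job(runner)
    :job_from_sql
  end
end

class DynamicRunnerJobsQueue
  # New path: LPOP from the per-criteria Redis queue (stubbed here).
  def fetch_job(runner)
    :job_from_redis
  end
end

class JobRequest
  STRATEGIES = {
    database: RunnerJobsQueue,
    dynamic_queue: DynamicRunnerJobsQueue
  }.freeze

  # The strategy could be selected via a feature flag, allowing a
  # gradual, revertible rollout of the new queueing mechanism.
  def initialize(strategy: :database)
    @queue = STRATEGIES.fetch(strategy).new
  end

  def execute(runner)
    @queue.fetch_job(runner)
  end
end
```

Keeping both strategies behind one entry point means the dynamic queue can be enabled per-instance or per-runner and rolled back instantly if it misbehaves.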

Of course this does not discuss all technical details, but shows a general direction of how this change would be done.

Remarks

It does not try to solve all aspects of https://gitlab.com/gitlab-org/gitlab-ce/issues/37695. It tries to improve a very specific aspect of CI to make it significantly more performant and give us a significant amount of time to decide about the CI Daemon.

This change should bring significant headroom for the CI queue and scale very well, handling 10x higher CI pending queues without a noticeable impact. It will likely scale 2x-4x in terms of Runner connectivity. However, to improve Runner connectivity further we should use a different protocol than the REST API, likely gRPC, ActionCable, or something else.

The numbers thrown here are just my expectations. I did not yet validate them :)

Links / references

Edited Aug 29, 2025 by 🤖 GitLab Bot 🤖