Improve GitLab CI queueing mechanism
## Problem to solve
GitLab CI uses a greedy approach for CI queueing:
- This is implemented by https://gitlab.com/gitlab-org/gitlab/blob/a76b3bd591527fff9e8cef8cdc393ed3c3b790e3/app/services/ci/register_job_service.rb#L9
- On each request to `/api/v4/jobs/request` it executes an expensive SQL query to find ALL matching builds that can be picked by the runner
- Only a single job is picked from that list
- We perform fair scheduling as part of this query: we prefer projects that do not have any CI jobs running
- This is very inefficient, as we have to recalculate the whole state and perform all DB calculations on each request; it is especially visible when there are many pending jobs in the queue, resulting in performance degradation
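To make the problem concrete, here is a simplified sketch of the greedy behaviour described above. All names are hypothetical stand-ins (plain Ruby arrays replace the database; this is not the actual `RegisterJobService` code):

```ruby
# Simplified sketch of the current greedy queueing. A plain Ruby array
# stands in for the database table of pending builds. Hypothetical code.
Build = Struct.new(:id, :project_id, :tags, keyword_init: true)

class GreedyQueue
  def initialize(pending_builds, running_counts)
    @pending = pending_builds   # all pending builds in the system
    @running = running_counts   # project_id => number of running jobs
  end

  # On EVERY runner request we scan ALL pending builds, order them by
  # fair scheduling (projects with no running jobs first), and pick one.
  def pick(runner_tags)
    matching = @pending.select { |b| (b.tags - runner_tags).empty? }
    fair = matching.sort_by { |b| [@running.fetch(b.project_id, 0), b.id] }
    picked = fair.first
    @pending.delete(picked) if picked
    picked
  end
end
```

The full scan and sort happens on every request, which is exactly the cost this proposal removes.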
## Intended users
## Proposal
Change the current queueing from being greedy to being cached.
### The flow
- We would use Redis, which would provide a concept of a dynamic queue for a given Runner
- A request hitting `/api/v4/jobs/request` would check the Redis queue and get the next build from it
- If there's no queue, it would simply schedule a `RunnerQueueWorker` that would generate this dynamic queue for the given Runner
- `/api/v4/jobs/request` would always only do an `LPOP` from Redis, and would not perform the expensive DB matching
- The `RunnerQueueWorker` would execute our previous `/api/v4/jobs/request` SQL query to perform matching and put the results in the dynamic queue; we would not try to generate a queue for all builds, but rather push at most 10 to 100 of them to the dynamic queue
- The queue key could be built from the matching criteria of the Runner (tags, protected, etc.), allowing us to have THE same dynamic queue for all runners having the same config that is used for matching CI builds
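The flow above can be sketched as follows. A `Hash` stands in for Redis here; in a real implementation `request_job` and `refill` would map to Redis `LPOP` and `RPUSH`. All class and method names are hypothetical:

```ruby
# Hypothetical sketch of the proposed dynamic queue. An in-memory Hash
# of arrays stands in for Redis lists.
class DynamicRunnerQueue
  QUEUE_LIMIT = 100 # push at most this many builds per refill

  def initialize(store = Hash.new { |h, k| h[k] = [] })
    @store = store
  end

  # The /api/v4/jobs/request path: only a cheap pop, no SQL at all.
  # Returns nil when the queue is empty.
  def request_job(queue_key)
    @store[queue_key].shift
  end

  # The RunnerQueueWorker path: run the (expensive) matching query once
  # and prefill the queue with a bounded slice of the result.
  def refill(queue_key, matching_builds)
    @store[queue_key].concat(matching_builds.first(QUEUE_LIMIT))
  end

  def empty?(queue_key)
    @store[queue_key].empty?
  end
end
```

On an empty queue the API would return "no job" immediately and schedule the worker asynchronously, so the expensive query never runs in the request path.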
### Performance
- We would introduce a more stateful mechanism
- We would not recalculate the whole world on each `/api/v4/jobs/request`
- Multiple runners would use the same dynamically provisioned queue
- The dynamic queue would retain Fair Scheduling, but be limited to a sensible number of CI jobs that can be picked
- `/api/v4/jobs/request` would perform a more exhaustive verification of the build, like checking quota limits to decide if something should be skipped
- We would effectively multiplex multiple Runners and multiple calls to `/api/v4/jobs/request` into a single short-living Worker that prefills the dynamic queue
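The "same queue for runners with the same config" point could look like the sketch below: the queue key is derived only from the runner attributes used for matching, so identically configured runners hash to the same queue. The attribute set and function name are assumptions for illustration:

```ruby
require "digest"

# Hypothetical sketch: derive the dynamic queue key from only the runner
# attributes that participate in build matching, so runners with an
# identical matching config share one prefilled queue.
def runner_queue_key(tags:, protected_only:, run_untagged:)
  Digest::SHA256.hexdigest(
    [tags.sort.join(","), protected_only, run_untagged].join("|")
  )
end
```

Sorting the tags makes the key order-independent, so `["docker", "linux"]` and `["linux", "docker"]` map to the same queue.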
### Outcome
- I expect the impact of doing this to be at least a 10x speed improvement, without any impact on CI processing
- Very minimal impact on Redis, as we would be storing a limited number of CI jobs, and also grouping many runners based on matching criteria
- Easy to implement
### Steps
I consider the following steps to achieve the above improvement:
- Currently `/api/v4/jobs/request` uses DB replicas to reduce load on the primary; I would expect to retain that aspect, which means we would have to extend Sidekiq to support read-only replicas: #215647 (moved)
- We would split `RegisterJobService` into two services: one filtering builds (`RunnerJobsQueue`, via the expensive SQL query) and an actual `AssignJobToRunnerService` that handles acceptance of a given job by a given runner
- We would implement an additional service `DynamicRunnerJobsQueue` that uses Redis and uses `RunnerJobsQueue` to fill and fetch the latest job
- We would change `/api/v4/jobs/request` to have a switchable strategy of fetching jobs, switching between `RunnerJobsQueue` and `DynamicRunnerJobsQueue`
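The switchable strategy in the last step could be as simple as the hypothetical sketch below, where stand-in objects play the roles of `RunnerJobsQueue` and `DynamicRunnerJobsQueue` and a boolean flag (in practice, likely a feature flag) selects between them:

```ruby
# Hypothetical sketch of a switchable job-fetching strategy for
# /api/v4/jobs/request. Both queue objects are assumed to respond to
# #fetch(runner_key); the flag would be a feature flag in practice.
class JobRequestHandler
  def initialize(legacy_queue, dynamic_queue, use_dynamic:)
    @legacy = legacy_queue    # stands in for RunnerJobsQueue
    @dynamic = dynamic_queue  # stands in for DynamicRunnerJobsQueue
    @use_dynamic = use_dynamic
  end

  def fetch_job(runner_key)
    strategy = @use_dynamic ? @dynamic : @legacy
    strategy.fetch(runner_key)
  end
end
```

Keeping both strategies behind one interface makes it easy to roll the new path out gradually and fall back if needed.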
Of course this does not discuss all technical details, but shows a general direction of how this change would be done.
## Remarks
It does not try to solve all aspects of https://gitlab.com/gitlab-org/gitlab-ce/issues/37695. It tries to improve a very specific aspect of CI to make it significantly more performant and give us a significant amount of time to decide about the CI Daemon.
This change should bring significant headroom for the CI queue and scale very well even with 10x higher CI pending queues, without noticeable impact. It will likely scale 2x-4x for Runner connectivity. However, to improve Runner connectivity we should use a different protocol than the REST API, likely gRPC, ActionCable, or something else.
The numbers thrown here are just my expectations. I did not yet validate them :)