Spike: Research: CI/CD fair scheduling algorithm - capacity buckets
Description
NOTE: This issue covers the research required before #341027 can take place: create a proof of concept, then benchmark it against production data. The POC should give us a solid understanding of how projects that rarely have pipelines created in them take precedence in the build queue.
Our current fair scheduling mechanism is neither flexible nor efficient enough. We need a mechanism that is more flexible and does not negatively impact our availability metrics. The fair scheduling algorithm prioritizes projects that do not create many builds, which can delay CI/CD processing for projects that do create many builds.
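To make the prioritization described above concrete, here is a minimal in-memory sketch. It assumes that "projects that do not create many builds" can be approximated by the number of builds currently running per project. The `Build` struct, `fair_order`, and the sample data are hypothetical and only illustrate the idea; they are not the actual implementation.

```ruby
# Hypothetical in-memory stand-in for a pending build; not the real model.
Build = Struct.new(:id, :project_id)

# Orders pending builds so that projects with the fewest running builds
# come first. Ties fall back to the build id (assumed to grow over time,
# so the older build wins).
def fair_order(pending_builds, running_counts)
  pending_builds.sort_by do |build|
    [running_counts.fetch(build.project_id, 0), build.id]
  end
end

pending = [
  Build.new(1, :busy),
  Build.new(2, :busy),
  Build.new(3, :quiet)
]
running = { busy: 5 } # :quiet has nothing running right now

ordered_ids = fair_order(pending, running).map(&:id)
puts ordered_ids.inspect # => [3, 1, 2]
```

The build from the quiet project jumps ahead of the two older builds from the busy project, which is the delay for high-volume projects that this issue is concerned with.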
Problem(s)
| Problem | Data to quantify the problem scope | Notes |
|---|---|---|
| The fair scheduling mechanism is not flexible. | TBD | |
| The fair scheduling mechanism is not efficient. | TBD | |
Proposal
A new architecture blueprint with recommendations and guidelines for the scheduling algorithm.
How will the new algorithm be measured?
There are four objectives we want to eventually achieve by refactoring the RegisterJobService. Of these four, we'd like to achieve as many as possible at once.
- Allow for a running-build-per-project limit to be implemented in Ruby
- Make the build-selection SQL query agnostic to the project a build belongs to (it won't be needed if we can implement Objective 1)
- Remove `ORDER BY` from the build-selection query made by the `RegisterJobService` (likely an implementation detail of Objective 2)
- Move as much business logic as possible out of SQL and into Ruby. (This is more of a guiding principle)
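As a rough illustration of how Objectives 1 and 2 could fit together, the sketch below applies a running-build-per-project limit in Ruby to a list of candidates that is assumed to come from a project-agnostic query. `QueuedBuild`, `pick_build`, the `limit:` parameter, and the sample data are all hypothetical names introduced for this example, not existing code.

```ruby
# Hypothetical stand-in for a queued build; not the real model.
QueuedBuild = Struct.new(:id, :project_id)

# Picks the first candidate whose project is below the running-build limit.
# `candidates` is assumed to be the result of a project-agnostic query.
# `running_counts` maps project_id => number of currently running builds.
def pick_build(candidates, running_counts, limit:)
  candidates.find do |build|
    running_counts.fetch(build.project_id, 0) < limit
  end
end

candidates = [
  QueuedBuild.new(10, 1),
  QueuedBuild.new(11, 1),
  QueuedBuild.new(12, 2)
]
running_counts = { 1 => 2, 2 => 0 }

picked = pick_build(candidates, running_counts, limit: 2)
puts picked.id # => 12
```

Project 1 is already at the limit, so both of its builds are skipped and the build from project 2 is picked. Whether filtering in Ruby like this is fast enough is exactly what the performance criteria in this issue are meant to check.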
We have two constraining criteria. While figuring out how many of the above objectives we can achieve at once, we need to make sure both of these performance criteria are satisfied. Ultimately, none of our refactoring can come at the expense of real-world performance.
- Effects on our performance against the job-picking SLO should be neutral-to-positive (no degradation)
- The change in the `EXPLAIN ANALYZE` query plan performance for job registration should be neutral-to-positive (no degradation)
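One possible way to check the second criterion during benchmarking is to compare the `Execution Time` line that PostgreSQL prints at the end of `EXPLAIN ANALYZE` text output. The sketch below assumes that approach. The helper names, the 5% noise tolerance, and the plan excerpts are made up for illustration and are not production measurements.

```ruby
# Extracts the "Execution Time" (in ms) from EXPLAIN ANALYZE text output.
# Returns nil if the line is missing.
def execution_time_ms(plan_text)
  match = plan_text.match(/Execution Time: ([\d.]+) ms/i)
  match && match[1].to_f
end

# True when the candidate plan is no slower than the baseline, within a
# tolerance (hypothetical default of 5%) to absorb run-to-run noise.
def no_degradation?(baseline_plan, candidate_plan, tolerance: 0.05)
  baseline = execution_time_ms(baseline_plan)
  candidate = execution_time_ms(candidate_plan)
  raise ArgumentError, "missing Execution Time line" if baseline.nil? || candidate.nil?

  candidate <= baseline * (1 + tolerance)
end

# Made-up plan excerpts for illustration only.
baseline_plan = <<~PLAN
  Planning Time: 0.310 ms
  Execution Time: 40.200 ms
PLAN

candidate_plan = <<~PLAN
  Planning Time: 0.280 ms
  Execution Time: 12.500 ms
PLAN

puts no_degradation?(baseline_plan, candidate_plan) # => true
```

A single run is noisy, so in practice the comparison would need repeated runs against production-like data before drawing any conclusion.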
To be clear, the goal of this issue is to investigate the listed objectives and come up with a reasonable plan for achieving them, or as many of them as we decide are possible, at once. Delivering the actual changes will be tracked by a separate issue.