Spike: Research: CI/CD fair scheduling algorithm - capacity buckets

Description

NOTE: This issue covers the required research, a proof of concept (POC), and benchmarking against production data, all of which must happen before #341027 can start. The POC should give us a solid understanding of how projects that rarely create pipelines take precedence in the build queue.

Our current fair scheduling mechanism is neither flexible nor efficient enough. We need a more flexible mechanism that does not negatively impact our availability metrics. The fair scheduling algorithm prioritizes projects that do not create many builds, which can delay CI/CD processing for projects that do create many builds.
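To make the prioritization concrete, here is a minimal in-memory sketch. It assumes "does not create many builds" is measured by the number of builds a project currently has running; that measure, the `Build` struct, and the `fair_order` helper are all hypothetical and used for illustration only. The production mechanism lives in SQL inside the RegisterJobService and may differ.

```ruby
# Hypothetical in-memory model; not the RegisterJobService implementation.
Build = Struct.new(:id, :project_id)

# Order pending builds so that projects with fewer running builds come first.
# Ties are broken by build id (lowest id first).
# ASSUMPTION: "busy" is measured by the count of currently running builds.
def fair_order(pending, running_counts)
  pending.sort_by { |build| [running_counts.fetch(build.project_id, 0), build.id] }
end

pending = [Build.new(1, :busy), Build.new(2, :busy), Build.new(3, :quiet)]
running_counts = { busy: 40 } # :quiet has no running builds

fair_order(pending, running_counts).map(&:id) # => [3, 1, 2]
```

Build 3 jumps ahead of builds 1 and 2 even though it was created last, which illustrates how a project with many builds can see its CI/CD processing delayed.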

Problem(s)

Problem	Data to quantify the problem scope	Notes
The fair scheduling mechanism is not flexible.	TBD
The fair scheduling mechanism is not efficient.	TBD

Proposal

A new architecture blueprint with recommendations and guidelines for the scheduling algorithm.

How will the new algorithm be measured?

There are four objectives we eventually want to achieve by refactoring the RegisterJobService. Of these four, we'd like to achieve as many as possible at once.

1. Allow for a running-build-per-project limit to be implemented in Ruby.
2. Make the build-selection SQL query agnostic to the project a build belongs to (project awareness in the query won't be needed if we can implement Objective 1).
3. Remove ORDER BY from the build-selection query made by the RegisterJobService (likely an implementation detail of Objective 2).
4. Move as much business logic as possible out of SQL and into Ruby (this is more of a guiding principle).
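As a rough illustration of how the first two objectives could fit together, the sketch below applies a running-build-per-project limit in Ruby to a candidate list that was fetched without any project-specific logic or ordering. Everything here is an assumption made for illustration: `PendingBuild`, `pick_build`, the `running` counts hash, and the limit value are hypothetical names, not part of the RegisterJobService.

```ruby
# Hypothetical names throughout; not the RegisterJobService API.
PendingBuild = Struct.new(:id, :project_id)

# Walk a project-agnostic candidate list, in whatever order the query
# returned it, and return the first build whose project is still below
# the running-build limit. Returns nil when every project is at its limit.
def pick_build(candidates, running_counts, limit_per_project)
  candidates.find do |build|
    running_counts.fetch(build.project_id, 0) < limit_per_project
  end
end

candidates = [PendingBuild.new(10, :a), PendingBuild.new(11, :b)]
running = { a: 5, b: 1 } # ASSUMPTION: counts of running builds per project

picked = pick_build(candidates, running, 5)
picked.id # => 11, because project :a is already at the limit of 5
```

Whether filtering in Ruby like this keeps job picking within the SLO is exactly what the benchmark against production data would need to show.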

We have two constraining criteria. While figuring out how many of the above objectives we can achieve at once, we need to make sure both of these performance criteria are satisfied. Ultimately, none of our refactoring can come at the expense of real-world performance.

Effects on our performance against the job-picking SLO should be neutral-to-positive (no degradation)
The change in the EXPLAIN ANALYZE query plan performance for job registration should be neutral-to-positive (no degradation)

To be clear, the goal of this issue is to investigate the listed objectives and come up with a reasonable plan for achieving them, or as many of them as we decide are possible, at once. Delivering the actual changes will be tracked by a separate issue.

Edited Mar 20, 2023 by Darren Eastman