Spike: Research: CI/CD fair scheduling algorithm - capacity buckets


Description

NOTE: This issue covers the research, proof of concept, and benchmarking against production data that must happen before #341027 can take place. The POC should give a solid understanding of how projects that rarely create pipelines will take precedence in the build queue.

Our current fair scheduling mechanism is neither flexible nor efficient enough. We need a mechanism that is more flexible and does not negatively impact our availability metrics. The fair scheduling algorithm prioritizes projects that do not create many builds. This can lead to cases where CI/CD processing is delayed for projects that do create many builds.
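
As a rough illustration of the behaviour described above, here is a minimal Ruby sketch of fair ordering: pending builds are sorted by how many builds their project already has running, so quiet projects go first. The `Build` struct, the `fair_order` method, and the sample data are hypothetical and are not taken from the actual scheduler code.

```ruby
# Hypothetical in-memory model; all names here are illustrative assumptions.
Build = Struct.new(:id, :project_id)

# Orders pending builds so that projects with fewer running builds come first.
# Ties are broken by build id (lower id, i.e. older build, first).
def fair_order(pending, running_counts)
  pending.sort_by { |b| [running_counts.fetch(b.project_id, 0), b.id] }
end

pending = [Build.new(1, :busy), Build.new(2, :busy), Build.new(3, :quiet)]
running_counts = { busy: 10 } # :quiet has nothing running

ordered_ids = fair_order(pending, running_counts).map(&:id)
```

In this sample, build 3 moves ahead of the older builds 1 and 2 because its project has nothing running, which is the kind of delay for busy projects that the description refers to.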

Problem(s)

| Problem | Data to quantify the problem scope | Notes |
| --- | --- | --- |
| The fair scheduling mechanism is not flexible. | TBD | |
| The fair scheduling mechanism is not efficient. | TBD | |

Proposal

A new architecture blueprint with recommendations and guidelines for the scheduling algorithm.

How will the new algorithm be measured?

There are four objectives we want to eventually achieve by refactoring the RegisterJobService. Of these four, we'd like to achieve as many as possible at once.

  1. Allow for a running-build-per-project limit to be implemented in Ruby
  2. Make the build-selection SQL query agnostic to the project a build belongs to (it won't be needed if we can implement Objective 1)
  3. Remove ORDER BY from the build-selection query made by the RegisterJobService (likely an implementation detail of Objective 2)
  4. Move as much business logic as possible out of SQL and into Ruby. (This is more of a guiding principle)
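
One way Objectives 1 and 2 could fit together is sketched below: candidate builds arrive in plain queue order with no per-project logic in the query, and a Ruby-side filter skips any build whose project is already at its running limit. This is only a sketch under stated assumptions. `Candidate`, `pick_build`, the `running_counts` hash, and the limit value are hypothetical and do not reflect the real RegisterJobService interface.

```ruby
# Hypothetical candidate record, as it might be returned by a
# project-agnostic selection query (an assumption for illustration).
Candidate = Struct.new(:id, :project_id)

# Returns the first candidate whose project is below the running-build
# limit, or nil if every candidate's project is at or above the limit.
def pick_build(candidates, running_counts, limit)
  candidates.find { |c| running_counts.fetch(c.project_id, 0) < limit }
end

candidates = [Candidate.new(1, :a), Candidate.new(2, :a), Candidate.new(3, :b)]
running_counts = { a: 2 } # project :a already has two builds running

picked = pick_build(candidates, running_counts, 2)
```

With a limit of 2, project `:a` is skipped and the build from project `:b` is picked. Because the limit is applied in Ruby, the query itself would not need to know which project a build belongs to beyond returning its `project_id`.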

We have two constraints. While figuring out how many of the above objectives we can achieve at once, we need to make sure both of these performance criteria are satisfied. Ultimately, none of our refactoring can come at the expense of real-world performance.

  1. Effects on our performance against the job-picking SLO should be neutral-to-positive (no degradation)
  2. The change in the EXPLAIN ANALYZE query plan performance for job registration should be neutral-to-positive (no degradation)

To be clear, the goal of this issue is to investigate the listed objectives and come up with a reasonable plan for achieving them, or as many of them as we decide are possible, at once. Delivering the actual changes will be tracked by a separate issue.
