Improve load balancing over multiple CI runners by less aggressive consumption of jobs
Description
We have a handful of CI runners with a concurrency limit of 2. Most of the time the distribution of CI jobs is uneven, such that some CI runners are running 2 jobs while other CI runners have 0 jobs.
I think the basic problem is in how jobs are scheduled by GitLab, or rather that they aren't scheduled at all: they are picked off a central list by the CI runners. If a runner's concurrency limit is set to 2, it will also pick 2 jobs off the queue.
Note that I have not looked at the code; I've drawn this conclusion about the runner's behavior purely from observing it.
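To make the observed behavior concrete, here is a minimal sketch of what I believe is happening. This is not taken from the gitlab-runner source; the `runner` type, the `poll` function, and the queue are all illustrative assumptions modelling the behavior described above.

```go
// Hypothetical model of the observed behavior; names and the polling
// API are illustrative, not from the real gitlab-runner codebase.
package main

import "fmt"

type runner struct {
	name        string
	concurrency int // configured concurrency limit
	running     int // jobs currently executing
}

// poll models one polling cycle: the runner keeps requesting jobs
// until all of its slots are full, draining the queue aggressively.
func (r *runner) poll(queue *[]string) {
	for r.running < r.concurrency && len(*queue) > 0 {
		job := (*queue)[0]
		*queue = (*queue)[1:]
		r.running++
		fmt.Printf("%s picked up %s\n", r.name, job)
	}
}

func main() {
	queue := []string{"job-1", "job-2"}
	a := &runner{name: "runner-a", concurrency: 2}
	b := &runner{name: "runner-b", concurrency: 2}

	// runner-a happens to poll first and takes both jobs;
	// runner-b sits idle even though it has free capacity.
	a.poll(&queue)
	b.poll(&queue)
}
```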
Proposal
By making the CI runners less aggressive in how they pick up new jobs, we could allow more runners to consume jobs, achieving a more even distribution of jobs across runners.
Any static value for how many jobs a runner should pick off the queue is bound to be wrong, so I suggest making it a configuration option with a default value of 1. This should allow the other CI runners to also pick up some jobs before the next poll of GitLab for new jobs, so overall we get a more even load balancing of jobs.
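A minimal sketch of the proposal, continuing the model above. The `jobsPerPoll` field is a hypothetical configuration option name with the suggested default of 1; it is not an existing gitlab-runner setting.

```go
// Hypothetical sketch of the proposal; jobsPerPoll is an illustrative
// option name, not an existing gitlab-runner configuration key.
package main

import "fmt"

type runner struct {
	name        string
	concurrency int // configured concurrency limit
	jobsPerPoll int // proposed knob, default 1
	running     int // jobs currently executing
}

// poll now requests at most jobsPerPoll jobs per cycle, leaving the
// rest of the queue for other runners to pick up before the next poll.
func (r *runner) poll(queue *[]string) {
	picked := 0
	for r.running < r.concurrency && picked < r.jobsPerPoll && len(*queue) > 0 {
		job := (*queue)[0]
		*queue = (*queue)[1:]
		r.running++
		picked++
		fmt.Printf("%s picked up %s\n", r.name, job)
	}
}

func main() {
	queue := []string{"job-1", "job-2"}
	a := &runner{name: "runner-a", concurrency: 2, jobsPerPoll: 1}
	b := &runner{name: "runner-b", concurrency: 2, jobsPerPoll: 1}

	// With the default of 1, each runner takes one job per poll, so the
	// two jobs are spread across both runners instead of one taking both.
	a.poll(&queue)
	b.poll(&queue)
}
```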
The downside is that it will take slightly longer for a single runner to pick up multiple jobs, but I think this overhead is negligible compared to the duration of a normal CI job (probably minutes). If the value is made into a config knob, people could optimise it for their environments.