Implement custom job scheduling through a user-defined script to increase utilisation of CI runner resources
Description
We have a problem with inefficient use of available resources in our CI environment, where we run a mix of small and large jobs.
There are a number of x86 servers, each with ~24 physical cores (48 HT threads), ~400 GB of RAM, etc. Some of our CI jobs consume nearly all resources on a machine, so they need a runner with concurrency=1 to effectively reserve the entire machine for that one job. Other jobs consist mostly of putting a few packages together and building a Docker container; these are IO bound most of the time and won't consume more than one core, allowing as many CI jobs to run in parallel as there are CPU cores in the machine. Still other jobs fall somewhere in between, wanting a handful of cores. With the large amount of RAM available (16 GB per physical core) we are effectively always CPU bound. We can optimise for large jobs or for small ones, but I don't see a way of configuring the runners such that we can cater to both.
Our current setup dedicates one physical server to large jobs and another to small jobs. Looking at CPU usage over time, the machines are mostly idle, since most of the time there are no CI jobs to process. When we do have jobs to process, however, we want to use all available resources, and the coarse granularity currently offered by the GitLab Runner configuration doesn't seem to offer a way of achieving that.
Proposal
By declaring the resources required for a CI job, we can know before taking on the job whether we have those resources available. I think the easiest way of doing this is to use the tags already available for CI jobs.
While we are CPU bound, there might be other situations where other resource limits apply, so I don't want to limit this to just counting CPU cores. I would suggest allowing users to implement this logic themselves by essentially delegating the decision of which jobs to accept to a user-specified script.
I will admit that I'm not intimately familiar with how the scheduling of jobs happens in GitLab, so this might not be as straightforward as I'd like to think. Anyway, I envisage that the runner fetches a list of available jobs from the GitLab queue; these are then fed to a user-specified script over STDIN, probably just as a JSON blob. The user script can do whatever it wants with it and then writes the JSON blob back to its STDOUT, which the runner reads. Included in the output is whether a job should be picked up by the runner or not. In addition to newly available jobs, we should include the currently running jobs.
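To make the hand-off concrete, here is a minimal sketch of how the runner side might invoke such a script. The real runner is written in Go; Python is used here only for brevity, and the function name, the script-path argument, and the exact JSON shape are all assumptions based on the examples below, not an existing API.

```python
import json
import subprocess

def ask_scheduler(script_path, available_jobs, running_jobs):
    """Feed available and running jobs to the user script over STDIN as JSON,
    read its decisions back from STDOUT, and return the job IDs it picked.

    All names and the JSON shape are hypothetical illustrations.
    """
    payload = json.dumps({"jobs": available_jobs, "running_jobs": running_jobs})
    result = subprocess.run(
        [script_path],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    decisions = json.loads(result.stdout)
    # Pick up only the jobs the script marked with "action": "pick".
    return [j["job_id"] for j in decisions["jobs"] if j.get("action") == "pick"]
```

The runner itself stays dumb: it only serialises state, runs the script, and honours whatever "action" values come back.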
The runner gets this from GitLab:
{ "jobs": [
    { "job_id": 4, "tag": "cpu=4", ... },
    { "job_id": 5, "tag": "cpu=4", ... },
    { "job_id": 6, "tag": "cpu=10", ... }
  ]
}
It takes this together with the information about currently running jobs and sends it to the script over STDIN:
{ "jobs": [
{ "job_id": 4, "tag": "cpu=4", ... },
{ "job_id": 5, "tag": "cpu=4", ... },
{ "job_id": 6, "tag": "cpu=10", ... }
],
"running_jobs": [
{ "job_id": 1, "tag": "cpu=4", ... },
{ "job_id": 2, "tag": "cpu=4", ... },
{ "job_id": 3, "tag": "cpu=10", ... }
]
}
The script runs its logic, which in this case counts the number of CPUs consumed by the currently running jobs (10 + 4 + 4 = 18); since we have 24 cores, we can fit some more in. Our script therefore returns the jobs but adds a new key, "action", set to "pick" for picking up a job or "none" to leave it be. We pick up the first job, which consumes another 4 CPUs, reaching 22, after which we can't accept either of the remaining jobs.
{ "jobs": [
{ "job_id": 4, "tag": "cpu=4", ..., "action": "pick" },
{ "job_id": 5, "tag": "cpu=4", ..., "action": "none" },
{ "job_id": 6, "tag": "cpu=10", ..., "action": "none" }
]
}
Again, this might not be a great fit with how things actually work, but I hope these examples convey the concept.
Note how the concept of counting CPUs is kept entirely within the custom script and the tags; GitLab just treats these as opaque values and passes them along without understanding them. Also note that no actual limits are imposed by the runner, so a job tagged cpu=4 could consume all CPUs on the box when running. We therefore assume well-behaved jobs that are honest about their resource requirements and keep to them. Actually limiting resource usage is left as a separate exercise.
I believe implementing this as a custom script written by the user makes it more flexible and at the same time keeps down the size of the implementation in the runner. I am, however, open to other ideas that solve the same issue.
Links / references
Documentation blurb
Overview
What is it? Why should someone use this feature? What is the underlying (business) problem? How do you use this feature?
Use cases
This should be used by anyone who has a mix of large and small jobs ("elephants and mice") and suffers from inefficient use of CI machines due to the slicing of runners and how jobs occupy resources.
Feature checklist
Make sure these are completed before closing the issue, with a link to the relevant commit.
- [ ] Feature assurance
- [ ] Documentation
- [ ] Added to features.yml