CI job request API returns 409 Conflict when there is a large number of CI jobs
Summary
While working with a customer on an emergency (internal), we found that no runner could pick up new jobs: the API returned a 409 on every attempt, for all runners. We suspected a huge number of jobs was the cause (we did not attempt to count them, but they had been created by mistake via an infinite loop of triggered pipelines). Once we removed all pipelines from the offending project, the API behaved as expected and jobs were processed immediately, with no changes to the runners.
It therefore appears that a huge number of jobs causes the application to return a 409, though I don't see why fetching new jobs should conflict with anything.
Others have seen the 409 error when their runners were misconfigured; for example, see gitlab-runner#29466 (closed). We ruled that out: the workaround there did not change the behavior, while reducing the number of pending CI jobs did.
Steps to reproduce
I am not certain this can be reproduced (I haven't tried it myself yet). The simplest approach would be to trigger a very large number of pipelines via the API and check whether that affects a runner's ability to pick up new jobs.
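As a starting point, the reproduction attempt could flood a throwaway project with pipelines through the pipeline trigger API. This is only a sketch: `GITLAB_URL`, `PROJECT_ID`, and `TRIGGER_TOKEN` are placeholders you would substitute for a real instance, and the loop size needed to provoke the 409 is unknown.

```python
# Sketch: build the POST that creates one pipeline via the trigger API.
# GITLAB_URL, PROJECT_ID, and TRIGGER_TOKEN are placeholders, not values
# from the incident.
import urllib.parse
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance
PROJECT_ID = 42                            # hypothetical throwaway project
TRIGGER_TOKEN = "glptt-xxxx"               # pipeline trigger token (placeholder)


def build_trigger_request(ref="main"):
    """Build one POST request that creates a pipeline on `ref`."""
    url = f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/trigger/pipeline"
    data = urllib.parse.urlencode({"token": TRIGGER_TOKEN, "ref": ref}).encode()
    return urllib.request.Request(url, data=data, method="POST")


# To attempt a reproduction, send this request in a tight loop, e.g.
#   for _ in range(10_000): urllib.request.urlopen(build_trigger_request())
# then watch whether runners on an unrelated project still receive jobs.
req = build_trigger_request()
print(req.method, req.full_url)
```

The loop count is a guess; in the incident the backlog came from pipelines re-triggering each other, so the job count may need to be far larger before the 409 appears.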
What is the current bug behavior?
A huge number of pending CI jobs makes the `api/v4/jobs/request` endpoint return HTTP status code 409 until the number of jobs is reduced.
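For reference, the failing call is the poll a runner issues when asking for its next job. A minimal sketch of that request follows; the token value is a placeholder, and the real runner sends a richer payload (runner info, supported features), so treat the body here as the bare minimum rather than the exact wire format.

```python
# Sketch of the runner's job poll against the endpoint named in this issue.
# RUNNER_TOKEN and the instance URL are placeholders.
import json
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder instance
RUNNER_TOKEN = "glrt-xxxx"                 # runner authentication token (placeholder)


def build_job_request():
    """Build the POST a runner sends to ask for the next available job."""
    url = f"{GITLAB_URL}/api/v4/jobs/request"
    body = json.dumps({"token": RUNNER_TOKEN}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# On a healthy instance this answers 204 (no job available) or 201 (job
# payload). During the incident, every runner got 409 instead.
req = build_job_request()
print(req.method, req.full_url)
```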
What is the expected correct behavior?
Jobs are still handed out to runners regardless of the number of pending jobs.
Relevant logs and/or screenshots
GitLab team members can see a GitLabSOS from the customer in the ticket associated with the emergency.
Possible fixes
Looking through the code and the logs, I could not find solid evidence of what exactly causes this behavior. I did notice that the jobs API library uses a conflict helper, so something related to it may be producing the 409. In any case, I suspect it is a false positive: requesting jobs should not conflict with anything merely because many jobs exist.