Orphan rows in ci_running_builds table lead to delays in scheduling of new jobs on runners
Summary
A customer reported (ZD internal link, internal issue) that new jobs from particular projects were intermittently not picked up by a runner for long periods of time, even though jobs from other projects were picked up and run and there were no apparent runner capacity issues.
After investigation it was determined that the projects experiencing the issue had a number of rows in the `ci_running_builds` table that were associated with jobs that were not actually in the `running` state. These jobs had varying creation dates and ranged from weeks to months old. The version of GitLab running at the time of the report was 16.5.8, though older versions of GitLab were in use when some of the orphan records were created.
The count of a project's running builds from the `ci_running_builds` table is used by the fair usage algorithm to stop individual projects with a lot of jobs from preventing other projects' jobs getting timely access to runners. Rows should only persist in the table while a job is actually running, and they should be removed when the job transitions to a non-running state.
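To illustrate why stale rows delay a project's jobs, here is a deliberately simplified sketch in Python. It is not GitLab's scheduling code: the function name `order_pending_jobs`, the tuple layout, and the rule of ordering purely by recorded running count and then by job ID are all assumptions made for illustration.

```python
from collections import Counter


def order_pending_jobs(pending_jobs, running_rows):
    """Order pending jobs so projects with fewer recorded running builds go first.

    pending_jobs: list of (job_id, project_id) tuples (hypothetical layout).
    running_rows: list of project_ids, one entry per row in a
                  ci_running_builds-like table (hypothetical layout).
    """
    running_per_project = Counter(running_rows)
    # Sort by the recorded running count, then by job ID (older jobs first).
    return sorted(
        pending_jobs,
        key=lambda job: (running_per_project[job[1]], job[0]),
    )


pending = [(101, "A"), (102, "B"), (103, "A")]

# Project A has three stale rows even though none of its jobs is running.
stale_rows = ["A", "A", "A"]
print(order_pending_jobs(pending, stale_rows))
# → [(102, 'B'), (101, 'A'), (103, 'A')]

# With the stale rows removed, the oldest job is first again.
print(order_pending_jobs(pending, []))
# → [(101, 'A'), (102, 'B'), (103, 'A')]
```

In this toy model, project A's jobs are pushed behind project B's only because of rows that no longer correspond to running jobs, which matches the symptom described in this issue.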
After deleting all `ci_running_builds` rows that did not have a corresponding `ci_builds` row in the `running` state, job scheduling was seen to return to expected patterns. No further occurrences have been seen, suggesting that the root cause of the discrepancies is no longer occurring.
This issue has been created to document the workaround that was applied and to provide a place for anyone else encountering this issue to record details of their experience, to assist with identifying the root cause.
If this issue continues to be encountered without a root cause being identified, consideration should be given to implementing a cleanup process that periodically checks for and removes orphan rows from `ci_running_builds` to prevent unexpected job scheduling behaviour.
Workaround
The following SQL query deletes rows from `ci_running_builds` which don't have corresponding `ci_builds` rows in the `running` state:
delete from ci_running_builds where build_id not in (select id from ci_builds where status='running');
This is a relatively low-risk intervention as the `ci_running_builds` table is only used for job scheduling prioritisation and some runner activity metrics.
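Before running the delete, it may be worth previewing how many rows it would remove. The sketch below shows the idea using Python with an in-memory SQLite database. The two-column tables are toy stand-ins for the real GitLab schema, and the helper names `count_orphans` and `delete_orphans` are hypothetical. Only the `WHERE` condition is taken from the workaround query above.

```python
import sqlite3

# Same condition as the workaround query: rows whose build is not running.
ORPHAN_FILTER = (
    "build_id NOT IN (SELECT id FROM ci_builds WHERE status = 'running')"
)


def count_orphans(conn):
    """Preview: count the rows the workaround would delete."""
    query = f"SELECT COUNT(*) FROM ci_running_builds WHERE {ORPHAN_FILTER}"
    return conn.execute(query).fetchone()[0]


def delete_orphans(conn):
    """Apply the workaround and return the number of rows deleted."""
    cursor = conn.execute(f"DELETE FROM ci_running_builds WHERE {ORPHAN_FILTER}")
    conn.commit()
    return cursor.rowcount


# Toy tables (assumption): the real tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ci_builds (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE ci_running_builds (build_id INTEGER);
    INSERT INTO ci_builds VALUES (1, 'running'), (2, 'success'), (3, 'failed');
    INSERT INTO ci_running_builds VALUES (1), (2), (3);
    """
)

print(count_orphans(conn))   # → 2
print(delete_orphans(conn))  # → 2
print(conn.execute("SELECT build_id FROM ci_running_builds").fetchall())
# → [(1,)]
```

On a real instance the equivalent preview would be the `SELECT COUNT(*)` form of the query run in the database console, so that the row count can be checked before the `DELETE` is issued.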
Steps to reproduce
We have not been able to reproduce this issue.
Example Project
What is the current bug behavior?
New jobs from particular projects remain in the `pending` state while other jobs are picked up immediately.
What is the expected correct behavior?
All jobs should be processed according to the fair usage algorithm.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)