Query timeouts while tracking running jobs in Runner Queue
Problem
This issue was reported internally after a pipeline had jobs that got stuck and then timed out.
In that case, an Error: canceling statement due to statement timeout was raised after a job had been successfully assigned to a Runner (on GitLab Rails). The request then returned a 500 error, so the Runner moved on to another job request.
This left the jobs stuck in the running state until they were killed by StuckCiJobsWorker after the job timeout.
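The failure sequence described above can be illustrated with a minimal plain-Ruby simulation. Every name here (Job, StatementTimeout, assign_job, track_metrics, handle_job_request) and the 201 success code are hypothetical stand-ins chosen for illustration, not GitLab code; the sketch only shows why a job can stay in the running state when an error is raised after assignment.

```ruby
# Hypothetical stand-ins; none of these names come from the GitLab codebase.
class StatementTimeout < StandardError; end

Job = Struct.new(:id, :status)

# Step 1: the job is assigned and its state is changed to running.
def assign_job(job)
  job.status = :running
end

# Step 2: metrics tracking runs after assignment and, in this scenario, times out.
def track_metrics(_job)
  raise StatementTimeout, 'canceling statement due to statement timeout'
end

# Simulated request handler: the error raised after assignment becomes a 500,
# but the assignment itself is not undone.
def handle_job_request(job)
  assign_job(job)
  track_metrics(job)
  { status: 201, job_id: job.id } # assumed success response
rescue StatementTimeout
  { status: 500, job_id: nil }
end

job = Job.new(1, :pending)
response = handle_job_request(job)

# The Runner sees a 500 and never receives the job, yet the job is already running.
puts "response status: #{response[:status]}, job status: #{job.status}"
```

In this sketch nothing picks the job up again, which matches the report that the jobs stayed in running until the timeout cleanup.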
Root cause
After we assign the job to the runner, we track some metrics. One of these, jobs_running_for_project, raised the query timeout error in:
running_jobs_count = job.project.builds.running.where(runner: ::Ci::Runner.instance_type)
.limit(JOBS_RUNNING_FOR_PROJECT_MAX_BUCKET + 1).count - 1
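To make the intent of that expression clearer, here is a minimal in-memory sketch of the same capped count. It assumes a hypothetical MAX_BUCKET value of 5 and a hypothetical Build struct in place of database rows; it mirrors only the limit-then-count-minus-one arithmetic, not the actual query or its performance.

```ruby
# Hypothetical bucket cap; the real constant's value is not stated in this issue.
MAX_BUCKET = 5

# Hypothetical stand-in for a build row.
Build = Struct.new(:status, :runner_type)

# Counts running jobs on instance-type runners, reading at most MAX_BUCKET + 1
# matching rows (mirroring `.limit(MAX + 1).count`), then subtracts 1 as the
# original expression does.
def capped_running_jobs_count(builds, max_bucket = MAX_BUCKET)
  builds
    .lazy
    .select { |b| b.status == :running && b.runner_type == :instance_type }
    .first(max_bucket + 1)
    .count - 1
end

many = Array.new(10) { Build.new(:running, :instance_type) }
few  = [
  Build.new(:running, :instance_type),
  Build.new(:running, :instance_type),
  Build.new(:running, :instance_type),
  Build.new(:pending, :instance_type),
  Build.new(:running, :project_type)
]

puts capped_running_jobs_count(many) # capped at MAX_BUCKET
puts capped_running_jobs_count(few)  # three matching rows, minus one
```

The cap bounds how many rows are counted, but the database still has to find the matching rows first, which is the part that can be slow on a large builds table.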
Proposal
Can we use the new ci_running_builds table, which should be much faster?
/cc @grzesiek