
Fix allowed_plans handling for instance runners

The allowed_plans feature was added to make it possible to limit some instance runners to specified SaaS plans only.

By its nature, it works similarly to how other filters like tags, run_untagged, or protected are handled: GitLab iterates over the basic list of jobs applicable to a runner that asked for a job and excludes jobs not matching that runner. One of the matchers is allowed_plans: if it is defined and the plan related to the job (job -> project -> namespace -> plan attached to the namespace) doesn't match, the job is dropped from the list.
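The matching step can be sketched as follows. This is a minimal, illustrative model, not GitLab's actual implementation; the `Job`/`Runner` structs and the `matches_allowed_plans?` predicate are hypothetical names:

```ruby
# Illustrative model of the allowed_plans matcher (names are hypothetical,
# not the real GitLab code).
Job = Struct.new(:id, :plan, keyword_init: true)
Runner = Struct.new(:allowed_plans, keyword_init: true)

# A job matches when the runner has no plan restriction, or the plan
# resolved from job -> project -> namespace is on the runner's allow-list.
def matches_allowed_plans?(runner, job)
  runner.allowed_plans.empty? || runner.allowed_plans.include?(job.plan)
end

runner = Runner.new(allowed_plans: %w[premium ultimate])
jobs = [
  Job.new(id: 1, plan: 'free'),
  Job.new(id: 2, plan: 'ultimate')
]

# Jobs whose plan is not allowed are dropped from the candidate list.
applicable = jobs.select { |job| matches_allowed_plans?(runner, job) }
```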

To keep response times for /api/v4/jobs/request reasonable, we've defined an arbitrary MAX_QUEUE_DEPTH limit equal to 45. If a matching job is not found after 45 iterations, we return a 409 Conflict response to the runner.
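The depth-limited scan behaves roughly like the sketch below. The data model and the `request_job` method are illustrative assumptions; only MAX_QUEUE_DEPTH = 45 and the 409 Conflict behaviour come from the description above:

```ruby
# Hypothetical sketch of the bounded queue scan behind /api/v4/jobs/request.
MAX_QUEUE_DEPTH = 45

Job = Struct.new(:plan)

def request_job(allowed_plans, queue)
  # Only the first 45 queued jobs are examined per request.
  queue.first(MAX_QUEUE_DEPTH).each do |job|
    return [201, job] if allowed_plans.include?(job.plan)
  end
  [409, nil] # no applicable job within the depth limit -> 409 Conflict
end

allowed = %w[premium ultimate]

# 50 non-applicable jobs sit ahead of the single applicable one, so it is
# never reached within the 45-job window.
backlog = Array.new(50) { Job.new('free') } + [Job.new('ultimate')]
status, _job = request_job(allowed, backlog)
```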

Usually this works as expected. The problem arises when a lot of non-applicable jobs target a runner with an allowed_plans configuration.

In that case, an applicable job may never be handled at all, because there will always be 45+ non-applicable jobs ahead of it in the queue. The non-applicable jobs will hang in pending with the stuck label until the stuck builds cleaner cancels them, and at the same time the applicable jobs will hang in a normal pending state for just as long. Since stale jobs are cleaned up in roughly their creation order (the automatic cancel decision is based on the pending duration), non-applicable jobs will block the applicable ones for their entire lifetime.
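A toy simulation of this starvation, under the same assumptions as above (all names are illustrative): the applicable job sits beyond the 45-job window and only becomes visible to the runner once the cleaner has cancelled enough of the older, non-applicable jobs ahead of it.

```ruby
# Toy starvation simulation: a job is only reachable if it falls within
# the first MAX_QUEUE_DEPTH positions of the pending queue.
MAX_QUEUE_DEPTH = 45

def visible_to_runner?(job_index)
  job_index < MAX_QUEUE_DEPTH
end

# 50 non-applicable ('free') jobs, then one applicable ('ultimate') job.
queue = Array.new(50) { |i| { id: i, plan: 'free' } }
queue << { id: 50, plan: 'ultimate' }

# The applicable job is at index 50, beyond the 45-job window: starved.
blocked = !visible_to_runner?(queue.index { |j| j[:plan] == 'ultimate' })

# Only after the stuck builds cleaner cancels the 6 oldest jobs does the
# applicable job slide into the visible window (index 44 < 45).
queue.shift(6)
unblocked = visible_to_runner?(queue.index { |j| j[:plan] == 'ultimate' })
```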

We need to update the allowed_plans mechanism to prevent such locking.

One of the ideas, proposed by @mbobin, is to update Ci::PipelineCreation::DropNotRunnableBuildsService to also handle this case, just like it was done for pipeline minutes. With that, job matching against allowed_plans would be done once, at job creation time, and a job that did not match the criteria would be canceled immediately. We would not need to check it later, and most importantly, it would not generate a backlog of non-matching jobs that extends the queue and clogs it for applicable jobs.
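The proposed fix could look roughly like this. This is a hedged sketch in the spirit of Ci::PipelineCreation::DropNotRunnableBuildsService, not its actual code; the `Build` struct, `drop_not_runnable!` method, and matching predicate are assumptions:

```ruby
# Hypothetical sketch of dropping non-matching builds at creation time,
# so they never enter the pending queue at all.
Build = Struct.new(:id, :plan, :status, keyword_init: true)

def drop_not_runnable!(builds, runner_allowed_plans)
  builds.each do |build|
    # Keep builds that match the runner's allowed_plans restriction.
    next if runner_allowed_plans.empty? || runner_allowed_plans.include?(build.plan)

    # Non-matching builds are failed immediately at pipeline creation,
    # instead of hanging in pending until the stuck builds cleaner runs.
    build.status = :failed
  end
  builds.select { |b| b.status == :pending }
end

builds = [
  Build.new(id: 1, plan: 'free', status: :pending),
  Build.new(id: 2, plan: 'ultimate', status: :pending)
]

# Only the matching build survives into the queue.
runnable = drop_not_runnable!(builds, %w[premium ultimate])
```

With this approach the queue only ever contains jobs that can actually be picked up, so the MAX_QUEUE_DEPTH window is never clogged by jobs that could never match.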

Edited by Tomasz Maczukin