Capstone issue: Introduce new API for Runner to transition jobs to 'running' explicitly
Problems
Problem 1
A job, once picked up by GitLab-Runner, is moved to the `running` state immediately, even if GitLab-Runner isn't yet executing the job but is still doing prep work.
- Depending on the executor and configuration, finding capacity is done ad hoc, meaning that even though the job is in the `running` state, GitLab-Runner is actually still looking for/provisioning an environment to execute the job in.
- Even for executors that are configured to have an environment ready before requesting a job, there can still be some prep time required.
- This prep time is misleading and, in some setups, can be counted towards compute minutes.
Problem 2
Jobs are assigned to runners that go offline before starting work on the job, so they stay pending until timeout. Reports of this have increased over time, including a severity 2 bug and a recent incident.
Proposal
A solution would be to introduce a new Runner feature: when a job is picked up by Runner, the job remains in the `pending` state.
Only when GitLab-Runner has finished its preparation tasks and is ready to actually execute the job would it notify GitLab to transition the job to the `running` state.
With the new feature enabled:
- when `build.runner_id == nil`, the job is `pending` (as it is today)
- when `build.runner_id != nil`, the job remains `pending`, but no other runner will be assigned (and trace data can be submitted)
- Runner would then call a new API endpoint to transition the job state to `running`.
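As a rough illustration, the proposed lifecycle could look like the following toy model (class and method names are assumptions for illustration, not the actual GitLab code):

```ruby
# Toy model of the proposed job lifecycle. All names here are assumed
# for illustration; this is not the real Ci::Build implementation.
class ProposedBuild
  attr_reader :runner_id, :status

  def initialize
    @status = :pending
    @runner_id = nil
  end

  # With the feature enabled, assigning a runner no longer transitions
  # the job: it stays :pending, but is reserved for this runner.
  def assign_runner(runner_id)
    @runner_id = runner_id
  end

  # Called via the new endpoint once Runner has finished its prep work.
  def accept!
    raise 'no runner assigned' if @runner_id.nil?
    @status = :running
  end
end
```

The key difference from today is that `assign_runner` and the `pending` → `running` transition are two separate steps.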
See #464048 (comment 1942871742) for the anticipated database changes. (UPDATE: this should be doable without database changes, by temporarily removing the entry from `ci_pending_builds` while GitLab waits for the runner to accept the job.)
NOTE: Any changes to the flow should be reflected in the documentation, including (but not limited to):
Smart routing
There have been previous discussions about a Runner daemon/router, but those proposals relied on significant changes to GitLab.
A side effect of fixing the above problem is that we can seamlessly introduce a smart Runner router/daemon for distributing jobs more efficiently. The router would use the exact same API as GitLab-Runner does.
When a job is picked up by the router, the state remains pending. At this point, finding a Runner to execute the job is delegated to the router. When a Runner is found and executes the job, it instructs GitLab to transition the job to running.
We go from this:

```
       ┌──────► GitLab-Runner 1
       │
GitLab ├──────► GitLab-Runner 2
       │
       └──────► GitLab-Runner 3
```
To this:

```
                         ┌──────► GitLab-Runner 1
                         │
GitLab ───► Smart router ├──────► GitLab-Runner 2
                         │
                         └──────► GitLab-Runner 3
```
This doesn't solve the whole spectrum of problems we have with the existing job queue mechanism, but helps for certain setups and customers.
GitLab distributing jobs to Runners at scale is complicated.
Distributing to routers instead offloads some of that responsibility, and a router, now having its own queue of jobs, can probably be smarter about it.
Some scenarios it can help with:
- Rather than multiple runner managers asking the GitLab instance for jobs, only the daemon needs to; all the Runners ask the daemon for a job. This can help with a large fleet of self-hosted Runners hitting GitLab.com and can tighten network permissions.
- Customers can implement what they deem "fair scheduling" across their own fleet of Runners.
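As a sketch of the idea (purely illustrative; every name below is an assumption, not an existing GitLab or Runner component): the router maintains its own queue, polls GitLab once on behalf of the whole fleet, and runners poll the router instead of GitLab.

```ruby
# Illustrative sketch of a smart router's job queue. All names are
# assumptions; this is not an existing GitLab or Runner component.
class SmartRouter
  def initialize
    @queue = Queue.new # thread-safe FIFO of jobs fetched from GitLab
  end

  # The router's single polling loop against GitLab enqueues jobs here,
  # so only one client hits the GitLab instance for the whole fleet.
  def enqueue_from_gitlab(job)
    @queue << job
  end

  # Each Runner calls this instead of POST /jobs/request on GitLab.
  # Returns nil when no job is available.
  def request_job
    @queue.pop(true)
  rescue ThreadError
    nil
  end
end
```

A fair-scheduling policy could replace the FIFO `Queue` with whatever ordering the customer prefers.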
Implementation plan
Current workflow (old runner or feature flag is disabled)
1. Pending job is added to `ci_pending_builds` (`Ci::UpdateBuildQueueService#push`).
2. Runners poll for jobs (`POST /jobs/request` -> `Ci::RegisterJobService#execute`).
3. Once the pending job is assigned to a specific runner, `Ci::Build.run!` is called, which causes the `ci_builds.status` field to be updated from `:pending` to `:running` (or `:canceling`) and the state machine to call `Ci::UpdateBuildQueueService#pop` to pop the job from `ci_pending_builds`.
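The current flow can be condensed into a toy model (simplified stand-ins, not the real services):

```ruby
require 'set'

# Toy model of the current workflow. `pending` stands in for the
# ci_pending_builds table; method names mirror the services above.
class CurrentWorkflow
  attr_reader :pending, :status

  def initialize
    @pending = Set.new
    @status = {}
  end

  # Step 1: Ci::UpdateBuildQueueService#push
  def push(build_id)
    @pending.add(build_id)
    @status[build_id] = :pending
  end

  # Step 3: Ci::Build.run! transitions the job AND pops it in one step,
  # even though the runner may still be provisioning an environment.
  def run!(build_id)
    @status[build_id] = :running
    @pending.delete(build_id) # Ci::UpdateBuildQueueService#pop
  end
end
```

Note that the transition and the pop are coupled, which is exactly what the new workflow below decouples.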
New workflow (new runner with the feature flag enabled via `info.features`)
1. Pending job is added to `ci_pending_builds` (`Ci::UpdateBuildQueueService#push`).
2. Runners poll for jobs (`POST /jobs/request` -> `Ci::RegisterJobService#execute`).
3. Once the pending job is assigned to a specific runner, `Ci::RegisterJobService` should call `Ci::UpdateBuildQueueService#pop` to pop it from `ci_pending_builds`, so other runners are not assigned this job.
4. A new REST API endpoint (e.g. `POST /api/v4/jobs/{job_id}/runner_provisioning`) will allow the runner to inform the GitLab instance about 3 scenarios:

   | status | meaning | handled by | result |
   |--------|---------|------------|--------|
   | `pending` | Runner manager is still preparing the runner. | Rails app (could be Workhorse) | This call occurs regularly as a keep-alive to the GitLab instance. If the GitLab instance doesn't hear from the runner for, say, 5 minutes, it can return the job to the queue and assign it to another runner, as in the `declined` scenario. Workhorse could handle this entirely if needed by simply setting a Redis key (`build:pending_runner_queue:#{build_id}`), so we keep a list of pending jobs. |
   | `accepted` | The job has been accepted and the runner has started execution. | Rails app | Transition the job to running by calling `Ci::Build.run!`. The state machine will call `Ci::UpdateBuildQueueService#pop` to pop the job from `ci_pending_builds`; since the job is no longer present there, this is just a no-op. |
   | `declined` | The job has been declined by the runner manager. | Rails app | Push the job again to `ci_pending_builds` by calling `Ci::UpdateBuildQueueService#push`. We'll need to clear `runner_id`, `runner_manager_id`, and `runner_session_attributes` from the `ci_builds` record. Perhaps better to cancel the job and retry it. We may need a mechanism to avoid defaulting to the same runner when choosing a fallback runner for a declined job, if there's a choice of other runners. |

5. A cronjob worker (perhaps `StuckCiJobsWorker`?) goes over the Redis keys representing pending jobs and declines all jobs that have timed out without hearing further from the runner (any jobs matching Redis' `build:pending_runner_queue:*` that are too old).
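The endpoint's dispatch could be sketched as a pure function returning the actions to perform (the action encoding and all names are assumptions for illustration, not existing GitLab code):

```ruby
# Illustrative dispatch for the proposed runner_provisioning endpoint.
# Returns a list of actions rather than performing side effects, so the
# mapping from status to behaviour is easy to see. All names assumed.
KEEPALIVE_TTL = 5 * 60 # seconds of silence before the job is given up on

def runner_provisioning_actions(job_id, status)
  key = "build:pending_runner_queue:#{job_id}"
  case status
  when 'pending'  # keep-alive: refresh the Redis key so the sweep skips us
    [[:redis_set, key, KEEPALIVE_TTL]]
  when 'accepted' # prep finished: Ci::Build.run! (pop is already a no-op)
    [[:run_build, job_id], [:redis_del, key]]
  when 'declined' # clear the assignment and push back to ci_pending_builds
    [[:clear_runner, job_id], [:requeue, job_id], [:redis_del, key]]
  else
    [[:bad_request, status]]
  end
end
```

In the real implementation these actions would be Redis calls and service invocations, but the status-to-behaviour mapping is the same as the table above.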
Mermaid diagram
Adaptation of Long polling workflow:
```mermaid
sequenceDiagram
  accTitle: Long polling workflow
  accDescr: The flow of a single runner getting a job with long polling enabled
  autonumber
  participant C as Runner
  participant W as Workhorse
  participant Redis as Redis
  participant R as Rails
  participant S as Sidekiq
  C->>+W: POST /api/v4/jobs/request
  W->>+Redis: New job for runner A?
  Redis->>+W: Unknown
  W->>+R: POST /api/v4/jobs/request
  R->>+Redis: Runner A: last_update = X
  R->>W: 204 No job, X-GitLab-Last-Update = X
  W->>C: 204 No job, X-GitLab-Last-Update = X
  C->>W: POST /api/v4/jobs/request, X-GitLab-Last-Update: X
  W->>Redis: Notify when last_update changes
  Note over W: Request held in long poll
  Note over S: CI job created (ci_pending_builds)
  Note over S, Redis: Update all registered runners
  S->>Redis: Runner A: last_update = Z
  Redis->>W: Runner: last_update changed
  Note over W: Request released from long poll
  W->>R: POST /api/v4/jobs/request
  Note over R: Job removed from ci_pending_builds
  R->>W: 201 Job was scheduled
  W->>C: 201 Job was scheduled
  loop Every 5 minutes
    C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=pending
    Note over R: Redis build:pending_runner_queue:{job_id} value updated
    R->>+C: 200 OK
  end
  C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=accepted
  Note over R: Job transitioned to running and<br/>Redis build:pending_runner_queue:{job_id} value deleted
  R->>+C: 200 OK
```
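The timeout sweep from step 5 of the implementation plan could be reduced to a small pure helper (names and the data shape are assumptions for illustration):

```ruby
# Illustrative stale-job detection for the cron worker. `keepalives`
# maps job_id => unix timestamp of the last status=pending keep-alive
# (i.e. the build:pending_runner_queue:* values). All names assumed.
STALE_AFTER = 5 * 60 # seconds without a keep-alive before declining

def stale_job_ids(keepalives, now: Time.now.to_i)
  keepalives.select { |_job_id, last_seen| now - last_seen > STALE_AFTER }.keys
end
```

Each job ID returned would then be handled as a `declined` scenario: the runner assignment cleared and the job pushed back to `ci_pending_builds`.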