Capstone issue: Introduce new API for Runner to transition jobs to 'running' explicitly

Problems

Problem 1

A job, once picked up by GitLab-Runner, is moved to the "running" state immediately, even if GitLab-Runner isn't yet executing the job but is still doing preparation work.

  • Depending on the executor and configuration, finding capacity is done ad hoc, meaning that even though the job is in the running state, we're actually still looking for or provisioning an environment to execute it in.
  • Even for executors that are configured to have an environment ready before requesting a job, some preparation time can still be required.
  • This preparation time is misleading and, in some setups, can be counted towards compute minutes.

Problem 2

Jobs are sometimes assigned to runners that go offline before starting work on the job, and so stay pending until they time out. Reports of this have been increasing over time, including a severity2 bug and a recent incident.

Proposal

A solution to this would be to introduce a new Runner feature whereby a job picked up by Runner remains in the pending state.

Only when GitLab-Runner has finished preparation tasks and is ready to actually execute the job would it notify GitLab to transition the job to the running state.

With the new feature enabled:

  • when build.runner_id == nil the job is pending (as it is today)
  • when build.runner_id != nil the job remains pending, but no other runner will be assigned to it (and trace data can be submitted)
  • Runner would then call a new API endpoint to transition the job state to running.

See #464048 (comment 1942871742) for the anticipated database changes. (UPDATE: this should be doable without database changes, by temporarily removing the entry from ci_pending_builds while GitLab waits for the runner to accept the job.)
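As a rough illustration of the second bullet: with no schema change, the intermediate phase can be recognised simply because the build keeps status = pending while runner_id is already set. This is a minimal sketch only, not actual GitLab code, and the method name is invented:

```ruby
# Minimal sketch (not actual GitLab code): with the feature enabled, a job that
# is still 'pending' but already has a runner assigned is in the new
# "assigned, awaiting acceptance by the runner" phase.
def awaiting_runner_acceptance?(build)
  build.status == 'pending' && !build.runner_id.nil?
end
```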

NOTE: Any changes to the flow should be reflected in the documentation, including (but not limited to):

Smart routing

There have been previous discussions about a Runner daemon/router, but these relied on significant changes to GitLab.

A side effect of fixing the above problem is that we can seamlessly introduce a smart Runner router/daemon for distributing jobs more efficiently. The router would use exactly the same API as GitLab-Runner does.

When a job is picked up by the router, the state remains pending. At this point, finding a Runner to execute the job is delegated to the router. When a Runner is found and executes the job, it instructs GitLab to transition the job to running.

We go from this:

```
       ┌──────► GitLab-Runner 1
GitLab ├──────► GitLab-Runner 2
       └──────► GitLab-Runner 3
```

To this:

```
                         ┌──────► GitLab-Runner 1
GitLab ───► Smart router ├──────► GitLab-Runner 2
                         └──────► GitLab-Runner 3
```

This doesn't solve the whole spectrum of problems we have with the existing job queue mechanism, but it helps for certain setups and customers.

GitLab distributing jobs to Runners at scale is complicated.

GitLab distributing to routers instead frees GitLab of some of that responsibility, and a router, now having its own queue of jobs, can probably be smarter about distribution.

Some scenarios it can help with:

  • Rather than multiple runner managers asking the GitLab instance for jobs, only the daemon needs to; all the Runners ask the daemon for a job instead. This can help with a large fleet of self-hosted Runners hitting GitLab.com and allows network permissions to be tightened.
  • Customers can implement whatever they deem "fair scheduling" across their own fleet of Runners.

Implementation plan

Current workflow (old runner, or feature flag disabled)

  1. Pending job is added to ci_pending_builds (Ci::UpdateBuildQueueService#push).
  2. Runners poll for jobs (POST /jobs/request -> Ci::RegisterJobService#execute).
  3. Once the pending job is assigned to a specific runner, Ci::Build.run! is called, which updates the ci_builds.status field from :pending to :running (or :canceling) and causes the state machine to call Ci::UpdateBuildQueueService#pop to pop the job from ci_pending_builds.
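In other words, assignment and the pending-to-running transition currently happen in a single step. A simplified sketch (not the actual GitLab implementation) of that path:

```ruby
# Simplified sketch of the current behaviour, not the actual GitLab code.
# The runner is recorded on the build and the build is immediately transitioned
# to 'running', even though the runner may still be provisioning an environment.
def assign_and_run(build, runner)
  build.runner_id = runner.id
  build.run!  # :pending -> :running; a state machine callback then calls
              # Ci::UpdateBuildQueueService#pop to remove it from ci_pending_builds
end
```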

New workflow (new runner advertising support via info.features, with the feature flag enabled)

  1. Pending job is added to ci_pending_builds (Ci::UpdateBuildQueueService#push).

  2. Runners poll for jobs (POST /jobs/request -> Ci::RegisterJobService#execute).

  3. Once the pending job is assigned to a specific runner, Ci::RegisterJobService should call Ci::UpdateBuildQueueService#pop to pop it from ci_pending_builds, so other runners are not assigned this job.

  4. A new REST API endpoint (e.g. POST /api/v4/jobs/{job_id}/runner_provisioning) will allow the runner to inform the GitLab instance about 3 scenarios:

     | Status | Meaning | Handled by | Result |
     |--------|---------|------------|--------|
     | pending | Runner manager is still preparing the runner. | Rails app (could be Workhorse) | This call occurs regularly as a keep-alive to the GitLab instance. If the GitLab instance doesn't hear from the runner for, say, 5 minutes, it can return the job to the queue and assign it to another runner, as in the declined scenario. Workhorse could handle this entirely if needed by simply setting a Redis key (build:pending_runner_queue:#{build_id}), so that we keep a list of pending jobs. |
     | accepted | The job has been accepted and the runner has started execution. | Rails app | Transition the job to running by calling Ci::Build.run!. The state machine will call Ci::UpdateBuildQueueService#pop to pop the job from ci_pending_builds; since the job is no longer present there, this is a no-op. |
     | declined | The job has been declined by the runner manager. | Rails app | Push the job back to ci_pending_builds by calling Ci::UpdateBuildQueueService#push. We'll need to clear runner_id, runner_manager_id, and runner_session_attributes from the ci_builds record. It may be better to cancel the job and retry it. We may also need a mechanism to avoid defaulting to the same runner when choosing a fallback runner for a declined job, if other runners are available. |
  5. A cronjob worker (perhaps StuckCiJobsWorker?) iterates over the Redis keys representing pending jobs and declines every job that has timed out without further word from the runner (any key matching Redis' build:pending_runner_queue:* that is too old). A rough sketch of the endpoint handling and this sweep follows below.
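The following Ruby sketch makes steps 4 and 5 concrete. It is illustrative only and not the final design: RunnerProvisioningService, PendingRunnerQueueSweep, and PENDING_RUNNER_TIMEOUT are invented names, the exact arguments to Ci::UpdateBuildQueueService#push are elided, and authentication/authorization is ignored.

```ruby
require 'redis'

# Hypothetical sketch of the endpoint behaviour and the timeout sweep.
# Only Ci::Build, Ci::UpdateBuildQueueService, build.run!, and the
# build:pending_runner_queue:* key format come from the plan above;
# everything else is an assumption for illustration.
class RunnerProvisioningService
  PENDING_RUNNER_TIMEOUT = 5 * 60 # seconds

  def initialize(build, redis: Redis.new)
    @build = build
    @redis = redis
  end

  def execute(status)
    key = "build:pending_runner_queue:#{@build.id}"

    case status
    when 'pending'
      # Keep-alive: record when we last heard from the runner manager.
      @redis.set(key, Time.now.to_i)
    when 'accepted'
      # The runner is actually executing the job now: transition to running.
      # The state machine's pop from ci_pending_builds becomes a no-op.
      @redis.del(key)
      @build.run!
    when 'declined'
      # Clear the assignment so another runner can pick the job up; the job
      # would be re-enqueued via Ci::UpdateBuildQueueService#push (call elided).
      @redis.del(key)
      @build.update!(runner_id: nil, runner_manager_id: nil)
    end
  end
end

# Hypothetical sweep (could live in StuckCiJobsWorker): treat any job whose
# keep-alive has not been refreshed within the timeout as declined.
class PendingRunnerQueueSweep
  def initialize(redis: Redis.new)
    @redis = redis
  end

  def perform
    @redis.scan_each(match: 'build:pending_runner_queue:*') do |key|
      last_seen = @redis.get(key).to_i
      next if Time.now.to_i - last_seen < RunnerProvisioningService::PENDING_RUNNER_TIMEOUT

      build_id = key.split(':').last.to_i
      build = Ci::Build.find_by(id: build_id)
      RunnerProvisioningService.new(build, redis: @redis).execute('declined') if build
    end
  end
end
```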

Mermaid diagram

Adaptation of Long polling workflow:

```mermaid
sequenceDiagram
accTitle: Long polling workflow
accDescr: The flow of a single runner getting a job with long polling enabled

    autonumber
    participant C as Runner
    participant W as Workhorse
    participant Redis as Redis
    participant R as Rails
    participant S as Sidekiq
    C->>+W: POST /api/v4/jobs/request
    W->>+Redis: New job for runner A?
    Redis->>+W: Unknown
    W->>+R: POST /api/v4/jobs/request
    R->>+Redis: Runner A: last_update = X
    R->>W: 204 No job, X-GitLab-Last-Update = X
    W->>C: 204 No job, X-GitLab-Last-Update = X
    C->>W: POST /api/v4/jobs/request, X-GitLab-Last-Update: X
    W->>Redis: Notify when last_update change
    Note over W: Request held in long poll
    Note over S: CI job created (ci_pending_builds)
    Note over S, Redis: Update all registered runners
    S->>Redis: Runner A: last_update = Z
    Redis->>W: Runner: last_update changed
    Note over W: Request released from long poll
    W->>R: POST /api/v4/jobs/request
    Note over R: Job removed from ci_pending_builds
    R->>W: 201 Job was scheduled
    W->>C: 201 Job was scheduled
    loop Every 5 minutes
    C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=pending
    Note over R: Redis build:pending_runner_queue:{job_id} value updated
    R->>+C: 200 OK
    end
    C->>+R: POST /api/v4/jobs/{job_id}/runner_provisioning?status=accepted
    Note over R: Job transitioned to running and<br/>Redis build:pending_runner_queue:{job_id} value deleted
    R->>+C: 200 OK
```
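For completeness, the provisioning calls in the loop above could look like this from the client side. The real GitLab-Runner is written in Go; this Ruby snippet only illustrates the proposed request sequence against the endpoint from step 4, and authentication is deliberately omitted.

```ruby
require 'net/http'
require 'uri'

GITLAB_URL = 'https://gitlab.example.com' # placeholder instance URL

# Notify GitLab of the provisioning status for a job
# (endpoint as proposed above; auth and error handling omitted).
def notify_provisioning_status(job_id, status)
  uri = URI("#{GITLAB_URL}/api/v4/jobs/#{job_id}/runner_provisioning?status=#{status}")
  Net::HTTP.post(uri, '')
end

# While preparing the execution environment, send a keep-alive periodically:
#   notify_provisioning_status(job_id, 'pending')
# Once the job actually starts executing:
#   notify_provisioning_status(job_id, 'accepted')
# If the environment can't be provisioned:
#   notify_provisioning_status(job_id, 'declined')
```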