Runner Job Lifecycle - Goeun Gil
module-name: "Runner Job Lifecycle"
area: "Product Knowledge"
gitlab-group: "Runner"
maintainers:
- tmike
- Overview
- Stage 0: Create Your Module
- Stage 1: Module Prerequisites
- Stage 2: Prepare test system
- Stage 3: Communication between the Runner and GitLab
- Stage 4: Understanding what jobs are
- Stage 5: Job creation
- Stage 6: Pending status
- Stage 7: Running status
- Stage 8: Update job trace log
- Stage 9: Job status
- Stage 10: Job status updates from Runner
- Stage 11: Job status update from GitLab
- Stage 12: Review
- Final Stage
Overview
Goal: You understand GitLab CI/CD jobs, and how they transition to different statuses over their lifecycle.
Objectives: What you'll get out of this module:
- Understand how the Runner communicates with GitLab
- Understand the various REST API calls the Runner makes to GitLab, and what they do
- Understand how the job progresses to different statuses
- Be aware of the various factors that influence a job's status
- Troubleshoot issues with jobs across GitLab and Runner
General Timeline and Expectations
- This module should take you 4-5 hours to complete
- Read about our Support Onboarding process
Where to ask for help
If you have questions at any time, ask in Slack: #spt_pod_runner.
Stage 0: Create Your Module
- Create an issue using this template by making the Issue Title: <module title> - <your name>.
- Add yourself and your trainer as the assignees.
- Set milestones, if applicable, and a due date to help motivate yourself!
Stage 1: Module Prerequisites
- Done with Stage 1
- A single node Omnibus installation, and a GitLab Runner.
  - ❗ Important: For compatibility with the training material, make sure that both Runner and GitLab versions are at least v18.0.0
- Completed "GitLab Runner" training module
- Completed "Continuous Integration" training module
Stage 2: Prepare test system
- Done with Stage 2
- Ensure GitLab and Runner are both working, and network accessible to each other over HTTPS.
- Register the Runner to your GitLab instance. Ensure the Runner has a tag of runner-job-lifecycle.
- Create a RUNNER_TOKEN environment variable in your shell environment. The value should be the Runner authentication token for your registered Runner in /etc/gitlab-runner/config.toml.
- Create a GITLAB_HOSTNAME environment variable in your shell environment, with the value of your GitLab's domain name.
- Modify the url for the registered Runner to a dummy URL in /etc/gitlab-runner/config.toml. This prevents the Runner from communicating with GitLab via the API, and lets us make manual API requests to see how the Runner manages jobs.
- Create a new project called runner-job-lifecycle with the following CI/CD YAML:

      test:
        script:
          - echo "doing something"
        tags:
          - runner-job-lifecycle
Stage 3: Communication between the Runner and GitLab
- Done with Stage 3
Communication between the Runner and GitLab occurs via REST API calls. The communication is almost always initiated by the Runner. Any data sent to the Runner from GitLab occurs in the context of a response to an API endpoint.
The Runner's responsibilities are to:
- Request jobs from GitLab
- Ensure the build environment is prepared for the job to execute (the executor does this)
- Execute the job's logic
- Send job trace log back to GitLab for storage
- Return the job's status
Stage 4: Understanding what jobs are
- Done with Stage 4
In Rails, a job is represented by the Ci::Build model, which inherits from other classes:
Ci::Build < Ci::Processable < CommitStatus
A job has a status. Each of the classes shown above has state machine logic to determine what states a job can have, and the rules for progressing from one state to another.
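To make the state machine idea concrete, here is a minimal plain-Ruby sketch. It is not GitLab's implementation: the TRANSITIONS table and the JobSketch class are hypothetical, the event names other than run and drop are assumptions, and only a handful of the statuses covered in this module are modelled.

```ruby
# Simplified illustration only; GitLab's real state machine has more
# statuses, events, and callbacks than this hypothetical table.
TRANSITIONS = {
  enqueue: { "created" => "pending", "preparing" => "pending" },
  run:     { "pending" => "running" },
  success: { "running" => "success" },
  drop:    { "pending" => "failed", "running" => "failed", "canceling" => "canceled" }
}.freeze

class JobSketch
  attr_reader :status

  def initialize
    @status = "created" # every job in this sketch starts as created
  end

  # Fire an event; raise if the event is not allowed from the current status.
  def fire!(event)
    allowed = TRANSITIONS.fetch(event)
    @status = allowed.fetch(@status) do
      raise ArgumentError, "cannot #{event} from #{@status}"
    end
  end
end

job = JobSketch.new
job.fire!(:enqueue)
job.fire!(:run)
puts job.status
```

The point of the table is that each event lists the statuses it may start from, so an invalid transition (for example, run on a created job) is rejected instead of silently changing the status.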
A job object belongs to a:
- Pipeline; and
- Runner
Each job has some metadata associated with it, in the Ci::BuildMetadata model.
Outside of GitLab, a job is represented as a "build environment" in which scripts are executed, with additional context such as environment variables. The build environment can be inside a shell, on a separate machine, in a Docker container, in a Kubernetes pod, etc.
Stage 5: Job creation
- Done with Stage 5
A job is created during the creation of a pipeline. A pipeline can have one or many jobs.
A newly created job starts in the created status.
Task: Get a job stuck at Created state
- In your GitLab Rails console, disable the Ci::InitialPipelineProcessWorker Sidekiq worker as below:

      Feature.disable(:"run_sidekiq_jobs_Ci::InitialPipelineProcessWorker")

- Create a new pipeline for your runner-job-lifecycle project. Notice that the job is stuck at the created state.
Stage 6: Pending status
- Done with Stage 6
A job may enter the preparing status after creation if it has unmet prerequisites. We can check whether a job went to the preparing status, because the Ci::BuildPrepareWorker asynchronous job executes after the status transitions to preparing.
Ultimately, a job enters the pending status via a state machine event. We can verify that the job went to the pending status, because the BuildQueueWorker asynchronous job executes after the status transitions to pending.
Job token creation
The CI/CD job token is created before the job is set to pending. We can check the job token from the Rails Console.
Task: Get job to Pending state
- Before we transition the job to the pending state, let's confirm that the CI/CD job token exists. Execute the following in your Rails console:

      job = Ci::Build.last
      job.variables.to_hash["CI_JOB_TOKEN"]

  - You'll see the unencrypted job token returned.
- Re-enable the Sidekiq workers to allow the job to get to pending. In your GitLab Rails console, enable the Ci::InitialPipelineProcessWorker Sidekiq worker as below:

      Feature.enable(:"run_sidekiq_jobs_Ci::InitialPipelineProcessWorker")

- Navigate to your GitLab instance's Admin area > Monitoring > Background jobs UI. In the Scheduled tab, you'll notice the Ci::InitialPipelineProcessWorker is scheduled to run. Once that completes, the status of the job should be pending.
- After the Sidekiq worker executes, you can confirm that the job's status is now pending.
- From the Sidekiq log, verify that the BuildQueueWorker asynchronous Sidekiq job executed for that job (the worker's args should contain the job id):

      grep -ir '"class":"BuildQueueWorker"' /var/log/gitlab/sidekiq/current | jq -rc '[.class, .args, .job_status]'
      ["BuildQueueWorker",["1"],"start"]
      ["BuildQueueWorker",["1"],"done"]
      ["BuildQueueWorker",["2"],"start"]
      ["BuildQueueWorker",["3"],"start"]
      ["BuildQueueWorker",["2"],"done"]
      ["BuildQueueWorker",["3"],"done"]
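If you'd rather filter the log for one specific job id, the same check can be sketched in plain Ruby. This is only an illustration: the worker_events and parse_json helpers are hypothetical, the field names (class, args, job_status) are taken from the jq filter above, and the sample lines are made up rather than copied from a real log.

```ruby
require "json"

# Made-up sample lines, shaped like the fields used by the jq filter.
SAMPLE_LOG = [
  '{"class":"BuildQueueWorker","args":["1"],"job_status":"start"}',
  '{"class":"BuildQueueWorker","args":["1"],"job_status":"done"}',
  '{"class":"BuildQueueWorker","args":["2"],"job_status":"start"}',
  "this line is not JSON"
].freeze

# Parse one log line; return nil when the line is not valid JSON.
def parse_json(line)
  JSON.parse(line)
rescue JSON::ParserError
  nil
end

# Return [class, args, job_status] for entries of the given worker
# whose args contain the given job id.
def worker_events(log_lines, worker_class, job_id)
  log_lines.filter_map do |line|
    entry = parse_json(line)
    next unless entry.is_a?(Hash) && entry["class"] == worker_class
    next unless Array(entry["args"]).map(&:to_s).include?(job_id.to_s)

    [entry["class"], entry["args"], entry["job_status"]]
  end
end

worker_events(SAMPLE_LOG, "BuildQueueWorker", 1).each { |event| puts event.inspect }
```

Seeing both a start and a done entry for your job id suggests the worker ran to completion for that job.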
Stage 7: Running status
- Done with Stage 7
A job is set to running once it has been assigned to a Runner. The Runner should be making requests for jobs periodically based on its configuration.
A Runner is assigned a job after it makes a request to GitLab via the POST /api/v4/jobs/request API endpoint. GitLab will find candidate jobs, and assign one if a match is found.
Assigning jobs to Runners
When a suitable candidate job is found for the Runner making the API request, the job is assigned to that Runner:
- The Ci::Build object's runner_id attribute is set to the Runner assigned to the job
- The Ci::Build object's run! method is executed
- The job and its metadata are sent to the Runner as a JSON payload
When the run! method is executed, the status of the job is changed from pending to running.
Note: The status is set to running even before GitLab sends the job and its metadata back to the Runner for execution.
Job request response
An API request to the POST /api/v4/jobs/request endpoint will result in one of the following HTTP codes:
- 201 - Job was scheduled
- 204 - No job for Runner
- 403 - Forbidden
- 409 - Conflict
5xx errors can also be returned, but these are the main status codes used.
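When reading Runner or proxy logs, it can help to keep these codes in a small lookup. The sketch below is a hypothetical helper (describe_job_request is not part of GitLab or the Runner); the meanings are the ones listed in this stage, and anything else is simply reported as unexpected.

```ruby
# Meanings of the main response codes for POST /api/v4/jobs/request,
# as listed in this stage.
JOB_REQUEST_RESPONSES = {
  201 => "Job was scheduled",
  204 => "No job for Runner",
  403 => "Forbidden",
  409 => "Conflict"
}.freeze

# Hypothetical helper: describe a response code for troubleshooting notes.
def describe_job_request(code)
  return JOB_REQUEST_RESPONSES[code] if JOB_REQUEST_RESPONSES.key?(code)
  return "Server error" if (500..599).cover?(code)

  "Unexpected response"
end

[201, 204, 502].each { |code| puts "#{code}: #{describe_job_request(code)}" }
```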
Task: Assign a job to a Runner
Let's try to get a 201 response by accessing the POST /api/v4/jobs/request API endpoint. We will use the Runner authentication token to authenticate.
- Make an API request to GitLab and save the payload to a file:

      curl -X POST \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
        -d "token=$RUNNER_TOKEN" \
        -v > "payload.json"

- The response code should be a 201. The job payload (JSON) should be saved as payload.json.
- The job status should now be running.
- The job should now have a runner_id assigned to it. The Rails snippet below will return the ID of the Runner assigned to the job. Ensure you change the job_id variable to the id of your job:

      job_id = 2
      job = Ci::Build.find(job_id)
      job.runner_id
      => 1

- View the payload.json file. It contains the job token, Git repository information, timeout values, CI/CD variables, etc.
- Remove the payload.json file. This file can contain sensitive information.
Stage 8: Update job trace log
- Done with Stage 8
The job trace log is the output of the script being executed. This is sent periodically by the Runner, but also when the job completes.
Task: Send a job trace log update to GitLab
This requires making an API request to GitLab using the PATCH /api/v4/jobs/:id/trace endpoint. It uses the CI/CD job token, instead of the Runner authentication token.
- Get your job's token via the Rails console:

      job.variables.to_hash["CI_JOB_TOKEN"]

- Ensure an environment variable exists with the name JOB_TOKEN with the value of your job token from the previous step.

      JOB_TOKEN="your_token_value"

- Make an API request to GitLab. Ensure you set job_id to your specific job id:

      job_id=123
      trace_message="Hello! This is a job log update from the Runner."
      content_size=${#trace_message}
      content_range=0-$(($content_size-1))
      curl -X PATCH \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id/trace" \
        -H "Content-Range:$content_range" \
        -H "Job-Token:$JOB_TOKEN" \
        -d "$trace_message" \
        -v

- The response code should be 202.
- Confirm that the job log in the UI contains the message we sent.
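The Content-Range value in the task above is just the first and last byte position of the chunk being sent. A minimal sketch of that arithmetic follows. The content_range helper is hypothetical, and it assumes that a later chunk starts at the byte offset where the previous one ended; the task above only sends a first chunk starting at 0.

```ruby
# Hypothetical helper: build a Content-Range value ("first-last") for a chunk.
# offset is the number of bytes already sent; the first chunk starts at 0.
# bytesize is used because the range counts bytes, not characters.
def content_range(offset, chunk)
  raise ArgumentError, "chunk must not be empty" if chunk.empty?

  "#{offset}-#{offset + chunk.bytesize - 1}"
end

first_chunk  = "Hello! This is a job log update from the Runner."
second_chunk = "\nA second update." # made-up follow-up message

puts content_range(0, first_chunk)
puts content_range(first_chunk.bytesize, second_chunk)
```

Note that the shell expansion ${#trace_message} counts characters, so a message with multi-byte characters may need a byte count instead.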
Stage 9: Job status
- Done with Stage 9
Completed statuses
Once a job has been set to a completed status ([:success, :failed, :canceled]), the Ci::BuildFinishedWorker Sidekiq worker is asynchronously executed. The presence of this Sidekiq worker in logs indicates the job has completed.
Code reference
We can be confident that the Ci::BuildFinishedWorker is an indicator of job completion because the Sidekiq worker is enqueued as part of the Ci::Build model's state machine logic.
The Sidekiq worker will be enqueued when the CI/CD job's status is set to a completed status.
If the job gets to a failed status, and auto retry is allowed, the job is retried.
Code reference
Job status influencing Pipeline status
When a job changes status, it causes the PipelineProcessWorker to be executed asynchronously. This eventually executes logic that will ensure the pipeline’s status is an aggregation of its job’s statuses.
Failure reason
A job can fail for a variety of reasons. The job failure reason is important in understanding why a job failed. It can also tell us where in the job lifecycle a job failed.
Failure reasons sent by Runner
The Runner will make an API request to GitLab to change the job status. One of the parameters sent in that API request is the failure_reason. The following failure reasons are set by the Runner upon failure:
- script_failure
- runner_system_failure
- job_execution_timeout
- image_pull_failure
- unknown_failure
Failure reasons used by GitLab
When GitLab fails a job, it will set a failure reason. These failure reasons are too numerous to list here.
A full list of the failure reasons is available in the Enums::Ci::CommitStatus module.
Because there are many more failure reasons specific to GitLab, and the Runner can fail a job for only a limited number of reasons, we can use the failure reason to troubleshoot why a job has failed.
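As a rough troubleshooting aid, the distinction can be sketched as a lookup. This is a heuristic under stated assumptions, not GitLab logic: the likely_failed_by helper is hypothetical, and it treats only the five Runner-reported reasons listed earlier as coming from the Runner and everything else as set by GitLab.

```ruby
require "set"

# Failure reasons this module lists as being sent by the Runner.
RUNNER_FAILURE_REASONS = Set[
  "script_failure",
  "runner_system_failure",
  "job_execution_timeout",
  "image_pull_failure",
  "unknown_failure"
].freeze

# Hypothetical heuristic: guess which side most likely failed the job.
def likely_failed_by(failure_reason)
  RUNNER_FAILURE_REASONS.include?(failure_reason.to_s) ? :runner : :gitlab
end

puts likely_failed_by("script_failure")
puts likely_failed_by("stuck_or_timeout_failure")
```

Treat the result as a starting point for where to look (Runner logs or GitLab logs), not as proof of which component failed the job.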
Failure reason mapping in the UI
The failure reasons might be visible in logs; however, the UI maps these strings to more descriptive messages.
See the mapping between the failure reason and the message displayed in the UI.
Stage 10: Job status updates from Runner
- Done with Stage 10
The Runner has now received the job from GitLab, and will start executing it. The Runner will update the job’s status during execution using the PUT /api/v4/jobs/:id API endpoint.
Task: Change job status via API
This requires making an API request to GitLab using the PUT /api/v4/jobs/:id endpoint. It uses the CI/CD job token, instead of the Runner authentication token.
- Make an API request to GitLab. Ensure you set job_id to your specific job id:

      job_id=123
      state="failed"
      failure_reason="script_failure"
      curl -X PUT \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id" \
        -d "token=$JOB_TOKEN" \
        -d "state=$state" \
        -d "failure_reason=$failure_reason" \
        -d "exit_code=1" \
        -v

- Confirm the request receives a 200 response.
- Confirm the job's status is now failed.
Stage 11: Job status update from GitLab
- Done with Stage 11
The Runner is not the only thing that can influence the job’s status once it's started executing.
Jobs are dropped if they've been stuck at the pending, running, scheduled, or canceling status for longer than the timeout. The StuckCiJobsWorker cron-scheduled worker provides this functionality.
The StuckCiJobsWorker runs every hour at the cron schedule 0 * * * *. It will then check stuck pending jobs that are older than 1 hour.
drop is a state machine event, which takes the job from canceling to canceled, or any other status to failed.
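The two rules above (a job counts as stuck once it is older than a timeout, and drop ends in canceled or failed depending on the starting status) can be sketched in plain Ruby. Both helpers are hypothetical and simplified; the real workers apply more conditions, and the one-hour timeout used in the usage lines is the pending figure quoted above.

```ruby
# Status a job ends up in when the drop event fires, per the rule above:
# canceling becomes canceled, any other status becomes failed.
def status_after_drop(status)
  status == "canceling" ? "canceled" : "failed"
end

# Hypothetical stuck check: the job is stuck when its last update is
# older than the timeout (in seconds).
def stuck?(updated_at, timeout_seconds, now = Time.now)
  now - updated_at > timeout_seconds
end

one_hour = 60 * 60
puts stuck?(Time.now - (65 * 60), one_hour)
puts status_after_drop("pending")
puts status_after_drop("canceling")
```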
Task: Job failure stuck at pending
Let's get a job to be stuck at pending, and allow it to have its status changed by GitLab.
- Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.
- In the Rails console, change the job's created and updated date to be 25 hours ago:

      job_id = 123
      job = Ci::Build.find(job_id)
      job.created_at = 25.hours.ago
      job.updated_at = 25.hours.ago
      job.save!

- Execute the StuckCiJobsWorker:

      StuckCiJobsWorker.perform_async

- Confirm that the job has been set to failed status, and check the failure reason is stuck_or_timeout_failure:

      irb(main):036:0> job_id = 123
      => 123
      irb(main):037:0> job = Ci::Build.find(job_id)
      => #<Ci::Build:0x00007f14f5230b48 ...
      irb(main):038:0> job.status
      => "failed"
      irb(main):039:0> job.failure_reason
      => "stuck_or_timeout_failure"
Task: Job failure stuck at running
Let's get a job to be stuck at running, and allow it to have its status changed by GitLab.
- Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.
- Make an API request to GitLab to get the job assigned and set to running:

      curl -X POST \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
        -d "token=$RUNNER_TOKEN"

- Confirm the job is set to running.
- In the Rails console, change the job's created and updated date to be 65 minutes ago:

      job_id = 123
      job = Ci::Build.find(job_id)
      job.created_at = 65.minutes.ago
      job.updated_at = 65.minutes.ago
      job.save!

- Execute the StuckCiJobsWorker:

      StuckCiJobsWorker.perform_async

- Navigate to your GitLab instance's Admin area > Monitoring > Background jobs UI. In the Scheduled tab, you'll notice the Ci::StuckBuilds::DropRunningWorker is scheduled to run. The StuckCiJobsWorker schedules additional delayed workers to run later. Once the Ci::StuckBuilds::DropRunningWorker job completes, the status of the job should be failed.
- Wait for the scheduled Sidekiq Ci::StuckBuilds::DropRunningWorker task to run, then confirm that the job has been set to failed status, and check the failure reason is stuck_or_timeout_failure:

      irb(main):048:0> job_id = 123
      => 123
      irb(main):049:0> job = Ci::Build.find(job_id)
      => #<Ci::Build:0x00007f14f58c6188 ...
      irb(main):050:0> job.status
      => "failed"
      irb(main):051:0> job.failure_reason
      => "stuck_or_timeout_failure"
Task: Cancel job
- Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.
- Make an API request to GitLab to get the job assigned and set to running:

      curl -X POST \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
        -d "token=$RUNNER_TOKEN"

- Confirm the job is set to running.
- Set the job's Runner Manager's runtime features to allow jobs to be canceled gracefully:

      job_id = 123
      job = Ci::Build.find(job_id)
      runner_manager = job.runner_manager
      runner_manager.update!(runtime_features: { 'cancel_gracefully' => true })
      runner_manager.reload
      job.cancel_gracefully?

- Cancel the job in the Rails console:

      job_id = 123
      job = Ci::Build.find(job_id)
      job.cancel

- Confirm the job is in the canceling status.
- Get your job's token via the Rails console:

      job.variables.to_hash["CI_JOB_TOKEN"]

- Ensure an environment variable exists with the name JOB_TOKEN with the value of your job token from the previous step.

      JOB_TOKEN="your_token_value"

- Update the job's trace log via API. Ensure you set job_id to your specific job id:

      job_id=123
      trace_message="Hello! This is a job log update from the Runner."
      content_size=${#trace_message}
      content_range=0-$(($content_size-1))
      curl -X PATCH \
        "https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id/trace" \
        -H "Content-Range:$content_range" \
        -H "Job-Token:$JOB_TOKEN" \
        -d "$trace_message" \
        -v

- Confirm the response is 202. There should be a response header called job-status, which contains canceling. This is the mechanism that allows the Runner to cancel the job gracefully. The Runner will then stop executing the job, and report the final status to GitLab.
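To picture how a client could react to that header, here is a small sketch. It is not the Runner's actual code: cancel_requested? is a hypothetical helper, and the headers are passed in as a plain Hash rather than read from a real HTTP response.

```ruby
# Hypothetical helper: decide whether to stop the job, based on the
# response headers of a trace update. Header names are matched
# case-insensitively, since HTTP header names are not case-sensitive.
def cancel_requested?(headers)
  status = headers.find { |name, _value| name.to_s.downcase == "job-status" }&.last
  status == "canceling"
end

puts cancel_requested?({ "Job-Status" => "canceling" })
puts cancel_requested?({ "job-status" => "running" })
```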
Stage 12: Review
- Done with Stage 12
Any updates or improvements needed? If there are any dead links, out of date or inaccurate content, missing content whether in this module's issue template or in other documentation, list them below as tasks for yourself! Once ready, have a maintainer or manager review.
- Update ...
Final Stage
- Have your trainer review your tickets and assessment. If you do not have a trainer, ask an expert to review.