Runner Job Lifecycle - Goeun Gil

module-name: "Runner Job Lifecycle"
area: "Product Knowledge"
gitlab-group: "Runner"
maintainers:
  - tmike

Overview
- General Timeline and Expectations
- Where to ask for help
Stage 0: Create Your Module
Stage 1: Module Prerequisites
Stage 2: Prepare test system
Stage 3: Communication between the Runner and GitLab
Stage 4: Understanding what jobs are
Stage 5: Job creation
- Task: Get a job stuck at Created state
Stage 6: Pending status
- Job token creation
- Task: Get job to Pending state
Stage 7: Running status
Stage 8: Update job trace log
- Task: Send a job trace log update to GitLab
Stage 9: Job status
Stage 10: Job status updates from Runner
- Task: Change job status via API
Stage 11: Job status update from GitLab
Stage 12: Review
Final Stage

Overview

Goal: You understand GitLab CI/CD jobs, and how they transition to different statuses over their lifecycle.

Objectives: What you'll get out of this module:

Understand how the Runner communicates with GitLab
Understand the various REST API calls the Runner makes to GitLab, and what they do
Understand how the job progresses to different statuses
Be aware of the various factors that influence a job's status
Troubleshoot issues with jobs across GitLab and Runner

General Timeline and Expectations

This module should take you 4-5 hours to complete
Read about our Support Onboarding process

Where to ask for help

If you have questions at any time, ask in Slack: #spt_pod_runner.

Stage 0: Create Your Module

Create an issue using this template, with the issue title: <module title> - <your name>.
Add yourself and your trainer as the assignees.
Set milestones, if applicable, and a due date to help motivate yourself!

Stage 1: Module Prerequisites

Done with Stage 1

A single node Omnibus installation, and a GitLab Runner.
- ❗ Important: To ensure compatibility with the training material, both the Runner and GitLab versions must be at least v18.0.0.
Completed "GitLab Runner" training module
Completed "Continuous Integration" training module

Stage 2: Prepare test system

Done with Stage 2

Ensure GitLab and Runner are both working, and network accessible to each other over HTTPS
Register the Runner to your GitLab instance. Ensure the Runner has a tag of runner-job-lifecycle.
Create a RUNNER_TOKEN environment variable in your shell environment. The value should be the Runner authentication token for your registered Runner in the /etc/gitlab-runner/config.toml.
Create a GITLAB_HOSTNAME environment variable in your shell environment, with the value of your GitLab's domain name.
Modify the url for the registered Runner to a dummy URL in the /etc/gitlab-runner/config.toml. This prevents the Runner from communicating with GitLab via API, and allows us to run manual API requests to see how the Runner manages jobs.
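Every API request in the later stages reads RUNNER_TOKEN and GITLAB_HOSTNAME from the shell, so it can help to confirm both are set before continuing. The following is a minimal sketch; missing_vars is a hypothetical helper name, not part of GitLab or the Runner.

```shell
# Hypothetical helper: print the name of every listed variable that is
# unset or empty in the current shell.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable named in $name (works in plain POSIX sh).
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Prints nothing when both variables are set.
missing_vars RUNNER_TOKEN GITLAB_HOSTNAME
```

If either name is printed, export that variable before running the curl commands in the later stages.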

Create a new project called runner-job-lifecycle with the following CI/CD YAML:

test:
  script:
    - echo "doing something"
  tags:
    - runner-job-lifecycle

Stage 3: Communication between the Runner and GitLab

Done with Stage 3

Communication between the Runner and GitLab occurs via REST API calls. The communication is almost always initiated by the Runner. Any data sent to the Runner from GitLab occurs in the context of a response to an API endpoint.

The Runner's responsibilities are to:

Request jobs from GitLab
Ensure the build environment is prepared for the job to execute (the executor does this)
Execute the job's logic
Send job trace log back to GitLab for storage
Return the job's status

Stage 4: Understanding what jobs are

Done with Stage 4

In Rails, a job is represented by the Ci::Build model. This class inherits from other classes, too.

Ci::Build < Ci::Processable < CommitStatus

A job has a status. Each of the classes shown above has state machine logic to determine what states a job can have, and the rules for progressing from one state to another.
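As a rough mental model, the transitions this module walks through in the later stages can be written as a lookup. This is a simplified sketch only: next_statuses is a hypothetical helper, and the real state machines in Ci::Build, Ci::Processable, and CommitStatus define more statuses and rules than are shown here.

```shell
# Hypothetical helper: print the statuses a job can move to from status $1.
# Covers only the transitions described in this module.
next_statuses() {
  case "$1" in
    created)   echo "preparing pending" ;;                  # Stage 6
    preparing) echo "pending" ;;                            # Stage 6
    pending)   echo "running failed" ;;                     # Stage 7; failed when dropped (Stage 11)
    running)   echo "success failed canceling canceled" ;;  # Stages 9 to 11
    canceling) echo "canceled" ;;                           # Stage 11
    *)         echo "" ;;                                   # completed, or not covered here
  esac
}

next_statuses pending
```

A completed status such as success prints an empty line, because this sketch defines no onward transitions for it.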

A job object belongs to a:

Pipeline; and
Runner

Each job has some metadata associated with it, in the Ci::BuildMetadata model.

Outside of GitLab, a job is represented as a "build environment" in which scripts are executed, with additional context such as environment variables. The build environment can be inside a shell, on a separate machine, in a Docker container, in a Kubernetes pod, etc.

Stage 5: Job creation

Done with Stage 5

A job is created during the creation of a pipeline. A pipeline can have one or many jobs.

The job's initial status is created.

Task: Get a job stuck at Created state

In your GitLab Rails console, disable the Ci::InitialPipelineProcessWorker Sidekiq workers as below:
```
Feature.disable(:"run_sidekiq_jobs_Ci::InitialPipelineProcessWorker")
```
Create a new pipeline for your runner-job-lifecycle project. Notice that the job is stuck at the created state.

Stage 6: Pending status

Done with Stage 6

A job may enter the preparing status after creation if there are any unmet prerequisites. We can check whether a job went to the preparing status, because the Ci::BuildPrepareWorker asynchronous job executes after the status transitions to preparing.

Ultimately, a job will then enter the pending status via a state machine event. We can verify that the job went to the pending status, as the BuildQueueWorker asynchronous job will execute after the status transitions to pending.

Job token creation

The CI/CD job token is created before the job is set to pending. We can check the job token from the Rails Console.

Task: Get job to Pending state

Before we transition the job to the pending state, let's confirm that the CI/CD job token exists. Execute the following in your Rails console:
```
job = Ci::Build.last
job.variables.to_hash["CI_JOB_TOKEN"]
```
- You'll see the unencrypted job token returned.
Re-enable the Sidekiq workers to allow the job to get to pending. In your GitLab Rails console, enable the Ci::InitialPipelineProcessWorker Sidekiq worker as below:
```
Feature.enable(:"run_sidekiq_jobs_Ci::InitialPipelineProcessWorker")
```
Navigate to your GitLab instance's Admin area > Monitoring > Background jobs UI. In the Scheduled tab, you'll notice the Ci::InitialPipelineProcessWorker is scheduled to run.
After the Sidekiq worker executes, confirm that the job's status is now pending.

From the Sidekiq log, verify that the BuildQueueWorker asynchronous Sidekiq job executed for that job (the worker's args should contain the job id)

grep -ir '"class":"BuildQueueWorker"' /var/log/gitlab/sidekiq/current | jq -rc '[.class, .args, .job_status]'
["BuildQueueWorker",["1"],"start"]
["BuildQueueWorker",["1"],"done"]
["BuildQueueWorker",["2"],"start"]
["BuildQueueWorker",["3"],"start"]
["BuildQueueWorker",["2"],"done"]
["BuildQueueWorker",["3"],"done"]

Stage 7: Running status

Done with Stage 7

A job is set to running once it has been assigned to a Runner. The Runner should be making requests for jobs periodically based on its configuration.

A Runner is assigned a job after it makes a request to GitLab via POST /api/v4/jobs/request API endpoint. GitLab will find candidate jobs, and assign one if a match is found.

Assigning jobs to Runners

When a suitable candidate job is found for the Runner making the API request, the job is assigned to that Runner as follows:

The Ci::Build object’s runner_id attribute is set to the Runner assigned to the job
The Ci::Build object’s run! method is executed
The job and its metadata are sent to the Runner as a JSON payload

When the run! method is executed, the status of the job is changed from pending to running.

Note: The status is set to running even before GitLab sends the job and its metadata back to the Runner for execution.

Job request response

An API request to the POST /api/v4/jobs/request endpoint will result in one of the following HTTP codes:

201 - Job was scheduled
204 - No job for Runner
403 - Forbidden
409 - Conflict

5xx errors can also be returned, but the codes above are the main ones used.
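When scripting job requests, it can be handy to turn the status code into a readable message. The sketch below only restates the meanings listed above; describe_job_request_code is a hypothetical helper name.

```shell
# Hypothetical helper: describe the HTTP status code returned by
# POST /api/v4/jobs/request, using the meanings listed in this stage.
describe_job_request_code() {
  case "$1" in
    201) echo "job was scheduled for this Runner" ;;
    204) echo "no job for this Runner" ;;
    403) echo "forbidden" ;;
    409) echo "conflict" ;;
    5??) echo "server error" ;;
    *)   echo "unexpected status code" ;;
  esac
}

describe_job_request_code 204
```

One way to capture the code is curl's -w '%{http_code}' option combined with -o to write the response body to a file, and then pass the captured value to the helper.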

Task: Assign a job to a Runner

Let's try to get a 201 response by accessing the POST /api/v4/jobs/request API endpoint. We will use the Runner authentication token to authenticate.

Make an API request to GitLab and save the payload to a file:

curl -X POST \
"https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
-d "token=$RUNNER_TOKEN" \
-v > "payload.json"

The response code should be a 201. The job payload (JSON) should be saved as payload.json
The job status should now be running.
The job should now have a runner_id assigned to it. The Rails snippet below will return the ID of the Runner assigned to the job. Ensure you change the job_id variable to the id of your job:
```
job_id = 2
job = Ci::Build.find(job_id)
job.runner_id
=> 1
```
View the payload.json file. It contains the job token, Git repository information, timeout values, CI/CD variables, etc.
Remove payload.json file. This file can contain sensitive information.

Stage 8: Update job trace log

Done with Stage 8

The job trace log is the output of the script being executed. This is sent periodically by the Runner, but also when the job completes.

Task: Send a job trace log update to GitLab

This requires making an API request to GitLab using the PATCH /api/v4/jobs/:id/trace endpoint. It uses the CI/CD job token, instead of the Runner authentication token.

Get your job's token via the Rails console:
```
job.variables.to_hash["CI_JOB_TOKEN"]
```
Ensure an environment variable exists with the name JOB_TOKEN with the value of your job token from the previous step.
```
JOB_TOKEN="your_token_value"
```

Make an API request to GitLab. Ensure you set job_id to your specific job id:

job_id=123
trace_message="Hello! This is a job log update from the Runner."
content_size=${#trace_message}
content_range=0-$(($content_size-1))

curl -X PATCH \
"https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id/trace" \
-H "Content-Range:$content_range" \
-H "Job-Token:$JOB_TOKEN" \
-d "$trace_message" \
-v

The response code should be 202.
Confirm that the job log in the UI contains the message we sent.
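The request above sends the first chunk of the log, so its range starts at 0. The sketch below shows how the Content-Range value could be computed for a later chunk. It assumes that the range is expressed in bytes and that each chunk starts at the offset where the previous one ended. content_range is a hypothetical helper name. Note that ${#trace_message} counts characters, which can differ from the byte count for non-ASCII text, so this sketch uses wc -c instead.

```shell
# Hypothetical helper: print the Content-Range value for a chunk that starts
# at byte offset $1 and carries the text $2.
content_range() {
  local offset=$1
  local size
  # wc -c counts bytes; tr removes the padding some wc versions add.
  size=$(printf '%s' "$2" | wc -c | tr -d ' ')
  echo "${offset}-$((offset + size - 1))"
}

first_chunk="Hello! "
second_chunk="More output."

content_range 0 "$first_chunk"    # prints 0-6
content_range 7 "$second_chunk"   # prints 7-18
```

The second call starts at offset 7 because the first chunk occupied bytes 0 through 6.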

Stage 9: Job status

Done with Stage 9

Completed statuses

Once a job has been set to a completed status ([:success, :failed, :canceled]), the Ci::BuildFinishedWorker Sidekiq worker is asynchronously executed. The presence of this Sidekiq worker in logs indicates the job has completed. Code reference

We can be confident that the Ci::BuildFinishedWorker is an indicator of job completion because the Sidekiq worker is enqueued as part of the Ci::Build model's state machine logic. The Sidekiq worker will be enqueued when the CI/CD job's status is set to a completed status.

If the job gets to a failed status, and auto retry is allowed, the job is retried. Code reference

Job status influencing Pipeline status

When a job changes status, it causes the PipelineProcessWorker to be executed asynchronously. This eventually executes logic that will ensure the pipeline's status is an aggregation of its jobs' statuses.

Failure reason

A job can fail for a variety of reasons. The job failure reason is important in understanding why a job failed. It can also tell us where in the job lifecycle a job failed.

Failure reasons sent by Runner

The Runner will make an API request to GitLab to change the job status. One of the parameters sent in that API request is the failure_reason. The following failure reasons are set by the Runner upon failure:

script_failure
runner_system_failure
job_execution_timeout
image_pull_failure
unknown_failure

Failure reasons used by GitLab

When GitLab fails a job, it sets a failure reason. The failure reasons used by GitLab are too numerous to list here.

A full list of the failure reasons is available in the Enums::Ci::CommitStatus module.

Because GitLab has many more failure reasons of its own, and the Runner can only fail a job for a limited number of reasons, we can use the failure reason to troubleshoot why a job has failed.
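As a first triage step, the failure reason can be checked against the short list of reasons sent by the Runner. The following is a minimal sketch; failure_origin is a hypothetical helper, and its answer should be treated as a hint about where to look first, not as proof of the cause.

```shell
# Hypothetical helper: guess which side set a failure reason, based on the
# list of Runner-sent reasons given in this stage.
failure_origin() {
  case "$1" in
    script_failure|runner_system_failure|job_execution_timeout|image_pull_failure|unknown_failure)
      echo "runner" ;;
    *)
      # Anything else is assumed to come from GitLab; see
      # Enums::Ci::CommitStatus for the full list.
      echo "gitlab" ;;
  esac
}

failure_origin stuck_or_timeout_failure   # prints gitlab
```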

Failure reason mapping in the UI

The failure reasons might be visible in logs, however the UI maps these strings to more descriptive messages.

See the mapping between the failure reason and the message displayed in the UI.

Stage 10: Job status updates from Runner

Done with Stage 10

The Runner has now received the job from GitLab, and will start executing it. The Runner will update the job’s status during execution using the PUT /api/v4/jobs/:id API endpoint.

Task: Change job status via API

This requires making an API request to GitLab using the PUT /api/v4/jobs/:id endpoint. It uses the CI/CD job token, instead of the Runner authentication token.

Make an API request to GitLab. Ensure you set job_id to your specific job id:

job_id=123
state="failed"
failure_reason="script_failure"

curl -X PUT \
"https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id" \
-d "token=$JOB_TOKEN" \
-d "state=$state" \
-d "failure_reason=$failure_reason" \
-d "exit_code=1" \
-v

Confirm the request receives a 200 response.
Confirm the job's status is now failed.

Stage 11: Job status update from GitLab

Done with Stage 11

The Runner is not the only thing that can influence the job’s status once it's started executing.

Jobs are dropped if they've been stuck at the pending, running, scheduled, or canceling status for longer than the timeout. The StuckCiJobsWorker cron-scheduled worker provides this functionality.

The StuckCiJobsWorker runs every hour at the cron schedule 0 * * * *. It will then check stuck pending jobs that are older than 1 hour.

drop is a state machine event, which takes the job from canceling to canceled, or any other status to failed.
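The stuck-job check and the drop event described above can be sketched as a single decision. This is an illustration only: status_after_stuck_check is a hypothetical helper, and the timeout is passed in as a parameter because the real timeouts are defined by GitLab and can differ per status.

```shell
# Hypothetical helper: print the status a job would have after a stuck-job check.
#   $1 = current status
#   $2 = age of the job in seconds
#   $3 = timeout in seconds (assumed input, not read from GitLab)
status_after_stuck_check() {
  local current=$1 age=$2 timeout=$3
  case "$current" in
    pending|running|scheduled|canceling)
      if [ "$age" -gt "$timeout" ]; then
        # drop: canceling becomes canceled, any other status becomes failed.
        if [ "$current" = "canceling" ]; then
          echo "canceled"
        else
          echo "failed"
        fi
        return
      fi
      ;;
  esac
  # Not stuck, or not a status that the check looks at.
  echo "$current"
}

status_after_stuck_check pending 3900 3600   # prints failed
```

Here 3900 seconds is 65 minutes and 3600 seconds is the 1 hour mentioned above for stuck pending jobs.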

Task: Job failure stuck at pending

Let's get a job to be stuck at pending, and allow it to have its status changed by GitLab.

Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.

In the Rails console, change the job's created and updated date to be 25 hours ago:

job_id = 123
job = Ci::Build.find(job_id)
job.created_at = 25.hours.ago
job.updated_at = 25.hours.ago
job.save!

Execute the StuckCiJobsWorker:
```
StuckCiJobsWorker.perform_async
```

Confirm that the job has been set to failed status, and check the failure reason is stuck_or_timeout_failure:

irb(main):036:0> job_id = 123
=> 123
irb(main):037:0> job = Ci::Build.find(job_id)
=>
#<Ci::Build:0x00007f14f5230b48
...
irb(main):038:0> job.status
=> "failed"
irb(main):039:0> job.failure_reason
=> "stuck_or_timeout_failure"

Task: Job failure stuck at running

Let's get a job to be stuck at running, and allow it to have its status changed by GitLab.

Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.

Make an API request to GitLab to get the job assigned and set to running:

curl -X POST \
"https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
-d "token=$RUNNER_TOKEN"

Confirm the job is set to running.

In the Rails console, change the job's created and updated date to be 65 minutes ago:

job_id = 123
job = Ci::Build.find(job_id)
job.created_at = 65.minutes.ago
job.updated_at = 65.minutes.ago
job.save!

Execute the StuckCiJobsWorker:
```
StuckCiJobsWorker.perform_async
```
Navigate to your GitLab instance's Admin area > Monitoring > Background jobs UI, in the Scheduled tab you'll notice the Ci::StuckBuilds::DropRunningWorker is scheduled to run. The StuckCiJobsWorker schedules additional delayed workers to run later. Once the Ci::StuckBuilds::DropRunningWorker job completes, the status of the job should be failed.

Wait for the scheduled Sidekiq Ci::StuckBuilds::DropRunningWorker task to run and then confirm that the job has been set to failed status, and check the failure reason is stuck_or_timeout_failure:

irb(main):048:0> job_id = 123
=> 123
irb(main):049:0> job = Ci::Build.find(job_id)
=>
#<Ci::Build:0x00007f14f58c6188
...
irb(main):050:0> job.status
=> "failed"
irb(main):051:0> job.failure_reason
=> "stuck_or_timeout_failure"

Task: Cancel job

Create a new pipeline for the runner-job-lifecycle project. As our Runner is still mis-configured, the job's status should be stuck at pending. The Runner can't communicate with GitLab.

Make an API request to GitLab to get the job assigned and set to running:

curl -X POST \
"https://$GITLAB_HOSTNAME/api/v4/jobs/request" \
-d "token=$RUNNER_TOKEN"

Confirm the job is set to running.

Set the job's Runner Manager's runtime features to allow jobs to be canceled gracefully:

job_id = 123
job = Ci::Build.find(job_id)
runner_manager = job.runner_manager
runner_manager.update!(runtime_features: { 'cancel_gracefully' => true })
runner_manager.reload
job.cancel_gracefully?

Cancel the job in the Rails console:

job_id = 123
job = Ci::Build.find(job_id)
job.cancel

Confirm the job is in the canceling status.
Get your job's token via the Rails console:
```
job.variables.to_hash["CI_JOB_TOKEN"]
```
Ensure an environment variable exists with the name JOB_TOKEN with the value of your job token from the previous step.
```
JOB_TOKEN="your_token_value"
```

Update the job's trace log via API. Ensure you set job_id to your specific job id:

job_id=123
trace_message="Hello! This is a job log update from the Runner."
content_size=${#trace_message}
content_range=0-$(($content_size-1))

curl -X PATCH \
"https://$GITLAB_HOSTNAME/api/v4/jobs/$job_id/trace" \
-H "Content-Range:$content_range" \
-H "Job-Token:$JOB_TOKEN" \
-d "$trace_message" \
-v

Confirm the response is 202. There should be a response header called job-status, which contains canceling. This is the mechanism that allows the Runner to cancel the job gracefully. The Runner will then stop executing the job, and update the status to GitLab.
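To read the job-status header in a script, the response headers need to be captured in a parseable form. curl -v prints them to stderr with a "< " prefix, whereas -D - (or -i) writes them to stdout without a prefix. The sketch below extracts the header value from such output. job_status_from_headers is a hypothetical helper, and the sample response is made up for illustration.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the Job-Status header, if present.
job_status_from_headers() {
  # Header names are case-insensitive; tr strips the carriage returns
  # that end each header line.
  tr -d '\r' | awk -F': *' 'tolower($1) == "job-status" { print $2; exit }'
}

# Made-up response headers, shaped like the output of curl -D -
printf 'HTTP/1.1 202 Accepted\r\nJob-Status: canceling\r\n\r\n' | job_status_from_headers
```

With a real request, the same helper could be fed by adding -s -o /dev/null -D - to the PATCH command above and piping the result into it.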

Stage 12: Review

Done with Stage 12

Any updates or improvements needed? If there are any dead links, out of date or inaccurate content, missing content whether in this module's issue template or in other documentation, list them below as tasks for yourself! Once ready, have a maintainer or manager review.

Update ...

Final Stage

Have your trainer review your tickets and assessment. If you do not have a trainer, ask an expert to review.