GitLab runners and jobs get stuck due to a 403 response from the GitLab API
Summary
Sometimes our GitLab jobs get stuck due to a 403 response from the GitLab API. We don't get any response from our runners hosted on AWS EKS, although the job pods have finished (verified by looking at the corresponding pod logs).
Steps to reproduce
We have no insight into how to reproduce the behaviour. It seems to correlate with longer-running gitlab-runner pods (gitlab-runner:alpine-v17.6.0).
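A sketch of how we check for this correlation (namespace and label selector are assumptions about our deployment, adjust as needed):
# List gitlab-runner pods sorted by age, to spot the long-running ones.
kubectl get pods -n gitlab-runner -l app=gitlab-runner --sort-by=.metadata.creationTimestamp
# Grep a suspect pod's runner log for the failing job submissions.
kubectl logs -n gitlab-runner <runner-pod-name> | grep '403 Forbidden'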
.gitlab-ci.yml
It shouldn't be the reason for the error, but for completeness:
# Docker images provided by https://github.com/cypress-io/cypress-docker-images
stages:
  - source_check
  - install_dependencies
  - cleanup
  - manual
  - test

variables:
  STAGE:
    description: 'Stage the tests are running on.'
    value: 'STAGING'
    options:
      - 'STAGING'
      - 'TESTING'
  SPEC_PATH:
    description: 'Specifies the path to the tests to be executed. Example value: **/TheBestTest.cy.js'
  CYPRESS_TAG:
    description: 'Tag that is used to identify the run in the Cypress Cloud dashboard. Example value: staging-ABC-XXXX'
  PARALLEL_MACHINES:
    description: 'Specifies the number of machines that are used for parallelization.'
    value: '1'
    options:
      - '1'
      - '2'
      - '3'
      - '4'
      - '5'
      - '6'
      - '7'
      - '8'
      - '9'
      - '10'
      - '11'
      - '12'
      - '13'
      - '14'
      - '15'
      - '16'
      - '17'
      - '18'
      - '19'
      - '20'
      - '21'
      - '22'
      - '23'
      - '24'
      - '25'
      - '26'
      - '27'
      - '28'
      - '29'
      - '30'
  CYPRESS_CACHE_FOLDER: '$CI_PROJECT_DIR/cache/Cypress'
  npm_config_cache: '$CI_PROJECT_DIR/.npm'
  CYPRESS_PIPELINE_JOB_ID: '$CI_JOB_ID'
  KUBERNETES_CPU_REQUEST: '2'
  KUBERNETES_MEMORY_REQUEST: '4Gi'

# Don't see another workaround: "extends" and "parallel" don't take a variable directly, and we want to set the machine count per schedule via the UI while staying backwards compatible
# see: https://gitlab.com/gitlab-org/gitlab/-/issues/11549
include:
  - local: 'gitlab-templates/parallel1.yml'
    rules:
      - if: $PARALLEL_MACHINES == "1" || $PARALLEL_MACHINES == null
        when: always
  - local: 'gitlab-templates/parallel2.yml'
    rules:
      - if: $PARALLEL_MACHINES == "2"
        when: always
  - local: 'gitlab-templates/parallel3.yml'
    rules:
      - if: $PARALLEL_MACHINES == "3"
        when: always
  - local: 'gitlab-templates/parallel4.yml'
    rules:
      - if: $PARALLEL_MACHINES == "4"
        when: always
  - local: 'gitlab-templates/parallel5.yml'
    rules:
      - if: $PARALLEL_MACHINES == "5"
        when: always
  - local: 'gitlab-templates/parallel6.yml'
    rules:
      - if: $PARALLEL_MACHINES == "6"
        when: always
  - local: 'gitlab-templates/parallel7.yml'
    rules:
      - if: $PARALLEL_MACHINES == "7"
        when: always
  - local: 'gitlab-templates/parallel8.yml'
    rules:
      - if: $PARALLEL_MACHINES == "8"
        when: always
  - local: 'gitlab-templates/parallel9.yml'
    rules:
      - if: $PARALLEL_MACHINES == "9"
        when: always
  - local: 'gitlab-templates/parallel10.yml'
    rules:
      - if: $PARALLEL_MACHINES == "10"
        when: always
  - local: 'gitlab-templates/parallel11.yml'
    rules:
      - if: $PARALLEL_MACHINES == "11"
        when: always
  - local: 'gitlab-templates/parallel12.yml'
    rules:
      - if: $PARALLEL_MACHINES == "12"
        when: always
  - local: 'gitlab-templates/parallel13.yml'
    rules:
      - if: $PARALLEL_MACHINES == "13"
        when: always
  - local: 'gitlab-templates/parallel14.yml'
    rules:
      - if: $PARALLEL_MACHINES == "14"
        when: always
  - local: 'gitlab-templates/parallel15.yml'
    rules:
      - if: $PARALLEL_MACHINES == "15"
        when: always
  - local: 'gitlab-templates/parallel16.yml'
    rules:
      - if: $PARALLEL_MACHINES == "16"
        when: always
  - local: 'gitlab-templates/parallel17.yml'
    rules:
      - if: $PARALLEL_MACHINES == "17"
        when: always
  - local: 'gitlab-templates/parallel18.yml'
    rules:
      - if: $PARALLEL_MACHINES == "18"
        when: always
  - local: 'gitlab-templates/parallel19.yml'
    rules:
      - if: $PARALLEL_MACHINES == "19"
        when: always
  - local: 'gitlab-templates/parallel20.yml'
    rules:
      - if: $PARALLEL_MACHINES == "20"
        when: always
  - local: 'gitlab-templates/parallel21.yml'
    rules:
      - if: $PARALLEL_MACHINES == "21"
        when: always
  - local: 'gitlab-templates/parallel22.yml'
    rules:
      - if: $PARALLEL_MACHINES == "22"
        when: always
  - local: 'gitlab-templates/parallel23.yml'
    rules:
      - if: $PARALLEL_MACHINES == "23"
        when: always
  - local: 'gitlab-templates/parallel24.yml'
    rules:
      - if: $PARALLEL_MACHINES == "24"
        when: always
  - local: 'gitlab-templates/parallel25.yml'
    rules:
      - if: $PARALLEL_MACHINES == "25"
        when: always
  - local: 'gitlab-templates/parallel26.yml'
    rules:
      - if: $PARALLEL_MACHINES == "26"
        when: always
  - local: 'gitlab-templates/parallel27.yml'
    rules:
      - if: $PARALLEL_MACHINES == "27"
        when: always
  - local: 'gitlab-templates/parallel28.yml'
    rules:
      - if: $PARALLEL_MACHINES == "28"
        when: always
  - local: 'gitlab-templates/parallel29.yml'
    rules:
      - if: $PARALLEL_MACHINES == "29"
        when: always
  - local: 'gitlab-templates/parallel30.yml'
    rules:
      - if: $PARALLEL_MACHINES == "30"
        when: always
.default_cypress_template: &default_cypress_template
  tags:
    - ci-public-nodepool
  image: cypress/browsers:node-22.11.0-chrome-130.0.6723.69-1-ff-132.0-edge-130.0.2849.56-1
  dependencies:
    - install_dependencies
  allow_failure: true

.default_full_build_template: &default_full_build_template
  <<: *default_cypress_template
  stage: test
  rules:
    - if: ($PREDEFINED_SCHEDULE_TYPE == "STAGING-FULL-BUILD" || $PREDEFINED_SCHEDULE_TYPE == "TESTING-FULL-BUILD") && $CI_PIPELINE_SOURCE == "schedule"
  script: |
    if [ "$PREDEFINED_SCHEDULE_TYPE" = "STAGING-FULL-BUILD" ]; then
      npm run cy:run:staging:$SCRIPT_NAME
    elif [ "$PREDEFINED_SCHEDULE_TYPE" = "TESTING-FULL-BUILD" ]; then
      npm run cy:run:$SCRIPT_NAME
    fi

check_format:
  tags:
    - ci-private-nodepool
  image: node:latest
  stage: source_check
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  before_script:
    - npm install -g prettier
  script:
    - npm run prettier:check

install_dependencies:
  tags:
    - ci-private-nodepool
  image: node:latest
  stage: install_dependencies
  cache:
    policy: pull-push
    key:
      files:
        - package-lock.json
    paths:
      - .npm
      - cache
      - node_modules
  artifacts:
    paths:
      - .npm
      - cache
      - node_modules
    expire_in: 1 day
  script:
    - npm install
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule" || $CI_PIPELINE_SOURCE == "web"

manual:
  extends: .parallel
  <<: *default_cypress_template
  stage: manual
  rules:
    - if: $SPEC_PATH != null && $STAGE != null && $CYPRESS_TAG != null && $PARALLEL_MACHINES != null && $PREDEFINED_SCHEDULE_TYPE == null && ($CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "schedule")
  script:
    - |
      if [ "$STAGE" = "STAGING" ]; then
        ./node_modules/.bin/cypress run --spec "$SPEC_PATH" --record --tag "$CYPRESS_TAG" --browser chrome --parallel --config baseUrl=https://someurl --env PortalURL=https://someurl
      elif [ "$STAGE" = "TESTING" ]; then
        ./node_modules/.bin/cypress run --spec "$SPEC_PATH" --record --tag "$CYPRESS_TAG" --browser chrome --parallel
      fi

# because cleanup shouldn't be recorded to the dashboard
cleanup:
  <<: *default_cypress_template
  stage: cleanup
  rules:
    - if: ($PREDEFINED_SCHEDULE_TYPE == "STAGING-CLEANUP" || $PREDEFINED_SCHEDULE_TYPE == "TESTING-CLEANUP") && $CI_PIPELINE_SOURCE == "schedule"
  script:
    - |
      if [ "$PREDEFINED_SCHEDULE_TYPE" = "STAGING-CLEANUP" ]; then
        npm run cy:run:staging:cleanup
      elif [ "$PREDEFINED_SCHEDULE_TYPE" = "TESTING-CLEANUP" ]; then
        npm run cy:run:cleanup
      fi

serviceName1:
  parallel: 15
  <<: *default_full_build_template
  variables:
    SCRIPT_NAME: serviceName1

serviceName2:
  parallel: 3
  <<: *default_full_build_template
  variables:
    SCRIPT_NAME: serviceName2

...
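For completeness: the parallelN.yml templates are near-identical, one per machine count. A hypothetical sketch of how they could be generated (the exact shape of the .parallel fragment is an assumption, since the templates aren't reproduced here):
#!/bin/sh
# Hypothetical generator for the 30 near-identical templates included above.
# Assumes each parallelN.yml defines only the ".parallel" fragment that the
# "manual" job extends. Depending on the GitLab version, "parallel: 1" may be
# rejected, so the N=1 template might need to omit the keyword entirely.
mkdir -p gitlab-templates
for n in $(seq 1 30); do
  cat > "gitlab-templates/parallel${n}.yml" <<EOF
.parallel:
  parallel: ${n}
EOF
done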
Actual behavior
We have GitLab runners that were launched approx. 5 days ago. They all share the same authentication token, registered via gitlab.com.
Some of our jobs fail. A look at the gitlab-runner pod logs shows errors of this kind:
WARNING: Submitting job to coordinator... job failed bytesize=20764 checksum=crc32:75754b1d code=403 job=8540930992 job-status= runner=t3_qYADwj status=PUT https://gitlab.com/api/v4/jobs/8540930992: 403 Forbidden update-interval=0s
The job pods corresponding to the executed jobs show that they terminated correctly. On the job side (gitlab.com), no logs appear after the 403 in the related gitlab-runner.
This behaviour occurs sporadically across all gitlab-runner pods.
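One way to cross-check the mismatch, sketched below with a placeholder project ID (the jobs endpoint is GET /projects/:id/jobs/:job_id; the ci_job_id pod label comes from our runner config shown further down):
# Ask GitLab what it thinks the job status is (stays "running" for the stuck jobs)...
curl --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" "https://gitlab.com/api/v4/projects/<project-id>/jobs/8540930992"
# ...and compare with the corresponding pod, which has long since terminated:
kubectl get pods -n gitlab-runner -l ci_job_id=8540930992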
Expected behaviour
No 403, so that our gitlab-runners can report their finished job pods back to GitLab.
Relevant logs and/or screenshots
job log
WARNING: Submitting job to coordinator... job failed bytesize=20764 checksum=crc32:75754b1d code=403 job=8540930992 job-status= runner=t3_qYADwj status=PUT https://gitlab.com/api/v4/jobs/8540930992: 403 Forbidden update-interval=0s
Environment description
Using AWS EKS 1.30 with gitlab-runner v17.6 (https://artifacthub.io/packages/helm/gitlab/gitlab-runner).
config.toml contents
[[runners]]
  output_limit = 8192
  environment = ["CI_SERVICE_HOST=127.0.0.1", "DOCKER_TLS_CERTDIR=", "FF_GITLAB_REGISTRY_HELPER_IMAGE=true", "FF_WAIT_FOR_POD_TO_BE_REACHABLE=true", "FF_TIMESTAMPS=true"]
  executor = "kubernetes"
  [runners.cache]
    Type = "s3"
    Path = "gitlab_runner"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "${ci_cache_bucket_name}"
      BucketLocation = "${region}"
      Insecure = false
  [runners.kubernetes]
    retry_limit = 5
    namespace = "${namespace}"
    image = "ubuntu:24.04"
    poll_timeout = 720
    pod_termination_grace_period_seconds = 60
    privileged = true
    cpu_request = "1"
    cpu_request_overwrite_max_allowed = "16"
    memory_request = "8Gi"
    memory_limit = "12Gi"
    memory_request_overwrite_max_allowed = "32Gi"
    memory_limit_overwrite_max_allowed = "32Gi"
    service_cpu_request = "500m"
    service_memory_request = "512Mi"
    service_memory_limit = "4Gi"
    helper_cpu_request = "500m"
    helper_memory_request = "256Mi"
    helper_memory_limit = "2Gi"
    [runners.kubernetes.pod_annotations]
      "karpenter.sh/do-not-evict" = "true"
    [runners.kubernetes.pod_labels]
      "ci_project_name" = "$CI_PROJECT_NAME"
      "ci_job_id" = "$CI_JOB_ID"
      "gitlab" = "runner"
    [runners.kubernetes.retry_limits]
      "TLS handshake timeout" = 10
      "tls: internal error" = 10
Used GitLab Runner version
gitlab-runner v17.6 (https://artifacthub.io/packages/helm/gitlab/gitlab-runner). Other required information can be found in the config above.
Possible fixes
It looks like there may be a legitimate bug with token expiration, where jobs get stuck running on the GitLab side while the runner has finished (based on reading the code).
The legacy flow for expiration due to timeouts is:
- GitLab gets a "failed" status from the runner when the job times out, via the update job and append trace endpoints
- GitLab runs the BuildFinishedWorker, which expires/removes the token
- The job ends up in a failed state
The new flow for timeouts appears to be:
- The token is expired because of the JWT claims after the timeout
- The runner tries to update the job via the update and append endpoints
- GitLab can't find the expired token and returns a 403
- The runner kills the job on its side, making no more requests to GitLab
- The job is never set to failed in GitLab and stays running indefinitely until the StuckCiJobsWorker gets to it (this part is based on user reports and on reading the code and seeing no obvious status changes when forbidden is returned; a full reproduction of the problem has not been completed by me)
We need GitLab to fail the job when it returns a 403 for an expired JWT token.
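Until then, a possible manual workaround is to cancel the stuck jobs via the API (project ID and token are placeholders; the cancel endpoint is POST /projects/:id/jobs/:job_id/cancel):
# Cancel a job that GitLab still believes is running, instead of waiting
# for StuckCiJobsWorker to eventually reap it.
curl --request POST --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
  "https://gitlab.com/api/v4/projects/<project-id>/jobs/<job-id>/cancel"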