GitLab Runners and jobs get stuck due to a 403 response from the GitLab API

Summary

Sometimes our GitLab jobs get stuck due to a 403 response from the GitLab API. Nothing more reaches GitLab from our runners hosted on AWS EKS, even though the job pods have finished (verified by looking at the corresponding pod logs).

Steps to reproduce

We have no reliable way to reproduce the behaviour. It seems to correlate with longer-running gitlab-runner pods (gitlab-runner:alpine-v17.6.0).

.gitlab-ci.yml

The CI configuration shouldn't be the reason for the error, but for completeness:

# Docker images provided by https://github.com/cypress-io/cypress-docker-images
stages:
  - source_check
  - install_dependencies
  - cleanup
  - manual
  - test

variables:
  STAGE:
    description: 'Stage the tests are running on.'
    value: 'STAGING'
    options:
      - 'STAGING'
      - 'TESTING'
  SPEC_PATH:
    description: 'Specifies the path to the tests to be executed. Example value: **/TheBestTest.cy.js'
  CYPRESS_TAG:
    description: 'Tag that is used to identify the run in cypress cloud dashboard. Example value: staging-ABC-XXXX'
  PARALLEL_MACHINES:
    description: 'Specifies the number of machines that are used for parallelization.'
    value: '1'
    options:
      - '1'
      - '2'
      - '3'
      - '4'
      - '5'
      - '6'
      - '7'
      - '8'
      - '9'
      - '10'
      - '11'
      - '12'
      - '13'
      - '14'
      - '15'
      - '16'
      - '17'
      - '18'
      - '19'
      - '20'
      - '21'
      - '22'
      - '23'
      - '24'
      - '25'
      - '26'
      - '27'
      - '28'
      - '29'
      - '30'
  CYPRESS_CACHE_FOLDER: '$CI_PROJECT_DIR/cache/Cypress'
  npm_config_cache: '$CI_PROJECT_DIR/.npm'
  CYPRESS_PIPELINE_JOB_ID: '$CI_JOB_ID'
  KUBERNETES_CPU_REQUEST: '2'
  KUBERNETES_MEMORY_REQUEST: '4Gi'

# We don't see another workaround: "extends" and "parallel" don't take a variable directly, and we want to set the number of machines per schedule via the UI while staying backwards compatible
# see: https://gitlab.com/gitlab-org/gitlab/-/issues/11549
include:
  - local: 'gitlab-templates/parallel1.yml'
    rules:
      - if: $PARALLEL_MACHINES == "1" || $PARALLEL_MACHINES == null
        when: always
  - local: 'gitlab-templates/parallel2.yml'
    rules:
      - if: $PARALLEL_MACHINES == "2"
        when: always
  - local: 'gitlab-templates/parallel3.yml'
    rules:
      - if: $PARALLEL_MACHINES == "3"
        when: always
  - local: 'gitlab-templates/parallel4.yml'
    rules:
      - if: $PARALLEL_MACHINES == "4"
        when: always
  - local: 'gitlab-templates/parallel5.yml'
    rules:
      - if: $PARALLEL_MACHINES == "5"
        when: always
  - local: 'gitlab-templates/parallel6.yml'
    rules:
      - if: $PARALLEL_MACHINES == "6"
        when: always
  - local: 'gitlab-templates/parallel7.yml'
    rules:
      - if: $PARALLEL_MACHINES == "7"
        when: always
  - local: 'gitlab-templates/parallel8.yml'
    rules:
      - if: $PARALLEL_MACHINES == "8"
        when: always
  - local: 'gitlab-templates/parallel9.yml'
    rules:
      - if: $PARALLEL_MACHINES == "9"
        when: always
  - local: 'gitlab-templates/parallel10.yml'
    rules:
      - if: $PARALLEL_MACHINES == "10"
        when: always
  - local: 'gitlab-templates/parallel11.yml'
    rules:
      - if: $PARALLEL_MACHINES == "11"
        when: always
  - local: 'gitlab-templates/parallel12.yml'
    rules:
      - if: $PARALLEL_MACHINES == "12"
        when: always
  - local: 'gitlab-templates/parallel13.yml'
    rules:
      - if: $PARALLEL_MACHINES == "13"
        when: always
  - local: 'gitlab-templates/parallel14.yml'
    rules:
      - if: $PARALLEL_MACHINES == "14"
        when: always
  - local: 'gitlab-templates/parallel15.yml'
    rules:
      - if: $PARALLEL_MACHINES == "15"
        when: always
  - local: 'gitlab-templates/parallel16.yml'
    rules:
      - if: $PARALLEL_MACHINES == "16"
        when: always
  - local: 'gitlab-templates/parallel17.yml'
    rules:
      - if: $PARALLEL_MACHINES == "17"
        when: always
  - local: 'gitlab-templates/parallel18.yml'
    rules:
      - if: $PARALLEL_MACHINES == "18"
        when: always
  - local: 'gitlab-templates/parallel19.yml'
    rules:
      - if: $PARALLEL_MACHINES == "19"
        when: always
  - local: 'gitlab-templates/parallel20.yml'
    rules:
      - if: $PARALLEL_MACHINES == "20"
        when: always
  - local: 'gitlab-templates/parallel21.yml'
    rules:
      - if: $PARALLEL_MACHINES == "21"
        when: always
  - local: 'gitlab-templates/parallel22.yml'
    rules:
      - if: $PARALLEL_MACHINES == "22"
        when: always
  - local: 'gitlab-templates/parallel23.yml'
    rules:
      - if: $PARALLEL_MACHINES == "23"
        when: always
  - local: 'gitlab-templates/parallel24.yml'
    rules:
      - if: $PARALLEL_MACHINES == "24"
        when: always
  - local: 'gitlab-templates/parallel25.yml'
    rules:
      - if: $PARALLEL_MACHINES == "25"
        when: always
  - local: 'gitlab-templates/parallel26.yml'
    rules:
      - if: $PARALLEL_MACHINES == "26"
        when: always
  - local: 'gitlab-templates/parallel27.yml'
    rules:
      - if: $PARALLEL_MACHINES == "27"
        when: always
  - local: 'gitlab-templates/parallel28.yml'
    rules:
      - if: $PARALLEL_MACHINES == "28"
        when: always
  - local: 'gitlab-templates/parallel29.yml'
    rules:
      - if: $PARALLEL_MACHINES == "29"
        when: always
  - local: 'gitlab-templates/parallel30.yml'
    rules:
      - if: $PARALLEL_MACHINES == "30"
        when: always

.default_cypress_template: &default_cypress_template
  tags:
    - ci-public-nodepool
  image: cypress/browsers:node-22.11.0-chrome-130.0.6723.69-1-ff-132.0-edge-130.0.2849.56-1
  dependencies:
    - install_dependencies
  allow_failure: true

.default_full_build_template: &default_full_build_template
  <<: *default_cypress_template
  stage: test
  rules:
    - if: ($PREDEFINED_SCHEDULE_TYPE == "STAGING-FULL-BUILD" || $PREDEFINED_SCHEDULE_TYPE == "TESTING-FULL-BUILD") && $CI_PIPELINE_SOURCE == "schedule"
  script: |
    if [ "$PREDEFINED_SCHEDULE_TYPE" = "STAGING-FULL-BUILD" ]; then
      npm run cy:run:staging:$SCRIPT_NAME
    elif [ "$PREDEFINED_SCHEDULE_TYPE" = "TESTING-FULL-BUILD" ]; then
      npm run cy:run:$SCRIPT_NAME
    fi

check_format:
  tags:
    - ci-private-nodepool
  image: node:latest
  stage: source_check
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  before_script:
    - npm install -g prettier
  script:
    - npm run prettier:check

install_dependencies:
  tags:
    - ci-private-nodepool
  image: node:latest
  stage: install_dependencies
  cache:
    policy: pull-push
    key:
      files:
        - package-lock.json
    paths:
      - .npm
      - cache
      - node_modules
  artifacts:
    paths:
      - .npm
      - cache
      - node_modules
    expire_in: 1 day
  script:
    - npm install
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule" || $CI_PIPELINE_SOURCE == "web"

manual:
  extends: .parallel
  <<: *default_cypress_template
  stage: manual
  rules:
    - if: $SPEC_PATH != null && $STAGE != null && $CYPRESS_TAG != null && $PARALLEL_MACHINES != null && $PREDEFINED_SCHEDULE_TYPE == null && ($CI_PIPELINE_SOURCE == "web" || $CI_PIPELINE_SOURCE == "schedule")
  script:
    - |
      if [ "$STAGE" = "STAGING" ]; then
        ./node_modules/.bin/cypress run --spec "$SPEC_PATH" --record --tag "$CYPRESS_TAG" --browser chrome --parallel --config baseUrl=https://someurl --env PortalURL=https://someurl
      elif [ "$STAGE" = "TESTING" ]; then
        ./node_modules/.bin/cypress run --spec "$SPEC_PATH" --record --tag "$CYPRESS_TAG" --browser chrome --parallel
      fi

# Separate job because cleanup shouldn't be recorded to the dashboard
cleanup:
  <<: *default_cypress_template
  stage: cleanup
  rules:
    - if: ($PREDEFINED_SCHEDULE_TYPE == "STAGING-CLEANUP" || $PREDEFINED_SCHEDULE_TYPE == "TESTING-CLEANUP") && $CI_PIPELINE_SOURCE == "schedule"
  script:
    - |
      if [ "$PREDEFINED_SCHEDULE_TYPE" = "STAGING-CLEANUP" ]; then
        npm run cy:run:staging:cleanup
      elif [ "$PREDEFINED_SCHEDULE_TYPE" = "TESTING-CLEANUP" ]; then
        npm run cy:run:cleanup
      fi

serviceName1:
  parallel: 15
  <<: *default_full_build_template
  variables:
    SCRIPT_NAME: serviceName1

serviceName2:
  parallel: 3
  <<: *default_full_build_template
  variables:
    SCRIPT_NAME: serviceName2

...

Actual behavior

Our GitLab Runners were launched approximately 5 days ago. They all share the same authentication token, registered via gitlab.com.

Some of our jobs fail. The gitlab-runner pod logs show errors of this kind:

WARNING: Submitting job to coordinator... job failed  bytesize=20764 checksum=crc32:75754b1d code=403 job=8540930992 job-status= runner=t3_qYADwj status=PUT https://gitlab.com/api/v4/jobs/8540930992: 403 Forbidden update-interval=0s

The job pods corresponding to the executed jobs show that they terminated correctly. On the job side (gitlab.com), no further log output is shown after the 403 appears in the related gitlab-runner pod.

This behaviour occurs sporadically across all gitlab-runner pods.
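
For reference, the finished job pods can be located roughly like this; a minimal sketch assuming the pod labels from the config.toml below ("gitlab" = "runner", "ci_job_id") and a placeholder namespace:

# Locate the pod that ran the affected job via the pod labels set in config.toml
# and inspect its logs and final state. <namespace> is a placeholder; the job ID
# is taken from the log line above.
kubectl -n <namespace> get pods -l gitlab=runner,ci_job_id=8540930992 -o wide
kubectl -n <namespace> logs -l ci_job_id=8540930992 --all-containers --tail=100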

Expected behaviour

No 403 responses, so that our gitlab-runners can report the finished job pods back to GitLab.

Relevant logs and/or screenshots

job log
WARNING: Submitting job to coordinator... job failed  bytesize=20764 checksum=crc32:75754b1d code=403 job=8540930992 job-status= runner=t3_qYADwj status=PUT https://gitlab.com/api/v4/jobs/8540930992: 403 Forbidden update-interval=0s

Environment description

Using AWS EKS 1.30 with gitlab-runner v17.6 (https://artifacthub.io/packages/helm/gitlab/gitlab-runner).

config.toml contents
[[runners]]
      output_limit = 8192
      environment = ["CI_SERVICE_HOST=127.0.0.1", "DOCKER_TLS_CERTDIR=", "FF_GITLAB_REGISTRY_HELPER_IMAGE=true", "FF_WAIT_FOR_POD_TO_BE_REACHABLE=true", "FF_TIMESTAMPS=true"]
      executor = "kubernetes"
      [runners.cache]
        Type = "s3"
        Path = "gitlab_runner"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "${ci_cache_bucket_name}"
          BucketLocation = "${region}"
          Insecure = false
      [runners.kubernetes]
        retry_limit = 5
        namespace = "${namespace}"
        image = "ubuntu:24.04"
        poll_timeout = 720 
        pod_termination_grace_period_seconds = 60
        privileged = true
        cpu_request = "1"
        cpu_request_overwrite_max_allowed = "16"
        memory_request = "8Gi"
        memory_limit = "12Gi"
        memory_request_overwrite_max_allowed = "32Gi"
        memory_limit_overwrite_max_allowed = "32Gi" 
        service_cpu_request = "500m"
        service_memory_request = "512Mi"
        service_memory_limit = "4Gi"
        helper_cpu_request = "500m"
        helper_memory_request = "256Mi"
        helper_memory_limit = "2Gi"
        [runners.kubernetes.pod_annotations]
          "karpenter.sh/do-not-evict" = "true"
        [runners.kubernetes.pod_labels]
          "ci_project_name" = "$CI_PROJECT_NAME"
          "ci_job_id" = "$CI_JOB_ID"
          "gitlab" = "runner"
        [runners.kubernetes.retry_limits]
            "TLS handshake timeout" = 10
            "tls: internal error" = 10

Used GitLab Runner version

gitlab-runner v17.6 (https://artifacthub.io/packages/helm/gitlab/gitlab-runner). Other required information can be found in the config.toml above.
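
The version actually running in the cluster can be confirmed from one of the runner pods; a minimal sketch with a placeholder namespace and pod name:

# Print the runner version reported by a running gitlab-runner pod.
# <namespace> and <runner-pod> are placeholders.
kubectl -n <namespace> exec <runner-pod> -- gitlab-runner --version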

Possible fixes

It looks like there may be a legitimate bug with token expiration, though, where the job gets stuck running on the GitLab side while the runner has finished (based on reading the code).

The legacy flow for expiration due to timeouts is:

  • GitLab gets a failed status from the runner when the job times out, via the update-job and append-trace endpoints
  • GitLab runs the BuildFinishedWorker, which expires/removes the token
  • The job ends up in a failed state

The new flow for timeouts appears to be:

  • The token is expired because of its JWT claims after the timeout
  • The runner tries to update the job via the update and append endpoints
  • GitLab can't find the expired token and returns a 403
  • The runner kills the job on the runner side, making no more requests to GitLab
  • The job is never set to failed in GitLab and keeps running indefinitely until the StuckCiJobsWorker gets to it
    • (This part is based on user reports and on reading the code and seeing no obvious status change when forbidden is returned; a full reproduction of the problem has not been completed by me.)

We need to fail the job when GitLab returns a 403 for an expired JWT token.
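
For reference, the failing request in the runner log above is the job update call (PUT /api/v4/jobs/:id). Once the job token's JWT has expired, an equivalent request gets the same 403; a minimal sketch, where the JOB-TOKEN header and the payload shape are assumptions about the runner API and the job ID is taken from the log above:

# Replay the runner's job update call with an expired job token.
# The JOB-TOKEN header and "state" field are assumptions; job ID from the log above.
curl -sS -o /dev/null -w '%{http_code}\n' \
  --request PUT "https://gitlab.com/api/v4/jobs/8540930992" \
  --header "JOB-TOKEN: <expired job token>" \
  --header "Content-Type: application/json" \
  --data '{"state": "success"}'
# Prints 403 once the token is no longer valid, matching the runner warning.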

Edited by Allison Browne