Catch external pod disruptions / terminations
What does this MR do?
When a pod disappears (e.g. spot/preemptible instances being reclaimed), we treat this as a `system_failure`.
Changes:

- It brings in some infrastructure:
  - a pod watcher: watches the build pod for terminal errors, i.e. deletions, disruptions, ... It uses an informer to get notified on pod changes (a sketch follows at the end of this section).
  - an informer factory: the pod watcher creates its informer from this factory. The factory keeps track of its own context and can thus manage its own lifetime.
- Changes to the kubernetes executor: For now, the changes are limited, and for the most part the current behavior and implementation stay as they are. However, for each build a new `PodWatcher`, and thus an informer factory, is created. This `PodWatcher` is then used in the following places to additionally notify the system about "issues":
  - `Prepare`: creates and starts the pod watcher, and thus the informer the watcher uses and the informer factory
  - `setupBuildPod`: sets the actual pod name we care about; this is important for when the executor retries on pull issues
  - `runWithAttach` & `runWithExecLegacy`: use the pod watcher to get notified immediately about pod issues
  - `Finish`: shuts down the pod watcher and its dependants

  Currently those "issues" include:
  - image pull issues
  - external deletion of the pod (actual deletion & setting of the deletion timestamp)
  - disruption events

  We get those events as the API server sees them and don't have to wait for the next poll to occur.
- New feature flag `FF_USE_INFORMERS`: As the pod watcher's informer needs additional RBAC permissions (list & watch on pods), this is a breaking change. Thus all of this is hidden behind a feature flag.
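For reference, the additional permissions roughly amount to a Role like the sketch below. The role and namespace names are placeholders, and the exact rules your setup needs may differ depending on what it already grants:

```yaml
# Sketch only: extra RBAC the informer needs when FF_USE_INFORMERS is enabled.
# "gitlab-runner-pod-watcher" and "build-namespace" are placeholder names.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-runner-pod-watcher
  namespace: build-namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]  # "list" & "watch" are the new requirements
```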
This MR has its roots in the exploration over here: !5213
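To make the infrastructure above a bit more concrete, here is a minimal, illustrative sketch of an informer-based pod watcher built with client-go (v0.26+). The names (`podWatcher`, `Errors`, ...), the error-channel wiring, and the exact conditions checked are assumptions for illustration, not the MR's actual implementation:

```go
// Illustrative sketch only - not the MR's actual code. It shows the general shape:
// a per-build pod watcher built on a client-go shared informer factory that is
// scoped to the single build pod and surfaces terminal pod errors on a channel.
package watchers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type podWatcher struct {
	factory informers.SharedInformerFactory
	errs    chan error
}

// newPodWatcher creates an informer factory filtered down to one pod, registers
// handlers for deletions & disruptions, and starts the informer.
func newPodWatcher(ctx context.Context, client kubernetes.Interface, namespace, podName string) (*podWatcher, error) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 0,
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			// Only list & watch the single build pod to keep API traffic minimal.
			o.FieldSelector = "metadata.name=" + podName
		}),
	)

	pw := &podWatcher{factory: factory, errs: make(chan error, 1)}

	_, err := factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj any) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			// A set deletion timestamp or a DisruptionTarget condition means the pod is going away.
			if pod.DeletionTimestamp != nil {
				pw.notify(fmt.Errorf("pod %s/%s is being deleted", pod.Namespace, pod.Name))
				return
			}
			for _, c := range pod.Status.Conditions {
				if c.Type == corev1.DisruptionTarget && c.Status == corev1.ConditionTrue {
					pw.notify(fmt.Errorf("pod %s/%s is being disrupted: %s", pod.Namespace, pod.Name, c.Message))
					return
				}
			}
		},
		DeleteFunc: func(any) {
			pw.notify(fmt.Errorf("pod %s/%s was deleted", namespace, podName))
		},
	})
	if err != nil {
		return nil, err
	}

	factory.Start(ctx.Done())            // starts the informer goroutines
	factory.WaitForCacheSync(ctx.Done()) // wait for the initial pod state

	return pw, nil
}

// notify records the first terminal error without ever blocking the informer.
func (p *podWatcher) notify(err error) {
	select {
	case p.errs <- err:
	default: // an error is already pending; one terminal error is enough
	}
}

// Errors exposes terminal pod errors, e.g. to be selected on while a build runs.
func (p *podWatcher) Errors() <-chan error { return p.errs }
```

An error channel like this can then be selected on next to the existing attach/poll logic (e.g. in `runWithAttach` / `runWithExecLegacy`), so a terminal pod error surfaces immediately and can be reported as a `system_failure`.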
Why was this MR needed?
There are cases where pod disruptions are only caught late because we poll the pod at relatively long intervals.
Example: the process in the build container talks to a service in a service container. When the pod gets disrupted (e.g. because it runs on a spot instance that is reclaimed), the service container is terminated by the kubelet, so the service is no longer reachable, the build script fails, and the job is reported as a `build_failure`.
What's the best way to test this MR?
- `runner.toml`:

  ```toml
  listen_address = ":9252"
  concurrent = 3
  check_interval = 1
  log_level = "debug"
  shutdown_timeout = 0

  [session_server]
    session_timeout = 1800

  [[runners]]
    name = "dm"
    limit = 3
    url = "https://gitlab.com/"
    id = 0
    token = "glrt-NopeNopeNope"
    token_obtained_at = 0001-01-01T00:00:00Z
    token_expires_at = 0001-01-01T00:00:00Z
    executor = "kubernetes"
    shell = "bash"
    [runners.kubernetes]
      image = "ubuntu:22.04"
      privileged = true
      [[runners.kubernetes.services]]
        name = "nginx"
      [[runners.kubernetes.volumes.empty_dir]]
        name = "docker-certs"
        mount_path = "/certs/client"
        medium = "Memory"
    [runners.feature_flags]
      FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
      FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
      FF_PRINT_POD_EVENTS = true
      FF_USE_FASTZIP = true
  ```
- `pipeline.yaml`:

  ```yaml
  stages:
    - test

  variables:
    DOCKER_HOST: tcp://docker:2376
    DOCKER_TLS_CERTDIR: "/certs"
    DOCKER_TLS_VERIFY: 1
    DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
    # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: true

  default:
    image: docker
    services:
      - docker:dind
      - nginx

  Test:
    stage: test
    retry:
      max: 2
      when: runner_system_failure
    script:
      - |
        while true ; do
          echo '====='
          docker info >/dev/null
          wget -O /dev/null http://nginx/
          sleep 10
          # exit 1
          # exit 0
        done
  ```
- run a build
- disrupt the pod (example commands follow after this list)
  - for EKS you can e.g. use the Fault Injection Service to mimic a spot instance termination
  - on GCP, preemptible instances and their termination can be mimicked by a shutdown / ACPI soft-off
  - you can evict the pod, e.g. with kubectl-evict
  - you can "just" delete the pod
- see that this results in a `system_failure` rather than a `build_failure`
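For the "disrupt the pod" step, the simplest variants are deleting the build pod or draining its node, for example (namespace, pod and node names are placeholders to adjust):

```shell
# Find the build pod.
kubectl get pods -n <build-namespace>

# Variant 1: just delete the build pod.
kubectl delete pod -n <build-namespace> <build-pod-name>

# Variant 2: drain the node the pod runs on, which evicts the pod
# and is closer to a real node disruption.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```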