Catch external pod disruptions / terminations
What does this MR do?
When a pod disappears (e.g. spot/preemptible instances being reclaimed), we treat this as a `system_failure`.
Changes:

- It brings in some infrastructure:
  - a pod watcher: watches the build pod for terminal errors, i.e. deletions, disruptions, ... It uses an informer to get notified on pod changes (a sketch follows at the end of this section).
  - an informer factory: the pod watcher creates its informer from this factory. The factory keeps track of its own context and can thus manage its own lifetime.
- Changes to the kubernetes executor: For now, the changes are limited, and for the most part the current behavior and implementation stay as they are. However, for each build a new `PodWatcher`, and thus an informer factory, is created. This `PodWatcher` is then used in the following places to additionally notify the system about "issues":
  - `Prepare`: creates and starts the pod watcher, and thus the informer the watcher uses and the informer factory
  - `setupBuildPod`: sets the actual pod name we care about; this is important for when the executor retries on pull issues
  - `runWithAttach` & `runWithExecLegacy`: use the pod watcher to get notified immediately about pod issues
  - `Finish`: shuts down the pod watcher and its dependants

  Currently those "issues" include:
  - image pull issues
  - external deletion of the pod (actual deletion & setting of the deletion timestamp)
  - disruption events

  We get those events as the API server sees them and don't have to wait for the next poll to occur.
- New feature flag `FF_USE_INFORMERS`: As the pod watcher's informer needs additional RBAC permissions (list & watch on pods), this is a breaking change. Thus all of this is hidden behind a feature flag.
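For reference, the additional permissions roughly amount to a Role like the sketch below. The role and namespace names are placeholders, and the exact rules your setup needs may differ depending on what it already grants:

```yaml
# Sketch only: extra RBAC the informer needs when FF_USE_INFORMERS is enabled.
# "gitlab-runner-pod-watcher" and "build-namespace" are placeholder names.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-runner-pod-watcher
  namespace: build-namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]  # "list" & "watch" are the new requirements
```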
This MR has its roots in the exploration over here: !5213
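To make the infrastructure above a bit more concrete, here is a minimal, illustrative sketch of an informer-based pod watcher built with client-go (v0.26+). The names (`podWatcher`, `Errors`, ...), the error-channel wiring, and the exact conditions checked are assumptions for illustration, not the MR's actual implementation:

```go
// Illustrative sketch only - not the MR's actual code. It shows the general shape:
// a per-build pod watcher built on a client-go shared informer factory that is
// scoped to the single build pod and surfaces terminal pod errors on a channel.
package watchers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type podWatcher struct {
	factory informers.SharedInformerFactory
	errs    chan error
}

// newPodWatcher creates an informer factory filtered down to one pod, registers
// handlers for deletions & disruptions, and starts the informer.
func newPodWatcher(ctx context.Context, client kubernetes.Interface, namespace, podName string) (*podWatcher, error) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 0,
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			// Only list & watch the single build pod to keep API traffic minimal.
			o.FieldSelector = "metadata.name=" + podName
		}),
	)

	pw := &podWatcher{factory: factory, errs: make(chan error, 1)}

	_, err := factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj any) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			// A set deletion timestamp or a DisruptionTarget condition means the pod is going away.
			if pod.DeletionTimestamp != nil {
				pw.notify(fmt.Errorf("pod %s/%s is being deleted", pod.Namespace, pod.Name))
				return
			}
			for _, c := range pod.Status.Conditions {
				if c.Type == corev1.DisruptionTarget && c.Status == corev1.ConditionTrue {
					pw.notify(fmt.Errorf("pod %s/%s is being disrupted: %s", pod.Namespace, pod.Name, c.Message))
					return
				}
			}
		},
		DeleteFunc: func(any) {
			pw.notify(fmt.Errorf("pod %s/%s was deleted", namespace, podName))
		},
	})
	if err != nil {
		return nil, err
	}

	factory.Start(ctx.Done())            // starts the informer goroutines
	factory.WaitForCacheSync(ctx.Done()) // wait for the initial pod state

	return pw, nil
}

// notify records the first terminal error without ever blocking the informer.
func (p *podWatcher) notify(err error) {
	select {
	case p.errs <- err:
	default: // an error is already pending; one terminal error is enough
	}
}

// Errors exposes terminal pod errors, e.g. to be selected on while a build runs.
func (p *podWatcher) Errors() <-chan error { return p.errs }
```

An error channel like this can then be selected on next to the existing attach/poll logic (e.g. in `runWithAttach` / `runWithExecLegacy`), so a terminal pod error surfaces immediately and can be reported as a `system_failure`.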
Why was this MR needed?
There are cases where pod disruptions are only caught late because we poll the pod at relatively long intervals.
Example: the process in the build container talks to a service in a service container. When the pod gets disrupted (e.g. because it runs on a spot instance that is reclaimed), the service container is terminated by the kubelet, so the service is no longer reachable, the build script fails, and the job is reported as a `build_failure`.
What's the best way to test this MR?
- `runner.toml`:

  ```toml
  listen_address = ":9252"
  concurrent = 3
  check_interval = 1
  log_level = "debug"
  shutdown_timeout = 0

  [session_server]
    session_timeout = 1800

  [[runners]]
    name = "dm"
    limit = 3
    url = "https://gitlab.com/"
    id = 0
    token = "glrt-NopeNopeNope"
    token_obtained_at = 0001-01-01T00:00:00Z
    token_expires_at = 0001-01-01T00:00:00Z
    executor = "kubernetes"
    shell = "bash"
    [runners.kubernetes]
      image = "ubuntu:22.04"
      privileged = true
      [[runners.kubernetes.services]]
        name = "nginx"
      [[runners.kubernetes.volumes.empty_dir]]
        name = "docker-certs"
        mount_path = "/certs/client"
        medium = "Memory"
    [runners.feature_flags]
      FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
      FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
      FF_PRINT_POD_EVENTS = true
      FF_USE_FASTZIP = true
  ```
- `pipeline.yaml`:

  ```yaml
  stages:
    - test

  variables:
    DOCKER_HOST: tcp://docker:2376
    DOCKER_TLS_CERTDIR: "/certs"
    DOCKER_TLS_VERIFY: 1
    DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
    # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: true

  default:
    image: docker
    services:
      - docker:dind
      - nginx

  Test:
    stage: test
    retry:
      max: 2
      when: runner_system_failure
    script:
      - |
        while true ; do
          echo '====='
          docker info >/dev/null
          wget -O /dev/null http://nginx/
          sleep 10
          # exit 1
          # exit 0
        done
  ```
- run a build
- disrupt the pod (example commands follow after this list)
  - for EKS you can e.g. use the Fault Injection Service to mimic a spot instance termination
  - on GCP, preemptible instances and their termination can be mimicked by a shutdown / ACPI soft-off
  - you can evict the pod, e.g. with kubectl-evict
  - you can "just" delete the pod
- see that this results in a `system_failure` rather than a `build_failure`
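For the "disrupt the pod" step, the simplest variants are deleting the build pod or draining its node, for example (namespace, pod and node names are placeholders to adjust):

```shell
# Find the build pod.
kubectl get pods -n <build-namespace>

# Variant 1: just delete the build pod.
kubectl delete pod -n <build-namespace> <build-pod-name>

# Variant 2: drain the node the pod runs on, which evicts the pod
# and is closer to a real node disruption.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```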