
Executors become orphaned when Runner pods die abruptly

Summary

When a Runner pod is killed or evicted for any reason, the executor pods it spawned are not cleaned up and remain behind in an orphaned state. The GitLab job that spawned them appears to keep running until it hits its timeout and eventually fails with the error "There has been a timeout failure or the job got stuck. Check your timeout limits or try again".

Steps to reproduce

  1. Use a basic CI job to trigger the Runner:

     ```yaml
     test-runner:
       image: alpine:edge
       stage: test
       tags:
         - test-runner
       script:
         - sleep 2m
     ```
  2. Manually kill the Runner pod to simulate a situation where this would be done by Kubernetes itself:

     `kubectl delete pod gitlab-geo-logcursor-7c8cd99b45-dwr7p --grace-period=0`

  3. Observe that the executor pods linger indefinitely until they are cleaned up some other way (see the check below).
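
A quick way to confirm the orphaned state (a sketch, assuming the default Kubernetes executor behaviour where build pods get a `runner-` name prefix; adjust the filter if you use custom pod names or `pod_labels`):

```shell
# List the build pods created by the Kubernetes executor. After the Runner pod
# has been deleted, these still show up as Running even though nothing is
# managing them anymore.
kubectl get pods --no-headers | grep '^runner-'
```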

Example Project

N/A (Runner-specific)

What is the current bug behavior?

Executor pods become orphaned when the Runner that spawned them is deleted/killed/evicted.

What is the expected correct behavior?

Ideally, when a replacement Runner pod comes up, it would first check for executor pods left behind by its predecessor and either recover them and resume the jobs from where they stopped, or at least clean them up so the started jobs and pods don't linger around as zombies.
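
Until the Runner does this itself, the only option is a manual sweep. A minimal sketch, assuming the orphaned executor pods can be identified by the default `runner-` name prefix (a label selector via `pod_labels` in `config.toml` would be a safer handle if you have one configured):

```shell
# Delete every lingering build pod left behind by a dead Runner.
# WARNING: this matches healthy build pods too, so only run it when no
# Runner replica is currently executing jobs in this namespace.
kubectl get pods --no-headers \
  | awk '/^runner-/ {print $1}' \
  | xargs -r kubectl delete pod
```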

Relevant logs and/or screenshots

  • Runner and sidecars started (screenshot: test1)

  • Pods linger indefinitely (screenshot: test2)

  • Job hanging for several minutes (screenshot: test3)

  • Executor pod logs (screenshot: test4)

Possible fixes
