
Executors become orphaned when Runner pods die abruptly

Summary

When a Runner pod is killed or evicted for any reason, the executor pods it spawned are not cleaned up and remain behind in an orphaned state. The GitLab job that spawned them appears to keep running until it hits its timeout and eventually fails with the error "There has been a timeout failure or the job got stuck. Check your timeout limits or try again".

Steps to reproduce

  1. Use a basic CI job to trigger the Runner:

     ```yaml
     test-runner:
       image: alpine:edge
       stage: test
       tags:
         - test-runner
       script:
         - sleep 2m
     ```
  2. Manually kill the Runner pod to simulate a situation where this would be done by Kubernetes itself:

     `kubectl delete pod gitlab-geo-logcursor-7c8cd99b45-dwr7p --grace-period=0`

  3. Observe that the executor pods linger indefinitely until they are cleaned up some other way (see the check below).
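
A quick way to confirm the orphaned state (a sketch, assuming the default Kubernetes executor behaviour where build pods get a `runner-` name prefix; adjust the filter if you use custom pod names or `pod_labels`):

```shell
# List the build pods created by the Kubernetes executor. After the Runner pod
# has been deleted, these still show up as Running even though nothing is
# managing them anymore.
kubectl get pods --no-headers | grep '^runner-'
```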

Example Project

N/A (Runner-specific)

What is the current bug behavior?

Executor pods become orphaned when the Runner that spawned them is deleted/killed/evicted.

What is the expected correct behavior?

Ideally, when a replacement Runner pod comes up, it would first check for executor pods left behind by its predecessor and either recover them and resume the jobs from where they stopped, or at least clean them up so the started jobs and pods don't linger around as zombies.
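
Until the Runner does this itself, the only option is a manual sweep. A minimal sketch, assuming the orphaned executor pods can be identified by the default `runner-` name prefix (a label selector via `pod_labels` in `config.toml` would be a safer handle if you have one configured):

```shell
# Delete every lingering build pod left behind by a dead Runner.
# WARNING: this matches healthy build pods too, so only run it when no
# Runner replica is currently executing jobs in this namespace.
kubectl get pods --no-headers \
  | awk '/^runner-/ {print $1}' \
  | xargs -r kubectl delete pod
```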

Relevant logs and/or screenshots

  • Runner and sidecars started (screenshot: test1)

  • Pods linger indefinitely (screenshot: test2)

  • Job hanging for several minutes (screenshot: test3)

  • Executor pod logs (screenshot: test4)

Possible fixes
