Terminating or interrupting a Runner Manager K8s pod causes runner worker pods and jobs to be orphaned and to exceed the defined job timeout
Summary
If the k8s runner pod that manages an ephemeral job is restarted, the ephemeral runner worker pod is orphaned. Orphaned means the worker pod is not cleaned up even after the job exits, so the ephemeral runner worker pod keeps running on the node and consuming resources.
Steps to reproduce
- Have a runner pod (runner-1) with a label (k8s-tiny) registered on gitlab.com with some runner ID, for example 12345.
- Run a job (job-1) with an x-minute timeout and the k8s-tiny tag.
- Assume job-1 lands on the ephemeral-1 pod belonging to runner-1.
- Delete the runner-1 pod from k8s.
- A new runner-1 pod gets created, but on gitlab.com this new runner pod has a different runner ID.
- Define the `timeout:` keyword in a job as 1 minute.
- Start a pipeline and terminate the pod while it is running the job.
- The job continues running beyond the 1 minute timeout, even though the pod was terminated.
- After 100 minutes, the job is marked as failed.
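
For illustration, the job configuration used in the steps above could look like the following `.gitlab-ci.yml` fragment. This is a minimal sketch, not the configuration of the example project: the job name `timeout-repro` and the `sleep` command are assumptions, chosen only so that the job outlives its timeout. The `k8s-tiny` tag and the 1 minute timeout come from the steps.

```yaml
# Hypothetical job name; any name works.
timeout-repro:
  tags:
    - k8s-tiny          # routes the job to the runner registered with this tag
  timeout: 1 minute     # job-level timeout from the reproduction steps
  script:
    # Assumed long-running command so the job is still running
    # when the runner pod is terminated.
    - sleep 600
```

With a job like this, terminating the runner pod shortly after the job starts should reproduce the behavior described: the job stays in the running state well past the 1 minute timeout.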
Example Project
Example job: https://gitlab.com/jdasmarinas/gitlab-runner-secrets-test/-/jobs/3725538578
What is the current bug behavior?
The job does not respect its timeout if the pod is terminated just after the job has started.
What is the expected correct behavior?
The job should be marked as failed in GitLab based on the timeout defined and should not wait for an update from the runner.
Relevant logs and/or screenshots
Output of checks
/label reproduced on GitLab.com