Unauthorized errors for running jobs after removing a runner with helm uninstall or an abrupt restart/re-register of the runner
Summary
A GitLab Ultimate customer raised a support ticket (internal links) to report that errors occur if:
- The runner restarts abruptly for some reason (and re-registers after restarting)
- They upgrade a runner
CI job pods started under the previous runner sometimes finish successfully, and sometimes stay stuck running for days.
Errors include:
- ERROR: Job failed (system failure): Unauthorized
- ERROR: Error cleaning up pod: Unauthorized, ERROR: Error cleaning up secrets: Unauthorized (but Job succeeded)
Steps to reproduce
Reproduced by @atanayno
- install runner via helm
- start a long-running job
- uninstall runner via helm uninstall
- job may complete with success, but there will be errors:
...
Cleaning up file based variables
ERROR: Error cleaning up pod: Unauthorized
ERROR: Error cleaning up secrets: Unauthorized
Job succeeded
Example Project
What is the current bug behavior?
A race condition: the uninstall prevents the runner from cleaning up jobs that are still running.
@steveazz hypothesised:
- Assumption: runners are deployed with RBAC roles and rolebindings created by the GitLab helm chart
- helm uninstall DELETES the roles and rolebindings, and starts terminating the runner pod
- The runner pod carries on running until the job completes
- The job finishes and the runner tries to delete it. But as the rolebinding was deleted a long time ago (helm delete), the runner pod doesn’t have access to delete pods/secrets
- Result: Unauthorized
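One way to probe the RBAC part of this hypothesis is to ask the cluster whether the runner's service account may still delete pods and secrets while a job is in flight. The sketch below wraps `kubectl auth can-i` with impersonation. The namespace and service account names are hypothetical placeholders, and the caller needs permission to impersonate service accounts. It has not been run against the affected cluster.

```python
import subprocess


def can_i_command(namespace, service_account, verb, resource):
    """Build a `kubectl auth can-i` command that impersonates a service account."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return ["kubectl", "auth", "can-i", verb, resource,
            "--namespace", namespace, "--as", subject]


def runner_can_clean_up(namespace, service_account, run=subprocess.run):
    """Report whether the service account may delete pods and secrets.

    `run` is injectable so the check can be exercised without a cluster.
    """
    allowed = {}
    for resource in ("pods", "secrets"):
        cmd = can_i_command(namespace, service_account, "delete", resource)
        result = run(cmd, capture_output=True, text=True)
        # kubectl prints "yes" or "no" on stdout
        allowed[resource] = result.stdout.strip() == "yes"
    return allowed
```

Calling `runner_can_clean_up("gitlab", "gitlab-runner")` (both names are assumptions) before and after `helm uninstall` should show the answers flip from allowed to denied if the roles and rolebindings are indeed removed while the job is still running.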
What is the expected correct behavior?
Partial workaround
Before upgrading a runner deployed with helm:
- Pause it in GitLab
- Wait for any running jobs to complete
- Then perform the upgrade
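The workaround steps above could be scripted roughly as follows. This is a minimal sketch, assuming the GitLab REST API v4 runner endpoints (`PUT /runners/:id` with `paused`, available in recent GitLab versions, and `GET /runners/:id/jobs` filtered by `status=running`) and a token allowed to manage the runner. The function names, polling interval and timeout are illustrative, not part of the original report.

```python
import json
import time
import urllib.parse
import urllib.request


def gitlab_request(base_url, token, method, path, params=None):
    """Send one GitLab REST API v4 request and return the decoded JSON body."""
    url = f"{base_url}/api/v4{path}"
    data = None
    if params and method == "GET":
        url += "?" + urllib.parse.urlencode(params)
    elif params:
        data = urllib.parse.urlencode(params).encode()
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"PRIVATE-TOKEN": token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def drain_runner(call, runner_id, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
    """Pause the runner, then wait until it reports no running jobs.

    `call(method, path, params)` performs one API request.
    Returns True once the runner is idle, False if the timeout is reached.
    """
    # Step 1: pause the runner so it stops picking up new jobs
    call("PUT", f"/runners/{runner_id}", {"paused": "true"})
    waited = 0
    while True:
        # Step 2: wait for any running jobs to complete
        running = call("GET", f"/runners/{runner_id}/jobs", {"status": "running"})
        if not running:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

A caller would bind `call` to `gitlab_request` with its own instance URL and token, and only run the helm upgrade once `drain_runner` returns True. A False result means jobs are still running and the upgrade would hit the same race.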
Relevant logs and/or screenshots
Possible fixes
Edited by Ben Prescott