Unauthorized errors for running jobs after removing a runner with helm uninstall or an abrupt restart/re-register of the runner

Summary

A GitLab Ultimate customer raised a support ticket (internal links) reporting that errors occur when:

  • The runner restarts abruptly for some reason and re-registers afterwards
  • The runner is upgraded

CI job pods started by the previous runner sometimes finish successfully, but sometimes stay stuck in a running state for days.

Errors include:

  • ERROR: Job failed (system failure): Unauthorized
  • ERROR: Error cleaning up pod: Unauthorized and ERROR: Error cleaning up secrets: Unauthorized (although the job itself succeeds)

Steps to reproduce

Reproduced by @atanayno

  • Install a runner via helm
  • Start a long-running job
  • Uninstall the runner with helm uninstall while the job is still running
  • The job may complete successfully, but the log shows cleanup errors:
...
Cleaning up file based variables
ERROR: Error cleaning up pod: Unauthorized
ERROR: Error cleaning up secrets: Unauthorized
Job succeeded 

Example Project

What is the current bug behavior?

There is a race condition: the uninstall takes effect while jobs are still running, which prevents the runner from cleaning up those jobs when they finish.

@steveazz hypothesised:

  • Assumption: runners are deployed with RBAC roles and rolebindings created by the GitLab helm chart
  • helm uninstall deletes the roles and rolebindings, and starts terminating the runner pod
  • The runner pod carries on running until the job completes
  • When the job finishes, the runner tries to delete the job pod and its secrets. Because the rolebinding was deleted long before (by helm uninstall), the runner pod no longer has permission to delete pods or secrets. Result: Unauthorized
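One way to check this hypothesis on a live cluster is to ask Kubernetes whether the runner's service account can still delete pods and secrets after helm uninstall has run. The sketch below is only an illustration: it shells out to `kubectl auth can-i` with impersonation, and the namespace and service account names are placeholders that depend on how the chart was installed. Impersonation requires the calling user to have sufficient cluster permissions.

```python
import subprocess


def can_i_command(verb, resource, namespace, service_account):
    """Build a `kubectl auth can-i` call that impersonates the service account."""
    return [
        "kubectl", "auth", "can-i", verb, resource,
        "--namespace", namespace,
        "--as", f"system:serviceaccount:{namespace}:{service_account}",
    ]


def runner_can_clean_up(namespace, service_account, run=subprocess.run):
    """Return {resource: bool} for the permissions the runner needs at cleanup.

    `run` is injectable so the function can be exercised without a cluster.
    """
    result = {}
    for resource in ("pods", "secrets"):
        proc = run(
            can_i_command("delete", resource, namespace, service_account),
            capture_output=True,
            text=True,
        )
        # kubectl prints "yes" or "no" on stdout.
        result[resource] = proc.stdout.strip() == "yes"
    return result
```

If the hypothesis holds, calling `runner_can_clean_up("<namespace>", "<service-account>")` after the uninstall, while the runner pod is still terminating, would be expected to report `False` for both resources.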

What is the expected correct behavior?

Partial workaround

Before upgrading a runner deployed with helm

  • Pause it in GitLab
  • Wait for any running jobs to complete
  • Then perform the upgrade
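The workaround steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not a tested procedure: `GITLAB_URL`, the runner ID and the token are placeholders, the token must be allowed to manage the runner, and the script relies on the GitLab REST API runner endpoints (`PUT /runners/:id` with `paused`, and `GET /runners/:id/jobs` filtered by `status=running`). Older GitLab versions use `active=false` instead of `paused=true`.

```python
import json
import time
import urllib.parse
import urllib.request

# Placeholder, not taken from the issue.
GITLAB_URL = "https://gitlab.example.com"


def api(method, path, token, data=None):
    """Call the GitLab REST API and return the decoded JSON body."""
    body = urllib.parse.urlencode(data).encode() if data else None
    req = urllib.request.Request(
        f"{GITLAB_URL}/api/v4{path}",
        data=body,
        method=method,
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def pause_runner(runner_id, token):
    # A paused runner stops picking up new jobs; running jobs continue.
    return api("PUT", f"/runners/{runner_id}", token, {"paused": "true"})


def running_jobs(runner_id, token):
    # Lists jobs on this runner that are still in progress.
    return api("GET", f"/runners/{runner_id}/jobs?status=running", token)


def wait_until_idle(count_running, poll_seconds=30, timeout_seconds=3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll count_running() until it returns 0; return False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if count_running() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A caller would run `pause_runner(runner_id, token)`, then `wait_until_idle(lambda: len(running_jobs(runner_id, token)))`, and only start the helm upgrade once that returns `True`. This is still only a partial workaround, because it does not cover abrupt restarts of the runner.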

Relevant logs and/or screenshots

Possible fixes

Edited by Ben Prescott_