Unauthorized errors for running jobs after removing a runner with helm uninstall or an abrupt restart/re-register of the runner

Summary

A GitLab Ultimate customer raised a support ticket (internal links) to report that errors occur if:

The runner restarts abruptly for some reason (and re-registers after restarting)
They upgrade a runner

CI job pods started by the previous runner sometimes finish successfully and sometimes stay stuck in a running state for days.

Errors include:

ERROR: Job failed (system failure): Unauthorized
ERROR: Error cleaning up pod: Unauthorized, ERROR: Error cleaning up secrets: Unauthorized (but Job succeeded)

Steps to reproduce

Reproduced by @atanayno

Install the runner via helm
Start a long-running job
Uninstall the runner via helm uninstall
The job may complete successfully, but errors are logged:

...
Cleaning up file based variables
ERROR: Error cleaning up pod: Unauthorized
ERROR: Error cleaning up secrets: Unauthorized
Job succeeded

Example Project

What is the current bug behavior?

A race condition in which the uninstall prevents the runner from cleaning up the running jobs.

@steveazz hypothesised:

Assumption: runners are deployed with RBAC roles and rolebindings created by the GitLab helm chart
helm uninstall DELETES the roles and rolebinding, and starts terminating the runner pod
The runner pod carries on running until the job completes
The job finishes and the runner tries to delete it. But because the rolebinding was deleted long before (by helm delete), the runner pod no longer has access to delete pods/secrets. Result: Unauthorized
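
The hypothesised sequence can be illustrated with a small toy model. This is only a sketch of the hypothesis as stated above, not the runner's code or the real Kubernetes API: the Cluster class, the helm_uninstall and finish_job functions, and the "gitlab-runner" and "runner-job-pod" names are all hypothetical and exist only for this illustration.

```python
class Unauthorized(Exception):
    """Raised when the caller has no binding that grants the requested verb."""


class Cluster:
    """Toy stand-in for the cluster: a set of role bindings and a set of pods."""

    def __init__(self):
        self.bindings = set()  # entries are (service_account, verb, resource)
        self.pods = set()

    def delete_pod(self, service_account, pod):
        # Access is checked at call time, so a binding removed earlier matters now.
        if (service_account, "delete", "pods") not in self.bindings:
            raise Unauthorized("Unauthorized")
        self.pods.discard(pod)


def helm_uninstall(cluster, service_account):
    """Model the assumption: uninstall removes the RBAC bindings immediately."""
    cluster.bindings = {b for b in cluster.bindings if b[0] != service_account}


def finish_job(cluster, service_account, pod):
    """Runner-side cleanup that happens after the job script has finished."""
    log = []
    try:
        cluster.delete_pod(service_account, pod)
    except Unauthorized as exc:
        log.append(f"ERROR: Error cleaning up pod: {exc}")
    log.append("Job succeeded")
    return log


cluster = Cluster()
cluster.bindings.add(("gitlab-runner", "delete", "pods"))
cluster.pods.add("runner-job-pod")

# The uninstall happens while the job is still running...
helm_uninstall(cluster, "gitlab-runner")
# ...so cleanup at the end of the job is rejected and the pod is left behind.
log = finish_job(cluster, "gitlab-runner", "runner-job-pod")
print("\n".join(log))
```

In this model the job itself still reports success, but the job pod is never removed, which matches the log lines in the reproduction above.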

What is the expected correct behavior?

Partial workaround

Before upgrading a runner deployed with helm:

Pause it in GitLab
Wait for any running jobs to complete
Then perform the upgrade
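
The pause-and-wait part of this workaround could be scripted against the GitLab REST API. The following is a minimal sketch under several assumptions: GITLAB_URL, RUNNER_ID and TOKEN are placeholders, the helper names are hypothetical, and the instance is assumed to accept the paused attribute on PUT /runners/:id (older versions used active: false instead). Nothing is sent over the network until fetch_json is called.

```python
import json
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # assumption: placeholder instance URL
RUNNER_ID = 42                             # assumption: placeholder runner ID
TOKEN = "<access-token>"                   # assumption: token allowed to manage the runner


def pause_request(base_url, runner_id, token):
    """Build the request that pauses the runner so it picks up no new jobs."""
    return urllib.request.Request(
        f"{base_url}/api/v4/runners/{runner_id}",
        data=json.dumps({"paused": True}).encode(),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="PUT",
    )


def running_jobs_request(base_url, runner_id, token):
    """Build the request that lists jobs this runner is still executing."""
    return urllib.request.Request(
        f"{base_url}/api/v4/runners/{runner_id}/jobs?status=running",
        headers={"PRIVATE-TOKEN": token},
    )


def fetch_json(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_until_idle(fetch_running_jobs, sleep, poll_seconds=30):
    """Poll until no running jobs are reported; return the number of polls."""
    polls = 1
    while fetch_running_jobs():
        sleep(poll_seconds)
        polls += 1
    return polls
```

A caller would send pause_request(...) through fetch_json, then call wait_until_idle with a function that fetches running_jobs_request(...) and with time.sleep, and only run the helm upgrade once it returns. Passing the fetch and sleep functions in keeps the polling loop easy to exercise without a live GitLab instance.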

Relevant logs and/or screenshots

Possible fixes

Edited Nov 05, 2020 by Ben Prescott (ex-GitLab)