Automated Runner Key and Registration Rotation
Problem to solve
Runner registration and authentication keys need to be rotated periodically (e.g., daily), or at least GitLab should provide support for key rotation. The registration token used for attaching runners is visible to users in the browser, and any runner that has the token can register with it. During registration, the runner receives an authentication token, which it can store locally and reuse on restart. Because both the registration token and the authentication token live until manually revoked, old copies of these tokens can be compromised and used, e.g., by an ex-employee, resulting in arbitrary code running from arbitrary locations while masquerading as a legitimate runner.
Users that manage their own runners would have more confidence that those runners are authentic.
We are trying to use GitLab.com with runners on EKS K8s clusters in AWS. We believe the lack of runner key rotation will force us to move away from the cloud version of GitLab and host it in our own AWS account, which we would prefer not to do.
An API extension that resets the runner registration token (as the GUI already allows) and returns the new token would let user-provided automation, running with sufficient privilege, read or change the registration token and store it as a Kubernetes secret. If a runner needs to re-register, it will pull the new version of the secret, and old registration tokens will be unusable.
The above would be sufficient for a secure rotation if the automation reset the registration token, stopped all known runners, unregistered all registered runners, and then restarted the known runners. The items below describe further changes that lead to a less draconian solution.
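The automation described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the reset call is passed in as a function because the API proposed here is not assumed to exist, and the names `reset_registration_token`, `apply_manifest`, the secret name `gitlab-runner-registration`, the namespace `gitlab-runner`, and the key `runner-registration-token` are all assumptions made for the example.

```python
import base64
from typing import Callable


def build_registration_secret(token: str,
                              name: str = "gitlab-runner-registration",
                              namespace: str = "gitlab-runner") -> dict:
    """Return a Kubernetes Secret manifest holding the registration token.

    The secret name, namespace, and data key are hypothetical defaults.
    """
    # Kubernetes expects values under "data" to be base64-encoded.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"runner-registration-token": encoded},
    }


def rotate_registration_token(reset_registration_token: Callable[[], str],
                              apply_manifest: Callable[[dict], None]) -> None:
    """Reset the token, then publish the new value as a secret.

    reset_registration_token: hypothetical wrapper around the proposed API;
        it must invalidate the old token and return the new one.
    apply_manifest: hypothetical function that applies a manifest to the
        cluster (for example, by calling a Kubernetes client).
    """
    new_token = reset_registration_token()
    apply_manifest(build_registration_secret(new_token))
```

Passing both operations in as callables keeps the sketch independent of any particular GitLab or Kubernetes client library.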
Provide a new API where the runner can rotate its authentication token, which will invalidate the old token, and return the new token. Enhance the runners to allow authentication token rotation.
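The intended semantics of that rotation call can be illustrated with a small in-memory model. This is only a sketch of the behavior requested here, assuming a hypothetical `RunnerTokenStore` class; it does not describe how GitLab stores or validates tokens.

```python
import secrets
from typing import Dict


class RunnerTokenStore:
    """In-memory stand-in for the server side of the proposed rotation API."""

    def __init__(self) -> None:
        # Maps each currently valid authentication token to its runner ID.
        self._runner_by_token: Dict[str, str] = {}

    def register(self, runner_id: str) -> str:
        """Issue the first authentication token for a runner."""
        token = secrets.token_urlsafe(20)
        self._runner_by_token[token] = runner_id
        return token

    def is_valid(self, token: str) -> bool:
        return token in self._runner_by_token

    def rotate(self, old_token: str) -> str:
        """Invalidate old_token and return a replacement for the same runner."""
        runner_id = self._runner_by_token.pop(old_token, None)
        if runner_id is None:
            # An unknown or already-rotated token must not yield a new one.
            raise PermissionError("unknown or already-rotated token")
        new_token = secrets.token_urlsafe(20)
        self._runner_by_token[new_token] = runner_id
        return new_token
```

The key property is that a token can be exchanged exactly once: after rotation, a stolen copy of the old token can neither authenticate nor obtain a fresh token.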
Enhance the Kubernetes Helm chart to add a health check which will fail if the runner hits "ERROR: Runner https://gitlab.com/o74wqTfcbaVMy3hWjYC4 is not healthy and will be disabled!" That would cause a new pod to be created, which will re-register with the new token. This may or may not require runner changes. Or, more simply, the runner should exit on that error.
Some mechanism is needed to detect that an authentication token was not rotated within a time window, and to force an unregister. This may be possible currently with user-provided automation, depending on the semantics of the "token" field in the runner details. Or some minor extension to that API may be required.
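The detection step could look something like the sketch below. It assumes the automation can somehow learn when each runner last rotated its token; the `last_rotated` mapping and the function name `find_stale_runners` are hypothetical, and how those timestamps would be obtained is exactly the open question above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_stale_runners(last_rotated: Dict[str, datetime],
                       max_age: timedelta,
                       now: datetime) -> List[str]:
    """Return IDs of runners whose token is older than max_age.

    last_rotated: hypothetical mapping of runner ID to the time its
        authentication token was last rotated.
    """
    cutoff = now - max_age
    # A token rotated strictly before the cutoff has missed its window.
    return sorted(runner_id
                  for runner_id, rotated_at in last_rotated.items()
                  if rotated_at < cutoff)
```

The returned IDs would then be passed to whatever call unregisters a runner, forcing it to re-register with the current registration token.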
Permissions and Security
Resetting and reading the runner registration token should require the same privileges as are needed to do this via the browser.
The runner API documentation would need updating, as well as the API to allow registration token reset. Depending on how this is realized, other documentation may need changing.
The registration token reset is an existing feature, so exposing it via an API should not require testing beyond that API, which should include negative testing for unauthorized users, etc. Authentication token rotation would require testing of the API change, as well as ensuring that the browser view is unaffected by rotation.
What does success look like, and how can we measure that?
If I take a three-day-old (or three-hour-old) registration token and a three-day-old authentication token, I will not be able to use them to add runners in any form. Ideally, users will perceive no downtime or glitches due to this.
Links / references
- [Feature flag] Rollout of `enforce_runner_token... (#352008 - closed)