Avoid partial failures in API calls due to bugs in auditing code
We recently had an incident caused by a bug introduced while refactoring the runner audit logging code. It affected runners registered with the deprecated runner registration token method. Although the runners themselves were created successfully, the auditing that happens afterwards raised an exception, causing the client request to fail with an HTTP 500 error. The user's script in turn kept retrying the runner creation, to the point where a single namespace held more than 90K "never contacted" runners. This later triggered a secondary incident: users were over the 1000 runner limit and couldn't register any more runners until they got back below the limit.
Proposals
Proposal 1
If an error is raised in the audit code, we log the failure but proceed with the request. The downside is that we'd have runners created for which there is no audit log entry.
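A minimal sketch of this proposal, written in Python for illustration only and not taken from the actual codebase. The names `register_runner_best_effort`, `store`, and `audit` are hypothetical: `store` is an in-memory list standing in for the database, and `audit` is any callable that may raise.

```python
import logging

logger = logging.getLogger("runner_audit")


def register_runner_best_effort(store, attrs, audit):
    """Create a runner, then audit it; an audit failure never fails the request.

    All names here are hypothetical stand-ins for the real service objects.
    """
    runner = {"id": len(store) + 1, **attrs}
    store.append(runner)  # the runner is persisted before auditing

    try:
        audit(runner)
    except Exception:
        # Record the failure so the missing audit entry can be investigated,
        # but do not propagate it to the client.
        logger.exception("Audit logging failed for runner %s", runner["id"])

    return runner
```

With this shape the client gets a success response and has no reason to retry, at the cost of a possible gap in the audit log that is only visible through the error log.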
Proposal 2
If an error is raised in the audit code, we try to roll back the action (in the case of a runner registration, we delete the runner before returning the error to the client). Some actions might be harder to roll back though, such as modifying or deleting a runner.
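A minimal sketch of the rollback variant for the registration case, again in Python for illustration only. The names `register_runner_with_rollback`, `store`, and `audit` are hypothetical, and the in-memory list stands in for the database; in a real service the same effect would more likely come from running both steps inside one database transaction.

```python
def register_runner_with_rollback(store, attrs, audit):
    """Create a runner, then audit it; undo the creation if auditing fails.

    All names here are hypothetical stand-ins for the real service objects.
    """
    runner = {"id": len(store) + 1, **attrs}
    store.append(runner)

    try:
        audit(runner)
    except Exception:
        # Compensating action: delete the runner that was just created,
        # then re-raise so the client still sees the error.
        store.remove(runner)
        raise

    return runner
```

The client still receives an error and may retry, but each failed attempt leaves no runner behind, so retries cannot accumulate runners in the namespace. Updates and deletions would need the previous state saved before the change in order to be undone the same way.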