Incident Review: InactiveTokensDeletionCronWorker
Key Information
| Metric | Value |
|---|---|
| Customers Affected | 220 |
| Requests Affected | API access and other workflows that depend on an access token were unavailable because the token was unexpectedly deleted |
| Incident Severity | High - S2 |
| Start Time | Incident declared at 10:19 UTC on 11 September 2024 as severity 3 and escalated to severity 2 at 10:51 UTC. The offending worker ran at 2024-09-11 00:00:00 UTC |
| End Time | Incident mitigated at 12:02 UTC on 11 September 2024 |
| Total Duration | ~2 hours from declaration to mitigation, though this understates the impact window: the root cause was a cron job that had already run at 00:00 UTC |
| Link to Incident Issue | #18548 (closed) |
Summary
ResourceAccessTokens::InactiveTokensDeletionCronWorker, an automation job that deletes bot users whose associated access tokens became inactive more than 30 days ago, unintentionally removed bot users that still had active tokens.
Details
ResourceAccessTokens::InactiveTokensDeletionCronWorker runs daily at 00:00 UTC. It was deployed on Sep 10, 2024, at 19:03, and its first run was on Sep 11, 2024, at 00:00.
The worker was implemented under the assumption that a bot user can only be associated with one access token. The job iterates over bot users in batches and deletes a bot user if it is associated with an access token that became inactive more than 30 days ago.
After the first report we started investigating and identified that when rotating a group or project access token, we revoke the old token and create a new token for the bot user, but retain the old revoked token in the DB. This means that after rotation a bot user is associated with more than one access token: the new token and the old revoked token. That is the root cause of the issue: bots associated with 1,456 tokens that had been rotated were unintentionally removed.
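To make the failure mode concrete, here is a minimal, self-contained Ruby sketch (plain Ruby with simplified stand-ins, not the actual worker code): after rotation a bot carries both the old revoked token and the new active one, so a check of the form "delete the bot if it has a token that became inactive more than 30 days ago" still matches it, while a check requiring all of the bot's tokens to be inactive does not.

```ruby
require 'date'

# Simplified stand-in for an access token; "inactive" means revoked or expired.
Token = Struct.new(:revoked, :expires_at) do
  def inactive_before?(cutoff)
    revoked || (expires_at && expires_at < cutoff)
  end
end

cutoff = Date.today - 30

# A rotated bot keeps its old revoked token alongside the new active one.
rotated_bot_tokens = [
  Token.new(true,  Date.today - 60), # old token, revoked at rotation
  Token.new(false, Date.today + 300) # new token, still active
]

# Flawed selection: the bot has *a* token that is inactive past the cutoff.
delete_flawed = rotated_bot_tokens.any? { |t| t.inactive_before?(cutoff) }

# Safer selection: *every* token of the bot is inactive past the cutoff.
delete_safe = rotated_bot_tokens.all? { |t| t.inactive_before?(cutoff) }

puts "flawed check deletes bot: #{delete_flawed}" # => true (active bot deleted)
puts "safer check deletes bot:  #{delete_safe}"   # => false
```

The real worker operates on database records in batches; the sketch only isolates the any-vs-all distinction that caused bots with active tokens to be selected for deletion.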
We disabled the worker on GitLab.com and removed it from the code to ensure self-managed instances are not impacted.
Outcomes/Corrective Actions
Please see the corrective action issues linked to this review.
- Revise the mandatory rollout plan for all features related to tokens, or features that can perform destructive DB updates.
  - This includes the use of feature flags (where possible), peer review of the rollout plan, soft deletion, batched runs, internal production group testing, etc. See the MR for this change.
- Redesign expired and revoked token deletion to perform a phased cleanup: first remove expired and revoked tokens, and only then remove any orphaned bots (see the sketch after this list).
- Introduce audit events for bot deletions triggered by expired tokens.
- Expand test coverage around token bot memberships.
- Potential database constraints to flag modification of relations that are still in active use.
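As a rough illustration of the phased cleanup proposed above (a sketch over assumed in-memory data, not the actual redesign), deleting inactive tokens first and only afterwards deleting bots that have no tokens left means a bot with any surviving active token can never be removed:

```ruby
require 'date'

Token = Struct.new(:bot_id, :inactive_since) # inactive_since nil => still active
Bot   = Struct.new(:id)

cutoff = Date.today - 30
bots   = [Bot.new(1), Bot.new(2)]
tokens = [
  Token.new(1, Date.today - 90), # bot 1: only an old inactive token
  Token.new(2, Date.today - 90), # bot 2: old inactive token from rotation...
  Token.new(2, nil)              # ...plus a new token that is still active
]

# Phase 1: delete only tokens that became inactive more than 30 days ago.
tokens.reject! { |t| t.inactive_since && t.inactive_since < cutoff }

# Phase 2: delete only bots that are left with no tokens at all.
orphaned_bots = bots.reject { |b| tokens.any? { |t| t.bot_id == b.id } }

puts "bots deleted: #{orphaned_bots.map(&:id)}" # => [1]; bot 2 survives
```

Because the bot deletion phase only looks for bots with no remaining tokens, leftover revoked tokens from rotation can no longer cause a bot with an active token to be removed.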
Learning Opportunities
What went well?
- First customer reports arrived at 08:50 UTC; the root cause was identified and the worker was disabled on GitLab.com by 10:24 UTC. The MR removing the offending job for self-managed users was merged by 12:28 UTC. The root cause was identified fairly quickly.
- A query was available to identify the affected bots and their parent projects/namespaces.
What was difficult?
- There wasn't a way to restore access for these bots aside from comparing a diff against a DBLabs backup, which may have been a few hours behind.
- There aren't dedicated audit events for bots that were removed by the cron job.
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack.
  :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  If you have any review feedback please add it to <ISSUE_LINK>.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Use the description section to write a few paragraphs explaining what happened.
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions.
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- Close the review before the due date.