Incident Review: InactiveTokensDeletionCronWorker
Key Information
| Metric | Value |
|---|---|
| Customers Affected | 220 |
| Requests Affected | API access and other workflows that depend on an access token were unavailable because the token was unexpectedly deleted |
| Incident Severity | High - S2 |
| Start Time | Incident declared at 10:19 UTC on 11 September 2024 as severity 3 and escalated to severity 2 at 10:51 UTC. The offending worker ran at 2024-09-11 00:00:00 UTC |
| End Time | Incident mitigated at 12:02 UTC on 11 September 2024 |
| Total Duration | ~2 hours from declaration to mitigation, though this understates the impact window: the root cause was a cron job that had already run at 00:00 UTC |
| Link to Incident Issue | #18548 (closed) |
Summary
ResourceAccessTokens::InactiveTokensDeletionCronWorker, an automation job that deletes bot users whose associated access tokens became inactive more than 30 days ago, unintentionally removed bot users that still had active tokens.
Details
ResourceAccessTokens::InactiveTokensDeletionCronWorker runs daily at 00:00 UTC. It was deployed on Sep 10, 2024, at 19:03, and its first run was on Sep 11, 2024, at 00:00.
The worker was implemented under the assumption that a bot user can only be associated with one access token. The job iterates over bot users in batches and deletes a bot user if it is associated with an access token that became inactive more than 30 days ago.
After the first report we started investigating and identified that when rotating a group or project access token, we revoke the old token and create a new token for the bot user, but retain the old revoked token in the DB. This means that after rotation a bot user is associated with more than one access token: the new token and the old revoked token. That is the root cause of the issue: bots associated with 1,456 tokens that had been rotated were unintentionally removed.
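To make the failure mode concrete, here is a minimal, self-contained Ruby sketch (plain Ruby with simplified stand-ins, not the actual worker code): after rotation a bot carries both the old revoked token and the new active one, so a check of the form "delete the bot if it has a token that became inactive more than 30 days ago" still matches it, while a check requiring all of the bot's tokens to be inactive does not.

```ruby
require 'date'

# Simplified stand-in for an access token; "inactive" means revoked or expired.
Token = Struct.new(:revoked, :expires_at) do
  def inactive_before?(cutoff)
    revoked || (expires_at && expires_at < cutoff)
  end
end

cutoff = Date.today - 30

# A rotated bot keeps its old revoked token alongside the new active one.
rotated_bot_tokens = [
  Token.new(true,  Date.today - 60), # old token, revoked at rotation
  Token.new(false, Date.today + 300) # new token, still active
]

# Flawed selection: the bot has *a* token that is inactive past the cutoff.
delete_flawed = rotated_bot_tokens.any? { |t| t.inactive_before?(cutoff) }

# Safer selection: *every* token of the bot is inactive past the cutoff.
delete_safe = rotated_bot_tokens.all? { |t| t.inactive_before?(cutoff) }

puts "flawed check deletes bot: #{delete_flawed}" # => true (active bot deleted)
puts "safer check deletes bot:  #{delete_safe}"   # => false
```

The real worker operates on database records in batches; the sketch only isolates the any-vs-all distinction that caused bots with active tokens to be selected for deletion.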
We disabled the worker on GitLab.com and removed it from the code to ensure self-managed instances are not impacted.
Outcomes/Corrective Actions
Please see the corrective action issues linked to this review.
- Revise the mandatory rollout plan for all features related to tokens, or features that can perform destructive DB updates.
  - This includes the use of feature flags (where possible), peer review of the rollout plan, soft deletion, batched runs, internal production group testing, etc. See the MR for this change.
- Redesign expired and revoked token deletion to perform a phased cleanup: first remove expired and revoked tokens, and only then remove any orphaned bots (see the sketch after this list).
- Introduce audit events for bot deletions triggered by expired tokens.
- Expand test coverage around token bot memberships.
- Potential database constraints to flag modification of relations that are still in active use.
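As a rough illustration of the phased cleanup proposed above (a sketch over assumed in-memory data, not the actual redesign), deleting inactive tokens first and only afterwards deleting bots that have no tokens left means a bot with any surviving active token can never be removed:

```ruby
require 'date'

Token = Struct.new(:bot_id, :inactive_since) # inactive_since nil => still active
Bot   = Struct.new(:id)

cutoff = Date.today - 30
bots   = [Bot.new(1), Bot.new(2)]
tokens = [
  Token.new(1, Date.today - 90), # bot 1: only an old inactive token
  Token.new(2, Date.today - 90), # bot 2: old inactive token from rotation...
  Token.new(2, nil)              # ...plus a new token that is still active
]

# Phase 1: delete only tokens that became inactive more than 30 days ago.
tokens.reject! { |t| t.inactive_since && t.inactive_since < cutoff }

# Phase 2: delete only bots that are left with no tokens at all.
orphaned_bots = bots.reject { |b| tokens.any? { |t| t.bot_id == b.id } }

puts "bots deleted: #{orphaned_bots.map(&:id)}" # => [1]; bot 2 survives
```

Because the bot deletion phase only looks for bots with no remaining tokens, leftover revoked tokens from rotation can no longer cause a bot with an active token to be removed.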
Learning Opportunities
What went well?
- First customer reports arrived at 08:50 UTC; the root cause was identified and the worker was disabled on GitLab.com by 10:24 UTC. The MR removing the offending job for self-managed users was merged by 12:28 UTC. The root cause was identified fairly quickly.
- A query was available to identify the affected bots and their parent projects/namespaces.
What was difficult?
- There wasn't a way to restore access for these bots aside from comparing a diff against a DBLabs backup, which may have been a few hours behind.
- There aren't dedicated audit events for bots that were removed by the cron job.
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack.
  :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  If you have any review feedback please add it to <ISSUE_LINK>.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Use the description section to write a few paragraphs explaining what happened.
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions.
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- Close the review before the due date.