2021-12-15 Database migration removed some epic to issue relationships
Current Status
An MR, gitlab-org/gitlab!73007 (merged), was deployed to GitLab.com and, due to a bug, removed some valid epic/issue relationships. As a result, some epics are missing issues that should be related to them (the issues themselves are untouched).
The MR was reverted with gitlab-org/gitlab!76815 (merged).
Status update 2021-12-15 09:20: We’ve successfully restored the majority of the epic/issue relations deleted by this bug and are working on restoring the rest with gitlab-org/gitlab#348547 (closed). If you think you were affected by this issue, please feel free to let us know with a comment.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-12-15
- 01:10 - @cablett declares incident in Slack.
- 01:11 - @cablett identifies the root cause.
- 01:29 - gitlab-org/gitlab!76815 (merged) created to prevent the issue from happening on self-managed instances.
- 01:36 - Restoration plan in progress.
- 01:54 - Downgraded from severity1 to severity3 because less than 25% of users were affected.
- 05:00 - Restored the 3-hour snapshot to evaluate the recovery of associations.
- 05:51 - Increased from severity3 to severity2 due to a better understanding of the seriousness of the issue and the users affected.
- 07:46 - Ran a script to restore associations for affected epics using data from an earlier database snapshot (3 hours since the incident).
- 08:31 - Script completes.
- 09:00 - Incident changed from active to mitigated.
- 09:00 - We decide to end the incident call, with follow-up actions for continued analysis.
- 09:10 - Status page updated.
- 13:56 - Ran a script to fix epics affected during the 2-hour window between the snapshot and the execution of the bug.
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Harden the review process for risky MRs/data migrations:
  - Add a new MR label for destructive data migrations gitlab-org/gitlab!77256 (merged)
- Remove/reduce the probability of human error resulting in real data loss:
  - Require a dry-run or other analysis of queries removing data gitlab-org/gitlab!77256 (merged)
- Improve time to diagnosis:
  - Require the MR author to describe possible symptoms of malfunction/failure gitlab-org/gitlab!77256 (merged)
- Improve time to mitigation:
  - Require the MR author to describe a reversion plan in the description, unless the MR is automatically reversible gitlab-org/gitlab!77256 (merged)
  - Add to IMOC duties to assemble a 'bench' during an incident: maintainer and reviewer(s). gitlab-com/www-gitlab-com!95868 (merged)
  - Enforce no data migrations in Security MRs, which greatly helped in this case gitlab-org/release/docs!417 (merged)
- Improve understanding of severity:
  - Update the guidelines to establish S1 as the default severity in situations where data loss is suspected. gitlab-com/www-gitlab-com!95867 (merged)
Corrective Actions (future)
- Open discussion with the database team about the possibility of domain-specific DB reviewers/maintainers on each product team gitlab-org/gitlab#349127
- Run the query writing only to structured logs in the first instance, and inspect the logs for accuracy gitlab-org/gitlab#349532 (see the sketch after this list)
- Automatically publish the start and end of migrations in production as widely as Feature Flag changes are published delivery#2167 (comment 784168537)
- Create a tool for establishing customer tier/size from group ids gitlab-org/gitlab#349046 (closed)
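A hedged sketch of the "write only to structured logs in the first instance" idea from the list above: the destructive query's scope is iterated read-only and each matching row is emitted as a JSON log line that can be inspected (and later replayed) before the real DELETE ships. The model names and scope below are assumptions based on the epic/issue domain, not the actual code.

```ruby
# Illustrative "log first, delete later" sketch. EpicIssue/Issue and the scope
# below are assumed for the example; no rows are modified.
require "json"
require "logger"

logger = Logger.new($stdout)

EpicIssue
  .where(issue_id: Issue.where(project_id: nil).select(:id)) # hypothetical scope
  .find_each(batch_size: 1_000) do |link|
    logger.info(JSON.generate(
      event:    "epic_issue_link_would_be_removed",
      id:       link.id,
      epic_id:  link.epic_id,
      issue_id: link.issue_id
    ))
  end
```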
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out the relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers observed some issues disappearing from epics. In some cases all issues disappeared, in other cases only some. No system notes accompanied the removals.
- How many customers were affected?
  - The exact number is unknown; 2157 groups were affected, some of which are subgroups of each other.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - 99% of affected epics had missing issues for at least 7 hours. The long tail of 313 remaining epics saw the same behavior for a further 7 hours.
What were the root causes?
Why wasn't this caught in review?
- There wasn't a dry-run against real-world data
- No-one noticed that the original query in gitlab-org/gitlab!73007 (merged) didn't quite match the one in https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/1626
Why wasn't this dry-run against real-world data?
- Data migrations don't have any special review process for dry-running them.
- To do so voluntarily might expose sensitive data on the MR.
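One concrete form such a dry-run could take is running the destructive query's predicate as a read-only count against a production replica before the DELETE ships, so nothing sensitive needs to be pasted into the MR. The sketch below is illustrative only: the table names come from the epic/issue domain, but the WHERE condition is a hypothetical stand-in, not the actual query from gitlab-org/gitlab!73007.

```ruby
# Illustrative dry-run sketch (not the actual migration). The WHERE clause is
# a hypothetical stand-in; the point is that the number of rows the DELETE
# would touch is surfaced before any data is removed.
class DryRunEpicIssueCleanupCheck < ActiveRecord::Migration[6.1]
  def up
    affected = select_value(<<~SQL)
      SELECT COUNT(*)
      FROM epic_issues
      WHERE issue_id IN (
        SELECT id FROM issues WHERE project_id IS NULL -- hypothetical condition
      )
    SQL

    say "epic_issues rows the real migration would delete: #{affected}"
    # A large mismatch between this count and the estimate in the MR
    # description is a strong signal to stop before shipping the DELETE.
  end

  def down
    # no-op: this check only reads data
  end
end
```

Because only an aggregate count leaves the database, a check like this can be shared on the MR without exposing row-level customer data.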
Why did no-one notice the queries didn't match?
- The previous queries didn't come up in review; there may have been an assumption that they were the same.
- MRs with background migrations do not normally show the exact SQL of the migration itself, only the SQL used for batch sizing (though this one did)
- Hierarchies are quite complex and this blip was an easy thing to miss.
- The focus of the review was on performance and batch size
Why was the focus on performance and batch size?
- This is typical for database reviews. It's critical to ensure the database can reliably handle migrations, especially on high-traffic tables, and focusing there was the right thing to do, following proper DB procedure.
Why was there no "down" migration to undo?
- Since this is a hard delete of records, it was not really possible to know in advance how to restore the records.
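For context, one pattern that can keep even a hard delete reversible is capturing the doomed rows into an archive table in the same statement, so a down migration can restore them. This is a hedged sketch under assumed table names and a hypothetical predicate, not the approach used in the actual MR.

```ruby
# Illustrative sketch of a reversible hard delete (not the actual migration).
# Deleted rows are captured into an archive table via DELETE ... RETURNING so
# that `down` can put them back. The WHERE clause is a hypothetical stand-in.
class RemoveStaleEpicIssueLinks < ActiveRecord::Migration[6.1]
  def up
    execute(<<~SQL)
      CREATE TABLE IF NOT EXISTS epic_issues_removed_20211215 AS
        SELECT * FROM epic_issues WHERE FALSE
    SQL

    execute(<<~SQL)
      WITH doomed AS (
        DELETE FROM epic_issues
        WHERE issue_id IN (SELECT id FROM issues WHERE project_id IS NULL) -- hypothetical
        RETURNING *
      )
      INSERT INTO epic_issues_removed_20211215 SELECT * FROM doomed
    SQL
  end

  def down
    execute(<<~SQL)
      INSERT INTO epic_issues
      SELECT * FROM epic_issues_removed_20211215
      ON CONFLICT DO NOTHING
    SQL
  end
end
```

The archive table also doubles as an exact record of what was removed, which is the information that had to be reconstructed from a stale snapshot during this incident.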
Incident Response Analysis
- How was the incident detected?
  - External and internal support requests
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - Looked at recent changes to epics
  - Identified one MR with a destructive migration that could have had an impact
  - Identified it as the probable root cause based on interrogation of the code and the timing of the roll-out
- How could time to diagnosis be improved?
  - A log of MRs with the potential data loss described, should the migration fail. Reviewing such a list would've quickly exposed the problematic MR.
- How did we reach the point where we knew how to mitigate the impact?
  - Comparison of a stale DB backup to current data revealed the extent of the effect. After this, further action could be planned using system notes to close the 2-hour gap between the backup and the migration roll-out. (A sketch of this comparison follows after this list.)
- How could time to mitigation be improved?
  - The migration should've been reversible. Three mitigations were required when one would've done.
  - Better common understanding of severity and customer impact. The demotion to S3 nearly made it harder to mitigate.
  - Declaring the incident sooner
- What went well?
  - The migration was split from the original security MR in the first place, making it much easier to recover the data.
  - The path to a majority fix was pursued first; this was correct and solved 98.7% of the problem in about half of the time to total resolution.
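As referenced in the mitigation answer above, a hedged sketch of what the snapshot-versus-live comparison might look like once the stale backup is restored alongside the live database (here assumed to live in a separate `snapshot` schema; all names are assumptions): rows present in the snapshot but absent from the live table are the candidates to restore.

```ruby
# Illustrative snapshot-vs-live comparison (schema/table names are assumed).
# Rows in the snapshot with no counterpart in the live table are the
# epic/issue links removed by the faulty migration.
require "json"
require "logger"

logger = Logger.new($stdout)

missing_links = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT s.id, s.epic_id, s.issue_id
  FROM snapshot.epic_issues AS s
  LEFT JOIN public.epic_issues AS live ON live.id = s.id
  WHERE live.id IS NULL
SQL

missing_links.each do |row|
  logger.info(JSON.generate(row.merge("event" => "epic_issue_link_missing_from_live")))
end
```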
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)