RCA for introduction of unplanned 16.[012] required upgrade stop
Please note: if the incident relates to sensitive data or is security-related, consider labeling this issue with security and marking it confidential, or create it in a private repository.
There is now a separate internal-only RCA template for SIRT issues; see https://about.gitlab.com/handbook/security/root-cause-analysis.html
Summary
A brief summary of what happened. Try to make it as executive-friendly as possible.
- Service(s) affected: Self-managed instances trying to upgrade
- Team attribution: TBD
- Minutes downtime or degradation:
Impact & Metrics
Start with the following:
Question | Answer |
---|---|
What was the impact? | Some self-managed instances would be unexpectedly unable to upgrade to 16.3 directly and must backtrack to 16.0, 16.1 and 16.2 first |
Who was impacted? | Customers with large instances or user counts |
How did this impact customers? | Some customers would experience downtime while attempting the upgrade, and the upgrade would then fail if it skipped 16.0, 16.1, or 16.2 |
How many attempts made to access? | |
How many customers affected? | ~20-80 |
How many customers tried to access? | Unknown |
Include any additional metrics that are of relevance.
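To make the impact above concrete, here is a minimal sketch of how a required-stop list changes the path to 16.3. It assumes only the three stops named in this RCA (16.0, 16.1, 16.2). The names `REQUIRED_STOPS`, `parse`, and `upgrade_path` are hypothetical and do not reflect GitLab's actual upgrade tooling or its complete list of required stops.

```python
from typing import List, Tuple

Version = Tuple[int, int]

# Assumption: only the stops discussed in this RCA. The real list of
# required stops is longer and is maintained elsewhere.
REQUIRED_STOPS: List[Version] = [(16, 0), (16, 1), (16, 2)]


def parse(version: str) -> Version:
    """Turn 'major.minor[.patch]' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str,
                 stops: List[Version] = REQUIRED_STOPS) -> List[str]:
    """Return the versions to install, in order, to get from current to target."""
    cur, tgt = parse(current), parse(target)
    if tgt < cur:
        raise ValueError("target must not be older than current")
    if tgt == cur:
        return []
    # Every required stop strictly between current and target must be visited.
    path = [stop for stop in sorted(stops) if cur < stop < tgt]
    path.append(tgt)
    return [f"{major}.{minor}" for major, minor in path]


print(upgrade_path("15.11", "16.3"))  # ['16.0', '16.1', '16.2', '16.3']
```

Under these assumptions, an instance on 15.11 that expected a single hop to 16.3 instead needs four upgrades, which is the unexpected backtracking described in the table.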
Issues:
- Discussion around 16.0 not needing to be a stop - #412696 (closed)
- Discovery 16.0 might need to be a stop - !127278 (closed)
- Discovery of migrations taking a long time - gitlab-org/charts/gitlab#4854 (comment 1453334155)
- Finalization introduction - gitlab-org/charts/gitlab#4854 (comment 1494924183)
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
Support Interest groupdistribution
Detection & Response
Start with the following:
Question | Answer |
---|---|
When was the incident detected? | 2023-07-21 |
How was the incident detected? | Contributor noticed upgrade path included 16.0 |
Did alarming work as expected? | n/a |
How long did it take from the start of the incident to its detection? | 2 weeks |
How long did it take from detection to remediation? | 2 weeks |
What steps were taken to remediate? | Added 16.0 as an upgrade stop |
Were there any issues with the response? | An unexpected upgrade stop was added after the fact, following the introduction of the new scheduling process |
MR Checklist
Consider these questions if a code change introduced the issue.
Question | Answer |
---|---|
Was the MR acceptance checklist marked as reviewed in the MR? | |
Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | |
Timeline
16.0
2023-07-21
- Discovered that 16.0 might actually need to be an upgrade stop
2023-08-01
- Found finalization causing upgrade stop
2023-08-09
- Affected user counts: greater than 30k
2023-08-10
- Added 16.0 as an upgrade stop
16.1
2023-09-04 - Received a customer support request
2023-09-25 - Confirmed 16.1 is an upgrade stop and identified the related MR
2023-10-26 - group::distribution was informed that 16.1 is an upgrade stop
16.2
2023-09-16 - Received a customer support request and a follow up request
2023-09-23 - Issue Long-running finalize migrations causing unacce... (#426052 - closed) created for investigation
2023-09-26 - Confirmed 16.2 is an upgrade stop and also another case for 16.0
2023-10-26 - group::distribution was informed that 16.2 is an upgrade stop
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident and to create mechanisms to prevent it from recurring. A root cause can never be a person; the writing must refer to the system and the context rather than to specific actors.
Follow the "5 whys" in a blameless manner as the core of the root cause analysis.
For this, start with the incident and ask why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions go deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
Example of the usage of "5 whys"
The vehicle will not start. (the problem)
- Why? - The battery is dead.
- Why? - The alternator is not functioning.
- Why? - The alternator belt has broken.
- Why? - The alternator belt was well beyond its useful service life and not replaced.
- Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
- Was the MR acceptance checklist marked as reviewed in the MR?
- Should the checklist be updated to help reduce chances of future recurrences?
Corrective actions
- List issues that have been created as corrective actions from this incident.
- For each issue, include the following:
  - <Bare issue link> - Issue labeled as corrective action.
  - An estimated date of completion of the corrective action.
  - The named individual who owns the delivery of the corrective action.