2020-10-22: 13.5.0 was released with broken upgrade path from 13.4.4
Summary
13.5.0 was released with a broken upgrade path from 13.4.4.
This bug was fixed in gitlab-org/omnibus-gitlab!4661 (merged).
See also gitlab-org/omnibus-gitlab#5743 (closed).
Questions to answer:
- Whether we missed something with our upgrade testing on the preprod/release instances. Right now it seems like we did, but we need to check the specific upgrade paths and configuration (a sketch of such a check follows below).
- How we can avoid this type of issue in the future, where a team believes a fix was included but it actually missed the release deadline.
More information will be added as we investigate the issue.
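As a starting point for the first question above, here is a minimal sketch of the kind of explicit upgrade-path check that would have caught this, assuming a throwaway apt-based test host. The package channel, versions, and apt usage are illustrative assumptions, not our actual preprod/release test tooling.

```shell
# Minimal sketch of an explicit upgrade-path check on a throwaway host.
# Package channel, versions, and apt flags are illustrative assumptions.
set -e
sudo apt-get update
sudo apt-get install -y gitlab-ee=13.4.4-ee.0   # start from the prior patch release
sudo apt-get install -y gitlab-ee=13.5.0-ee.0   # upgrade to the candidate package
sudo gitlab-ctl reconfigure                     # with the bug present, this step errors out
echo "upgrade path 13.4.4 -> 13.5.0 OK"
```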
Timeline
All times UTC

- 2020-10-05
  - gitlab-org/omnibus-gitlab!4591 (merged) is merged. This leads to errors in `gitlab-ctl reconfigure`, causing the upgrade to fail.
- 2020-10-16
  - gitlab-org/omnibus-gitlab!4589 (merged) is merged, which exposes the existence of the underlying bug.
- 2020-10-18
  - gitlab-org/omnibus-gitlab#5728 (closed) is opened when the Quality team notices failures due to the change in gitlab-org/omnibus-gitlab!4589 (merged).
- 2020-10-19
  - The candidate commit to be released on the 22nd is shared in #development, #backend, and #frontend.
- 2020-10-20
  - gitlab-org/omnibus-gitlab!4661 (merged) is merged to fix the issue.
- 2020-10-21
  - 16:11 - GitLab 13.5.0-RC43 is tagged: https://gitlab.slack.com/archives/C0139MAV672/p1603293105259100
  - 19:17 - 13.5.0-rc43 is successfully deployed to the Pre environment: https://gitlab.slack.com/archives/C8PKBH3M5/p1603303632106800
  - 21:35 - 13.5.0-rc43 is successfully deployed to the Release environment: https://gitlab.slack.com/archives/C8PKBH3M5/p1603312500107600
- 2020-10-22
  - 13:50 - GitLab 13.5.0 is released without the fix.
  - 15:51 - A user experienced an upgrade failure, opened an issue, and posted on Twitter. The tweet has since been deleted.
  - 21:10 - GitLab 13.5.1 is released with the fix (a recovery sketch for affected instances follows this timeline).
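For a self-managed instance that hit the failed 13.5.0 upgrade, recovery once 13.5.1 shipped would look roughly like the following. The package name and versions are shown for illustration, and `gitlab-ctl reconfigure` is safe to re-run because the underlying Chef run is idempotent.

```shell
# Recovery sketch for an apt-based Omnibus install whose 13.5.0 upgrade
# failed during `gitlab-ctl reconfigure` (versions shown for illustration).
sudo apt-get update
sudo apt-get install gitlab-ee=13.5.1-ee.0   # patched package; post-install re-runs reconfigure
sudo gitlab-ctl reconfigure                  # safe to re-run if the post-install step was interrupted
sudo gitlab-ctl status                       # confirm all services are back up
```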
Incident Review
Summary
- Service(s) affected: Self-managed instances
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident?
- External customer
- What was the customer experience during the incident?
- Upgrade failure on a self-managed instance
- How many customers were affected?
- 1 known; if left unpatched, this would have affected almost all users of the Docker image (see the sketch after this list)
- If a precise customer impact number is unknown, what is the estimated potential impact?
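Since the Docker image was the widest exposure: Docker-based instances upgrade by pulling a new image tag, so anyone moving from 13.4.4 to the 13.5.0 tag would have hit the failing reconfigure step when the new container started. The tag names below follow the public gitlab/gitlab-ee convention and are shown for illustration.

```shell
# Upgrade path for a Docker-based instance (tag names are illustrative).
docker pull gitlab/gitlab-ee:13.5.0-ee.0   # affected image; its startup reconfigure fails when upgrading from 13.4.4
docker pull gitlab/gitlab-ee:13.5.1-ee.0   # patched image; skipping 13.5.0 avoids the failure entirely
```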
Incident Response Analysis
- How was the event detected?
- Customer communications
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- Once the report was received, the cause was recognized, and it was noticed that the fix for the issue had not been included in 13.5.0.
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- Engineering knowledge. Once the report was received, it was clear what had happened.
- How could time to diagnosis be improved?
- Unknown
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Unknown
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Timeline
- 2020-10-22
  - 13:50 - GitLab 13.5.0 is released without the fix.
  - 15:51 - A user experienced an upgrade failure, opened an issue, and posted on Twitter. The tweet has since been deleted.
  - 21:10 - GitLab 13.5.1 is released with the fix.
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)