Post mortem for outage caused by 10.7.0-rc5-ee deployment - 2018-04-16
Context
On 2018-04-16 a series of outages occurred during the 10.7.0 rc5 deployment. The root cause was related to caching: multiple versions of the application shared the same cached objects, which caused exceptions during the deploy. For more information see https://gitlab.com/gitlab-com/infrastructure/issues/4041#note_68616781
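The failure mode is worth illustrating: during a rolling deploy, nodes on the old and new versions read and write the same cache, so a node can deserialize an object written by a different version of the code and raise. Below is a minimal sketch in Python of one way to avoid the collision by namespacing cache keys with the application version; it is illustrative only, not GitLab's actual caching code, and `APP_VERSION` and `cache_fetch` are hypothetical names.

```python
import pickle

APP_VERSION = "10.7.0-rc5"  # hypothetical: stamped at deploy time

cache = {}  # stand-in for a shared cache such as Redis or Memcached


def versioned_key(key: str) -> str:
    # Namespacing keys by application version keeps old and new nodes from
    # reading each other's serialized objects during a rolling deploy.
    return f"{APP_VERSION}:{key}"


def cache_fetch(key: str, compute):
    # Fetch-or-compute against the version-scoped key.
    k = versioned_key(key)
    if k in cache:
        return pickle.loads(cache[k])
    value = compute()
    cache[k] = pickle.dumps(value)
    return value

# Without the version prefix, a node running the new release could unpickle
# an object cached by the previous release whose class layout has changed,
# raising exceptions like the ones seen during this deploy.
```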
Timeline
Date: 2018-04-16
- 06:15 UTC - Requested deploying to production
- 06:24 UTC - OK to deploy to production
- 06:43 UTC - api05 and git05 were down
- 06:44 UTC - Redeploying api05 and git05
- 07:36 UTC - Deployment cancelled (it hadn't started yet) due to intermittent worker timeouts
- 09:39 UTC - OK to deploy to production
- 09:45 UTC - Started warming up the packages.
- 09:47 UTC - Error: `Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)`
- 09:50 UTC - Retried and it worked; finished warming up the packages (see the retry sketch after the timeline)
- 09:51 UTC - Started deployment
- 09:59 UTC - Stopped deployment - api10 and api03 didn't respond to `service` commands (possibly systemd related)
- 10:09 UTC - api10 and api03 nodes were rebooted
- 10:14 UTC - Resumed deployment
- 11:13 UTC - Deployment stuck for a long time on `gitlab-base-stor-nfs`
- 11:13 UTC - Cancelled and resumed the deployment
- 11:25 UTC - 500s on GitLab.com
- 11:25 UTC - At this point, we were starting to deploy on the sidekiq nodes
- 12:40 UTC - Deployment stuck on web nodes
- 12:45 UTC - Had to issue a bunch of HUPs (manually) until they all responded (https://gitlab.com/gitlab-org/takeoff/issues/58)
- 12:49 UTC - Deployment resumed
- 12:51 UTC - Deployment stopped - `Net::SSH::ConnectionTimeout` - api13 was down
- 12:57 UTC - Redeploying api13
- 13:11 UTC - api13 is back, resumed deployment
- 13:11 UTC - Deployment stopped, as api14 is now down
- 13:14 UTC - api14 rebooted, resumed the deployment
- 14:08 UTC - API nodes were taking a long time to respond to a restart; api05 seemed to be the cause
- 14:09 UTC - Resumed the deployment
- 14:56 UTC - Deployment finished
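The apt lock error at 09:47 was cleared by a manual retry three minutes later. As a rough sketch only (the command, attempt count, and delay are assumptions, not part of the deployment tooling), package warm-up could retry automatically while the lock is held:

```python
import subprocess
import time


def apt_update_with_retry(attempts: int = 5, delay: int = 30) -> None:
    """Run `apt-get update`, retrying while another process holds the apt lock."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["apt-get", "update"], capture_output=True, text=True)
        if result.returncode == 0:
            return
        # "Could not get lock /var/lib/apt/lists/lock" usually means another
        # apt/dpkg process (e.g. unattended-upgrades) is running; back off and retry.
        if "Could not get lock" in result.stderr and attempt < attempts:
            time.sleep(delay)
            continue
        raise RuntimeError(f"apt-get update failed: {result.stderr.strip()}")
```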
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post mortem.
Start with the production incident and ask why it happened; once there is an explanation, keep asking why until we reach five whys.
It is not a hard rule that it has to be five times, but it helps to keep questioning in order to dig deeper toward the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- https://gitlab.com/gitlab-org/gitlab-ce/issues/45175
- https://gitlab.com/gitlab-org/gitlab-ee/issues/5571
- https://gitlab.com/gitlab-org/gitlab-ce/issues/45402