Post mortem for outage caused by 10.7.0-rc5-ee deployment - 2018-04-16
Context
On 2018-04-16 a series of outages occurred during the 10.7.0 rc5 deployment. The root cause was related to caching: multiple versions of the application shared the same cached objects, which caused exceptions during the deploy. For more information see https://gitlab.com/gitlab-com/infrastructure/issues/4041#note_68616781
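The failure mode is worth illustrating: during a rolling deploy, nodes on the old and new versions read and write the same cache, so a node can deserialize an object written by a different version of the code and raise. Below is a minimal sketch in Python of one way to avoid the collision by namespacing cache keys with the application version; it is illustrative only, not GitLab's actual caching code, and `APP_VERSION` and `cache_fetch` are hypothetical names.

```python
import pickle

APP_VERSION = "10.7.0-rc5"  # hypothetical: stamped at deploy time

cache = {}  # stand-in for a shared cache such as Redis or Memcached


def versioned_key(key: str) -> str:
    # Namespacing keys by application version keeps old and new nodes from
    # reading each other's serialized objects during a rolling deploy.
    return f"{APP_VERSION}:{key}"


def cache_fetch(key: str, compute):
    # Fetch-or-compute against the version-scoped key.
    k = versioned_key(key)
    if k in cache:
        return pickle.loads(cache[k])
    value = compute()
    cache[k] = pickle.dumps(value)
    return value

# Without the version prefix, a node running the new release could unpickle
# an object cached by the previous release whose class layout has changed,
# raising exceptions like the ones seen during this deploy.
```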
Timeline
Date: 2018-04-16
- 06:15 UTC - Requested deploying to production
- 06:24 UTC - OK to deploy to production
- 06:43 UTC - api05 and git05 were down
- 06:44 UTC - Redeploying api05 and git05
- 07:36 UTC - Deployment cancelled (it hadn't started yet) due to intermittent worker timeouts
- 09:39 UTC - OK to deploy to production
- 09:45 UTC - Started warming up the packages.
- 09:47 UTC - Error: `Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)`
- 09:50 UTC - Retried and it worked; finished warming up the packages (see the retry sketch after the timeline)
- 09:51 UTC - Started deployment
- 09:59 UTC - Stopped deployment - api10 and api03 didn't respond to `service` commands (possibly systemd related)
- 10:09 UTC - api10 and api03 nodes were rebooted
- 10:14 UTC - Resumed deployment
- 11:13 UTC - Deployment stuck for a long time on `gitlab-base-stor-nfs`
- 11:13 UTC - Cancelled and resumed the deployment
- 11:25 UTC - 500s on GitLab.com
- 11:25 UTC - At this point, we were starting to deploy on the sidekiq nodes
- 12:40 UTC - Deployment stuck on web nodes
- 12:45 UTC - Had to issue a bunch of HUPs (manually) until they all responded (https://gitlab.com/gitlab-org/takeoff/issues/58)
- 12:49 UTC - Deployment resumed
- 12:51 UTC - Deployment stopped - `Net::SSH::ConnectionTimeout` - api13 was down
- 12:57 UTC - Redeploying api13
- 13:11 UTC - api13 is back, resumed deployment
- 13:11 UTC - Deployment stopped, as api14 is now down
- 13:14 UTC - api14 rebooted, resumed the deployment
- 14:08 UTC - API nodes were taking a long time to respond to a restart; api05 seemed to be the cause
- 14:09 UTC - Resumed the deployment
- 14:56 UTC - Deployment finished
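The apt lock error at 09:47 was cleared by a manual retry three minutes later. As a rough sketch only (the command, attempt count, and delay are assumptions, not part of the deployment tooling), package warm-up could retry automatically while the lock is held:

```python
import subprocess
import time


def apt_update_with_retry(attempts: int = 5, delay: int = 30) -> None:
    """Run `apt-get update`, retrying while another process holds the apt lock."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["apt-get", "update"], capture_output=True, text=True)
        if result.returncode == 0:
            return
        # "Could not get lock /var/lib/apt/lists/lock" usually means another
        # apt/dpkg process (e.g. unattended-upgrades) is running; back off and retry.
        if "Could not get lock" in result.stderr and attempt < attempts:
            time.sleep(delay)
            continue
        raise RuntimeError(f"apt-get update failed: {result.stderr.strip()}")
```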
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post mortem.
Start with the production incident and ask why it happened; once there is an explanation, keep asking why until we reach five whys.
It is not a hard rule that it has to be five times, but it helps to keep questioning in order to dig deeper toward the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- https://gitlab.com/gitlab-org/gitlab-ce/issues/45175
- https://gitlab.com/gitlab-org/gitlab-ee/issues/5571
- https://gitlab.com/gitlab-org/gitlab-ce/issues/45402