10.2.4 rolled back by chef-client after deployment

Context

After 10.2.4 was deployed, chef-client silently rolled every environment back to 10.2.3 as soon as it was started again.

Timeline

On date: 2017-12-08

(Times are taken from each web-01 server's apt history.log. They do not exactly match when the deployment script was run, but they give an idea.)

09:26:22 UTC - 10.2.4 deployed on staging
10:14:52 UTC - staging rolled back to 10.2.3 silently
10:47:27 UTC - 10.2.4 deployed on canary
11:28:23 UTC - canary rolled back to 10.2.3 silently
12:27:18 UTC - 10.2.4 deployed on production
14:31:52 UTC - production rolled back to 10.2.3 silently
~23:10 UTC - we discovered that production was running 10.2.3 and started an investigation.
~00:10 UTC (next day) - we found that the {staging,canary,gitlab}-omnibus-version roles still had 10.2.3 in them.
00:21 UTC - staging and canary roles updated to 10.2.4 while investigation continued
00:48 UTC - version in the production role bumped up
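
The deploy and rollback times above were read from apt history.log. A minimal sketch of how such events might be pulled out of that file follows. The package name gitlab-ee, the version strings, and the sample log contents are assumptions made for illustration and are not copied from the actual servers.

```python
import re
from datetime import datetime

# Hypothetical sample in the layout apt writes to /var/log/apt/history.log.
# The package name and version strings are assumptions for illustration.
SAMPLE_LOG = """\
Start-Date: 2017-12-08  12:27:18
Commandline: apt-get install gitlab-ee=10.2.4-ee.0
Upgrade: gitlab-ee:amd64 (10.2.3-ee.0, 10.2.4-ee.0)
End-Date: 2017-12-08  12:29:40

Start-Date: 2017-12-08  14:31:52
Downgrade: gitlab-ee:amd64 (10.2.4-ee.0, 10.2.3-ee.0)
End-Date: 2017-12-08  14:34:05
"""

# Lines that record a version change, e.g. "Upgrade: pkg:arch (old, new), ..."
ACTION_RE = re.compile(r"^(Upgrade|Downgrade): (.*)$")
# One "name:arch (old, new)" entry inside such a line.
ENTRY_RE = re.compile(r"(\S+?)(?::\S+)? \(([^,()]+), ([^,()]+)\)")


def version_changes(log_text, package):
    """Return (start_time, action, old_version, new_version) tuples for one package."""
    changes = []
    started = None
    for line in log_text.splitlines():
        if line.startswith("Start-Date:"):
            # apt separates date and time with two spaces; normalise to one.
            stamp = " ".join(line.split()[1:])
            started = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            continue
        match = ACTION_RE.match(line)
        if match is None or started is None:
            continue
        action, entries = match.groups()
        for name, old, new in ENTRY_RE.findall(entries):
            if name == package:
                changes.append((started, action, old, new))
    return changes


if __name__ == "__main__":
    for started, action, old, new in version_changes(SAMPLE_LOG, "gitlab-ee"):
        print(f"{started:%H:%M:%S} {action}: {old} -> {new}")
```

A Downgrade entry shortly after an Upgrade entry for the same package is the signature of the silent rollback described in this timeline.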

Slack conversation start: https://gitlab.slack.com/archives/C101F3796/p1512774793000066

Incident Analysis

How was the incident detected?
Is there anything that could have been done to improve the time to detection?
How was the root cause discovered?
Was this incident triggered by a change?
Was there an existing issue that would have either prevented this incident or reduced the impact?

Root Cause Analysis

Follow the 5 whys in a blameless manner as the core of the post mortem.

To do this, start with the production incident and ask why it happened. Once there is an explanation, keep asking why until we reach 5 whys.

It's not a hard rule that it has to be 5 times, but it helps to keep questioning to get deeper in finding the actual root cause. Additionally, from one why there may come more than one answer, consider following the different branches.

A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.

For example:

At 00:00 UTC something happened that led to downtime

Why did X cause downtime?

...

What went well

Identify the things that worked well

What can be improved

Using the root cause analysis, explain what things can be improved.

Corrective actions

https://gitlab.com/gitlab-com/infrastructure/issues/3381
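
As an illustration only, and not taken from the linked issue, the sketch below shows one way a deploy script could warn when a Chef role still pins an older version than the one being deployed. The role JSON layout, the attribute path under default_attributes, and the full version strings are assumptions; the real roles may be structured differently.

```python
import json

# Hypothetical role document. The attribute path that holds the version pin
# is an assumption, not the layout of the real *-omnibus-version roles.
ROLE_JSON = """
{
  "name": "staging-omnibus-version",
  "default_attributes": {
    "omnibus-gitlab": {"package": {"version": "10.2.3-ee.0"}}
  }
}
"""

VERSION_PATH = ("default_attributes", "omnibus-gitlab", "package", "version")


def pinned_version(role_text, path=VERSION_PATH):
    """Walk the attribute path and return the pinned version, or None if absent."""
    node = json.loads(role_text)
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_role_pin(role_text, deployed_version):
    """Return a warning string if the role pin differs from the deploy, else None."""
    pinned = pinned_version(role_text)
    if pinned is None:
        return None  # no pin found at the assumed path
    if pinned != deployed_version:
        return f"role pins {pinned}, deploy installs {deployed_version}"
    return None


if __name__ == "__main__":
    warning = check_role_pin(ROLE_JSON, "10.2.4-ee.0")
    if warning:
        print("WARNING:", warning)
```

Running a check like this before restarting chef-client would surface the mismatch at deploy time instead of hours later.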

Edited Dec 13, 2017 by Ilya Frolov