# Git-over-SSH operations were failing during the 11.0-rc5 deploy
## Context
An incompatibility between the production and canary state saved in Redis likely caused git-over-SSH operations to fail.
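The error signature in the logs (see the timeline below) is consistent with an `OpenStruct`-style fallback settings object built from a stale cached hash. The following is a minimal sketch of that failure mode, not GitLab's actual `Gitlab::FakeApplicationSettings` implementation: unknown attribute reads return `nil` silently, but a call that passes an argument, such as `key_restriction_for(:rsa)`, cannot be treated as an attribute read and raises `NoMethodError`.

```ruby
require 'ostruct'

# Minimal sketch of the suspected failure mode. FakeSettings is a stand-in
# for illustration, not GitLab's Gitlab::FakeApplicationSettings.
class FakeSettings < OpenStruct; end

# A settings hash as an older release might have cached it, before the
# key-restriction attributes/methods existed.
stale = FakeSettings.new(signup_enabled: true)

stale.signup_enabled            # => true (present in the cached hash)
stale.password_minimum_length   # => nil  (unknown getter returns nil silently)
stale.key_restriction_for(:rsa) # => NoMethodError: undefined method
                                #    `key_restriction_for' for #<FakeSettings ...>
```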
## Timeline
Date: 2018-06-07
- 09:10 UTC - The deploy of 11.0-rc5 to production was resumed (it had been paused by an unrelated DB hiccup)
- 09:18 UTC - First Sentry errors
- 09:26 UTC - We're receiving reports of users unable to pull or push over SSH
- 09:35 UTC - We see a lot of `undefined method 'key_restriction_for' for #<Gitlab::FakeApplicationSettings` errors in the git node logs
- 09:49 UTC - Speculation that this is a caching problem, possibly because we skipped deploying to canary first (see the verification sketch after this timeline)
- 09:55 UTC - We start deploying to canary
- 09:57 UTC - Further speculation that this could be caused by a configuration error; that turned out not to be the case
- 09:59 UTC - _Filipa Lacerda_ started the deploy of GitLab *v11.0.0-rc5.ee.0* to *canary*
- 10:02 UTC - _James Lopez_: post-deployment migrations took *12 minutes* to run on *production*
- 10:03 UTC - Sentry errors stopped
- 10:04 UTC - _James Lopez_ finished deploying GitLab *v11.0.0-rc5.ee.0* to *production* after 1.2 hours
- 10:07 UTC - _Filipa Lacerda_ finished deploying GitLab *v11.0.0-rc5.ee.0* to *canary* after 7 minutes
- 10:09 UTC - Errors are trending down
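For the 09:49 caching hypothesis, a check along these lines could confirm whether the cached settings predate the new code. This is a hedged sketch: the cache key name, the bare Redis connection, and the string match are all assumptions, not GitLab's actual cache layout.

```ruby
require 'redis'

# Illustrative check only; the key name and serialization format are assumed.
redis = Redis.new

raw = redis.get('application_settings')
if raw.nil?
  puts 'no cached application settings'
elsif raw.include?('key_restriction')
  puts 'cache mentions key_restriction'
else
  # A cache written by the previous release would not mention the newly
  # added key-restriction attribute at all.
  puts 'stale cache: key_restriction missing'
end
```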
## Incident Analysis
- How was the incident detected?
  - Reports from users
- Is there anything that could have been done to improve the time to detection?
  - We should have been paged about the increased number of errors, but the page didn't get through.
- How was the root cause discovered?
  - Checking the logs on the git nodes
- Was this incident triggered by a change?
  - Yes
- Was there an existing issue that would have either prevented this incident or reduced the impact?
## Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.

Start with the production incident and ask why it happened; once there is an explanation, keep asking why until you reach five whys. Five is not a hard rule, but it helps to keep questioning in order to get closer to the actual root cause. A single why may also have more than one answer; consider following the different branches.

A root cause can never be a person; the write-up has to refer to the system and the context rather than to specific actors.
For example:

At 00:00 UTC something happened that led to downtime.

- Why did X cause downtime?

...
## What went well
- Identify the things that worked well
## What can be improved
- Using the root cause analysis, explain what things can be improved.
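One concrete improvement that follows from the root cause: make the settings cache version-aware, so that two releases mid-deploy never share a serialized settings object. Below is a minimal sketch under that assumption; `APP_VERSION`, the key layout, and `load_settings_from_db` are illustrative names, not GitLab's actual API.

```ruby
require 'redis'
require 'json'

# Illustrative only: namespace the cached settings by the application version
# that wrote them, so old code never deserializes a hash written by new code
# (and vice versa). Each release warms its own key, and stale keys expire.
APP_VERSION = '11.0.0-rc5'
CACHE_KEY   = "application_settings:v#{APP_VERSION}"

def load_settings_from_db
  # Stub standing in for the real database read.
  { 'signup_enabled' => true, 'rsa_key_restriction' => 0 }
end

def cached_application_settings(redis)
  cached = redis.get(CACHE_KEY)
  return JSON.parse(cached) if cached

  settings = load_settings_from_db
  redis.set(CACHE_KEY, JSON.dump(settings), ex: 60) # re-read from DB every 60s
  settings
end

p cached_application_settings(Redis.new)
```

With a per-version key, a freshly deployed node warms its own cache on first read instead of deserializing a hash written by the previous release.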
## Corrective actions
- Issues labeled as corrective actions
## Guidelines
/label outage