# Git-over-SSH operations were failing during the 11.0-rc5 deploy
## Context
An incompatibility between the production and canary state saved in Redis likely caused git-over-SSH operations to fail.
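The error signature in the logs (see the timeline below) is consistent with an `OpenStruct`-style fallback settings object built from a stale cached hash. The following is a minimal sketch of that failure mode, not GitLab's actual `Gitlab::FakeApplicationSettings` implementation: unknown attribute reads return `nil` silently, but a call that passes an argument, such as `key_restriction_for(:rsa)`, cannot be treated as an attribute read and raises `NoMethodError`.

```ruby
require 'ostruct'

# Minimal sketch of the suspected failure mode. FakeSettings is a stand-in
# for illustration, not GitLab's Gitlab::FakeApplicationSettings.
class FakeSettings < OpenStruct; end

# A settings hash as an older release might have cached it, before the
# key-restriction attributes/methods existed.
stale = FakeSettings.new(signup_enabled: true)

stale.signup_enabled            # => true (present in the cached hash)
stale.password_minimum_length   # => nil  (unknown getter returns nil silently)
stale.key_restriction_for(:rsa) # => NoMethodError: undefined method
                                #    `key_restriction_for' for #<FakeSettings ...>
```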
## Timeline
Date: 2018-06-07
- 09:10 UTC - The deploy of 11.0-rc5 to production was resumed (it had been paused by an unrelated DB hiccup)
- 09:18 UTC - First Sentry errors
- 09:26 UTC - We're receiving reports of users unable to pull or push over SSH
- 09:35 UTC - We see a lot of `undefined method 'key_restriction_for' for #<Gitlab::FakeApplicationSettings` errors in the git node logs
- 09:49 UTC - Speculation that this is a caching problem, possibly because we skipped deploying to canary first (see the verification sketch after this timeline)
- 09:55 UTC - We start deploying to canary
- 09:57 UTC - Further speculation that this could be caused by a configuration error; that turned out not to be the case
- 09:59 UTC - _Filipa Lacerda_ started the deploy of GitLab *v11.0.0-rc5.ee.0* to *canary*
- 10:02 UTC - _James Lopez_: post-deployment migrations took *12 minutes* to run on *production*
- 10:03 UTC - Sentry errors stopped
- 10:04 UTC - _James Lopez_ finished deploying GitLab *v11.0.0-rc5.ee.0* to *production* after 1.2 hours
- 10:07 UTC - _Filipa Lacerda_ finished deploying GitLab *v11.0.0-rc5.ee.0* to *canary* after 7 minutes
- 10:09 UTC - Errors are trending down
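For the 09:49 caching hypothesis, a check along these lines could confirm whether the cached settings predate the new code. This is a hedged sketch: the cache key name, the bare Redis connection, and the string match are all assumptions, not GitLab's actual cache layout.

```ruby
require 'redis'

# Illustrative check only; the key name and serialization format are assumed.
redis = Redis.new

raw = redis.get('application_settings')
if raw.nil?
  puts 'no cached application settings'
elsif raw.include?('key_restriction')
  puts 'cache mentions key_restriction'
else
  # A cache written by the previous release would not mention the newly
  # added key-restriction attribute at all.
  puts 'stale cache: key_restriction missing'
end
```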
## Incident Analysis
- How was the incident detected?
  - Reports from users
- Is there anything that could have been done to improve the time to detection?
  - We should have been paged about the increased number of errors, but the page didn't get through.
- How was the root cause discovered?
  - Checking the logs on the git nodes
- Was this incident triggered by a change?
  - Yes
- Was there an existing issue that would have either prevented this incident or reduced the impact?
## Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.

Start with the production incident and ask why it happened; once there is an explanation, keep asking why until you reach five whys. Five is not a hard rule, but it helps to keep questioning in order to get closer to the actual root cause. A single why may also have more than one answer; consider following the different branches.

A root cause can never be a person; the write-up has to refer to the system and the context rather than to specific actors.
For example:

At 00:00 UTC something happened that led to downtime.

- Why did X cause downtime?

...
## What went well
- Identify the things that worked well
## What can be improved
- Using the root cause analysis, explain what things can be improved.
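One concrete improvement that follows from the root cause: make the settings cache version-aware, so that two releases mid-deploy never share a serialized settings object. Below is a minimal sketch under that assumption; `APP_VERSION`, the key layout, and `load_settings_from_db` are illustrative names, not GitLab's actual API.

```ruby
require 'redis'
require 'json'

# Illustrative only: namespace the cached settings by the application version
# that wrote them, so old code never deserializes a hash written by new code
# (and vice versa). Each release warms its own key, and stale keys expire.
APP_VERSION = '11.0.0-rc5'
CACHE_KEY   = "application_settings:v#{APP_VERSION}"

def load_settings_from_db
  # Stub standing in for the real database read.
  { 'signup_enabled' => true, 'rsa_key_restriction' => 0 }
end

def cached_application_settings(redis)
  cached = redis.get(CACHE_KEY)
  return JSON.parse(cached) if cached

  settings = load_settings_from_db
  redis.set(CACHE_KEY, JSON.dump(settings), ex: 60) # re-read from DB every 60s
  settings
end

p cached_application_settings(Redis.new)
```

With a per-version key, a freshly deployed node warms its own cache on first read instead of deserializing a hash written by the previous release.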
## Corrective actions
- Issues labeled as corrective actions
## Guidelines
/label outage