Outage of GitLab.com due to Database Host Restart
## Context

GitLab.com went down because a _follower_ database node was restarted by the provider.

## Timeline

2018-03-29 (all times UTC):

- 21:35 - `postgres-04.db.prd.gitlab.com` went down, causing 500s on GitLab.com
- 21:42 - GitLab.com recovered from the database outage

## Incident Analysis

- How was the incident detected?
  - Alerting notification from Prometheus
- Is there anything that could have been done to improve the time to detection?
  - No
- How was the root cause discovered?
  - An error in the DB follower code allowed a state in which clients were not updated about the follower's removal in a timely manner (a hypothetical sketch of prompt follower eviction appears at the end of this document).
- Was this incident triggered by a change?
  - No
- Was there an existing issue that would have either prevented this incident or reduced its impact?
  - No

## Root Cause Analysis

Follow the 5 whys, in a blameless manner, as the core of the postmortem. Start with the production incident and question why it happened; once there is an explanation, keep iterating and asking why until you reach five whys. Five is not a hard rule, but it helps to keep questioning in order to dig deeper toward the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.

A root cause can never be a person: the write-up must refer to the system and the context rather than to the specific actors. For example:

At 00:00 UTC something happened that led to downtime.

- Why did X cause downtime? ...

## What went well

- Identify the things that worked well.

## What can be improved

- Using the root cause analysis, explain what can be improved.

## Corrective actions

- <Bare Issue link>
- Issue labeled as infrastructure~2132984

## Guidelines

* [Blameless Postmortems Guideline](https://about.gitlab.com/handbook/infrastructure/#postmortems)
* [5 whys](https://en.wikipedia.org/wiki/5_Whys)
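## Appendix: Illustrative sketch of follower eviction

The root cause above notes that clients were not updated about the follower's removal in a timely manner. The following is a minimal, hypothetical sketch only: it is not GitLab's actual database load-balancing code, and `FollowerPool`, the TCP-connect health check, and all intervals are assumptions made for illustration. It shows the general idea of a client-side pool that evicts an unreachable follower promptly so reads stop routing to it.

```python
# Hypothetical sketch -- NOT GitLab's actual DB load-balancing code.
# Idea: a background health check evicts a dead follower from the
# read pool quickly, instead of clients relying on a stale list.
import random
import socket
import threading
import time


class FollowerPool:
    def __init__(self, hosts, port=5432, check_interval=5.0, timeout=1.0):
        self.port = port
        self.check_interval = check_interval
        self.timeout = timeout
        self._lock = threading.Lock()
        self._all = list(hosts)
        self._healthy = set(hosts)

    def _reachable(self, host):
        # A plain TCP connect stands in for a real health check
        # (e.g. running `SELECT 1` against the replica).
        try:
            with socket.create_connection((host, self.port), self.timeout):
                return True
        except OSError:
            return False

    def _check_loop(self):
        while True:
            for host in self._all:
                ok = self._reachable(host)
                with self._lock:
                    if ok:
                        self._healthy.add(host)
                    else:
                        # Evict promptly so reads stop routing here.
                        self._healthy.discard(host)
            time.sleep(self.check_interval)

    def start(self):
        # Daemon thread so the checker dies with the process.
        threading.Thread(target=self._check_loop, daemon=True).start()

    def pick(self):
        # Return a random healthy follower for the next read.
        with self._lock:
            candidates = list(self._healthy)
        if not candidates:
            raise RuntimeError("no healthy followers; fall back to primary")
        return random.choice(candidates)
```

A caller would construct the pool with its replica hostnames, call `start()` once, and route each read through `pick()`, falling back to the primary when no follower is healthy.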