GitLab Outage 2018-06-12 during planned maintenance
WIP
Issue has been fully resolved and database replication is back to full operation.
- Executive Summary: https://docs.google.com/document/d/1Zxl7bNXgIA5PpHonY__4hOPz8Mhfs0P474zbNisQBCw/
- Working Document: https://docs.google.com/document/d/1WmOMKq63Rap2wfQ-yy7sHyocFvmbJgIt1UIj2xEXEco/
Incident Analysis and Root Cause Analysis scheduled for tomorrow, 2018-06-13 (as well as Improvements and Corrective Actions).
Context
While executing a scheduled change on the database servers (database/93), the automated script that performed the change experienced errors that it did not recover from. While these were being addressed manually, the secondary databases became corrupted, which stopped replication and sent all queries to the primary database; the resulting contention was visible as 500 errors on the site. Once the secondaries were fully recovered, traffic to GitLab.com started flowing normally (albeit more slowly than usual as replication caught up).
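For reference, a minimal sketch (not the tooling used during the incident) of how one can confirm from the primary whether any secondaries are still streaming, which is the state the fleet was in while the secondaries were corrupted. The host name and credentials are placeholders, and psycopg2 is assumed:

```python
# Minimal sketch: ask the primary which WAL senders (secondaries) are attached.
# Host name and credentials are placeholders; requires psycopg2 (pip install psycopg2-binary).
import psycopg2

PRIMARY_DSN = "host=primary.example.com dbname=postgres user=monitor connect_timeout=5"

def streaming_secondaries(dsn=PRIMARY_DSN):
    """Return (client_addr, state) for every WAL sender the primary knows about."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT client_addr, state FROM pg_stat_replication;")
        return cur.fetchall()

if __name__ == "__main__":
    rows = streaming_secondaries()
    if not rows:
        print("No secondaries attached -- all traffic is hitting the primary")
    for addr, state in rows:
        print(f"{addr}: {state}")
```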
Timeline
Date: 2018-06-12
- 14:30 UTC - change database/93 is executed, but it fails to run properly because a disk could not be unmounted; disks are unmounted manually to perform the work: the primary database restarts successfully, but the secondaries do not
- 14:40 UTC - more staff join to troubleshoot the incident
- 15:08 UTC - secondaries are taken out of rotation because of WAL segment corruption
- 15:12 UTC - attempted to restart secondaries, but they fail to do so; suspect disk unmounting caused corruption
- 15:28 UTC - some WAL files are found to be missing on the primary database server
- 15:30 UTC - MD5 checksums of WAL files differ between the primary and the secondaries (see the checksum sketch after this timeline)
- 15:31 UTC - MD5 checksums from the GCP secondaries match the primary, so those files are copied to the secondaries
- 15:32 UTC - the same errors are reported as the databases refuse to restart
- 15:40 UTC - start evaluating alternatives to restore the secondaries from the primary: place the primary in backup mode and rsync over, or snapshot the volumes as an alternate approach
- 16:08 UTC - notice load is down across the fleet
- 16:09 UTC - pgbouncer tuned to increase the default pool size
- 16:11 UTC - ready to create snapshots, so the primary is put in backup mode
- 16:15 UTC - snapshots complete; work on the secondaries progresses, introducing new disks into the database servers with the newly minted backups
- 16:56 UTC: replication is catching up
- 16:59 UTC - still observing 500 errors on the site
- 17:03 UTC - health checks failing across the board
- 17:08 UTC - postgres-04 finally starts after turning off backup mode
- 17:11 UTC - noticed Consul is down
- 17:19 UTC - pgbouncer looking healthy
- 17:24 UTC - after a restart of the unicorns, they seem to be queueing while workers are underutilized
- 17:31 UTC - unicorns trying to talk to postgres-03, which is still coming online
- 17:39 UTC - realized unicorn is not failing fast when it fails to connect to postgres-03, which causes pile-on
- 17:50 UTC - postgres-03 is back, which returns the site to normal
- 18:08 UTC - kick off pg_basebackup on postgres-02
- 18:34 UTC - postgres load spikes
- 18:37 UTC - vacuuming isn't cleaning up dead tuples, which is not an immediate threat to the health of the site but will eventually slow down queries
- 18:45 UTC - vacuuming continues to not work
- 18:57 UTC - start manually vacuuming
- 19:24 UTC - manual vacuuming is helping
- 20:22 UTC - hot_standby_feedback re-enabled, xmin goes back to its old value
- 20:30 UTC - pg_basebackup restarted
- 20:36 UTC - discussion of the current situation concludes that it is not an immediate threat: agree to finish the backup and recreate the secondaries tomorrow
- 22:00 UTC - xmin problem resolved after removing replication slots on the secondaries (see the replication-slot sketch after this timeline)
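The 15:28-15:32 WAL comparison was done by hand; a minimal sketch of the same kind of check, assuming both copies of the WAL directory are accessible on the local filesystem (paths are placeholders, not the ones used during the incident):

```python
# Compare MD5 checksums of WAL segments between two directories, e.g. the
# primary's pg_wal/pg_xlog and an rsync'ed copy from a secondary.
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_wal_dirs(primary_dir, secondary_dir):
    """Yield (segment, primary_md5, secondary_md5) for segments that differ or are missing."""
    primary = {p.name: md5sum(p) for p in Path(primary_dir).iterdir() if p.is_file()}
    secondary = {p.name: md5sum(p) for p in Path(secondary_dir).iterdir() if p.is_file()}
    for name in sorted(primary.keys() | secondary.keys()):
        a, b = primary.get(name), secondary.get(name)
        if a != b:
            yield name, a, b

if __name__ == "__main__":
    for name, a, b in compare_wal_dirs("/mnt/wal-primary", "/mnt/wal-secondary"):
        print(f"{name}: primary={a} secondary={b}")
```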
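The 18:37-22:00 entries revolve around vacuum not reclaiming dead tuples because xmin was pinned. A hedged sketch of how to spot replication slots holding xmin back (the DSN is a placeholder; requires psycopg2):

```python
# List replication slots and the xmin each one pins. A stale, inactive slot
# holding an old xmin is what prevents vacuum from removing dead tuples; once
# identified, it can be dropped with: SELECT pg_drop_replication_slot('<slot_name>');
import psycopg2

def slot_xmins(dsn="host=primary.example.com dbname=postgres user=monitor"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT slot_name, active, xmin, catalog_xmin FROM pg_replication_slots;")
        return cur.fetchall()

if __name__ == "__main__":
    for slot_name, active, xmin, catalog_xmin in slot_xmins():
        print(f"{slot_name}: active={active} xmin={xmin} catalog_xmin={catalog_xmin}")
```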
Themes
This incident was really many small incidents happening in parallel and sequentially. They can be broken down broadly into several themes:
(copied from https://gitlab.com/gitlab-com/infrastructure/issues/4380#note_81232931)
- Planned maintenance that failed and led to data corruption on our Postgres secondaries
  - Could we have tested the bash script?
  - Could we have run the script on each secondary one at a time and isolated the fallout if the script malfunctioned?
  - Was the script reviewed by a second person before it was run?
  - Were execution steps written down?
  - Were the execution steps run on staging before production?
- The lead time in re-establishing a new secondary postgres instance
  - Is there a way to automate this and/or speed it up greatly? (see the pg_basebackup sketch after this list)
  - We also learned a great deal about how to snapshot a primary to create a new secondary!
- Pgbouncer connection pooling issues
  - Once the secondary instances had failed and all traffic was directed to the primary, I could not find evidence of the primary being overloaded. Transaction counts and load on the primary seemed to be down during that phase.
  - The latencies may have been the result of pgbouncer contention rather than overload on the machine (see the pool-inspection sketch after this list)
- Unicorn queuing failing poorly
  - Instead of rejecting incoming requests, there is evidence that unicorn queued them up, in some cases for up to 20 minutes.
  - This means that clients would have disconnected, but their requests were still queued to be executed.
  - Once the pg/pgbouncer issues were resolved, the servers had to deal with a glut of zombie requests.
  - We should fail fast and fail predictably (see the queue-timeout sketch after this list).
  - When a client disconnects, the queued request should be dropped
  - Any request queued for more than 1 minute should be rejected
- PGBouncer client failing poorly
  - postgres-03's pgbouncer service being down led clients to hang
  - Instead of rejecting the non-responsive pgbouncer, clients hung on it (see the connection-timeout sketch after this list)
  - This is exactly the opposite of what HA infrastructure should do. In this situation, more redundancy leads to a higher chance of outage (since the failure of any single service will hang everything). We want more redundancy to lead to a lower chance of failure.
- Observability Failures
  - Sentry could not handle the load
  - At various times questions were asked about the accuracy of dashboards. We need to be able to trust these during an incident.
  - HAProxy was returning 500 errors when it had run out of backend services. This is the wrong status code (503 Service Unavailable would be correct).
  - Our external status page, status.gitlab.com, went down
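On the lead time for rebuilding a secondary: a sketch of what automating the pg_basebackup path could look like. This is not the procedure that was actually run; the host name, data directory, and replication user are placeholders.

```python
# Re-seed a secondary from the primary with pg_basebackup (PostgreSQL 9.6/10-era
# tooling, where -R writes recovery settings pointing back at the primary).
import subprocess

def reseed_secondary(primary_host,
                     data_dir="/var/opt/gitlab/postgresql/data",
                     user="gitlab_replicator"):
    """Stream a fresh base backup from the primary into an empty data directory."""
    subprocess.run(
        [
            "pg_basebackup",
            "-h", primary_host,   # primary to copy from
            "-U", user,           # replication user
            "-D", data_dir,       # must be empty before the copy starts
            "-X", "stream",       # stream WAL while the backup runs
            "-R",                 # write recovery settings for the new standby
            "-P",                 # show progress
        ],
        check=True,
    )

# Example: reseed_secondary("primary.example.com")
```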
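On pgbouncer contention: a minimal pool-inspection sketch, assuming access to pgbouncer's admin console on port 6432 (host and user are placeholders; requires psycopg2). A non-zero cl_waiting count is the signal that clients are queueing for a server connection rather than the machine being overloaded.

```python
# Query pgbouncer's admin console for per-pool client/server counts.
import psycopg2

def pool_stats(host="pgbouncer.example.com", user="pgbouncer"):
    conn = psycopg2.connect(host=host, port=6432, dbname="pgbouncer", user=user)
    conn.autocommit = True  # the admin console does not support transactions
    with conn.cursor() as cur:
        cur.execute("SHOW POOLS;")
        cols = [col.name for col in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

if __name__ == "__main__":
    for pool in pool_stats():
        print(pool["database"], "active:", pool["cl_active"], "waiting:", pool["cl_waiting"])
```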
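On the queue-timeout idea: unicorn is a Ruby server, so as a language-neutral illustration here is a Python/WSGI sketch of "reject anything that has been queued longer than a minute". The X-Request-Start header name, its format (seconds since the epoch stamped by the proxy), and the 60-second limit are assumptions, not GitLab's actual configuration.

```python
# WSGI middleware that rejects requests which sat in the queue too long,
# instead of doing the work for a client that has likely disconnected.
import time

QUEUE_LIMIT = 60  # seconds a request may wait before being rejected

def reject_stale_requests(app):
    def middleware(environ, start_response):
        started = environ.get("HTTP_X_REQUEST_START", "").lstrip("t=")
        try:
            queued_for = time.time() - float(started)
        except ValueError:
            queued_for = 0.0  # no usable timestamp: let the request through
        if queued_for > QUEUE_LIMIT:
            start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
            return [b"Request spent too long in queue\n"]
        return app(environ, start_response)
    return middleware
```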
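On clients hanging against a dead pgbouncer: a hedged connection-timeout sketch showing the fail-fast behaviour. The host list, port, database name, and failover order are illustrative, not GitLab's configuration; requires psycopg2.

```python
# Bound how long we wait for each pgbouncer node before moving on,
# instead of hanging indefinitely on an unresponsive one.
import psycopg2

PGBOUNCER_HOSTS = ["postgres-01.example.com", "postgres-02.example.com"]

def connect_with_failover(dbname="gitlabhq_production", user="gitlab", timeout=3):
    last_error = None
    for host in PGBOUNCER_HOSTS:
        try:
            return psycopg2.connect(
                host=host, port=6432, dbname=dbname, user=user,
                connect_timeout=timeout,  # give up quickly on a hung pgbouncer
            )
        except psycopg2.OperationalError as err:
            last_error = err  # try the next node instead of hanging
    raise last_error
```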
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
Yes: database/93
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start with the production incident and question why it happened; once there is an explanation, keep asking why until you reach five whys.
It is not a hard rule that it has to be exactly five, but it helps to keep questioning in order to get deeper toward the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- Issue labeled as corrective action