GitLab Outage 2018-06-12 during planned maintenance
Issue has been fully resolved and database replication is back to full operation.
- Executive Summary: https://docs.google.com/document/d/1Zxl7bNXgIA5PpHonY__4hOPz8Mhfs0P474zbNisQBCw/
- Working Document: https://docs.google.com/document/d/1WmOMKq63Rap2wfQ-yy7sHyocFvmbJgIt1UIj2xEXEco/
Incident Analysis and Root Cause Analysis scheduled for tomorrow, 2018-06-13 (as well as Improvements and Corrective Actions).
While executing a scheduled change on the database servers (database/93), the automated script that performed the change experienced errors that it did not recover from. In manually addressing these, the secondary databases became corrupted, causing them to stop replicating; all queries were then sent to the primary database, causing contention visible as 500 errors on the site. Once the secondaries were fully recovered, traffic to GitLab.com started flowing normally (albeit slower than normal as replication caught up).
On date: 2018-06-12
- 14:30 UTC - change database/93 is executed, but it fails to run properly because a disk could not be unmounted; disks are unmounted manually to perform the work: the primary database restarts successfully, but the secondaries do not
- 14:40 UTC - more staff join to troubleshoot the incident
- 15:08 UTC - secondaries are taken out of rotation because of WAL segment corruption
- 15:12 UTC - attempted to restart the secondaries, but they fail to start; suspect the disk unmounting caused corruption
- 15:28 UTC - some WAL files are found to be missing on the primary database server
- 15:30 UTC - WAL file MD5 checksums differ between the primary and the secondaries
- 15:31 UTC - MD5 checksums of the WAL files on the GCP secondaries match the primary, so those files are copied to the broken secondaries
- 15:32 UTC - the same errors are reported as the databases refuse to restart
- 15:40 UTC - start evaluating alternatives to restore the secondaries from the primary: place the primary in backup mode and rsync over; snapshot volumes as an alternate approach
- 16:08 UTC - notice load is down across the fleet
- 16:09 UTC - tuning of pgbouncer to increase default pool size
- 16:11 UTC - ready to create snapshots, so primary put in backup mode
- 16:15 UTC - snapshots complete; work on secondaries progresses, introducing new disks into database servers with newly minted backups
- 16:56 UTC - replication is catching up
- 16:59 UTC - still observing 500 errors on the site
- 17:03 UTC - health checks failing across the board
- 17:08 UTC - postgres-04 finally starts after turning off backup mode
- 17:11 UTC - noticed Consul is down
- 17:19 UTC - pgbouncer looking healthy
- 17:24 UTC - after a restart of the unicorns, requests seem to be queueing while workers are underutilized
- 17:31 UTC - unicorns are trying to talk to postgres-03, which is still coming online
- 17:39 UTC - realized unicorn is not failing fast when it fails to connect to postgres-03, which causes pile-on
- 17:50 UTC - postgres-03 is back, which returns the site to normal
- 18:08 UTC - kick off
- 18:34 UTC - postgres load spikes
- 18:37 UTC - vacuuming isn't cleaning up tuples, which is not an immediate threat to the health of the site but will eventually slow down queries
- 18:45 UTC - vacuuming continues to not work
- 18:57 UTC - start manually vacuuming
- 19:24 UTC - manual vacuuming is helping
- 20:22 UTC - Xmin goes back to its old value
- 20:30 UTC - pgbackup restarted
- 20:36 UTC - discussion of the current situation concludes that it is not an immediate threat: agree to finish the backup and recreate the secondaries tomorrow
- 22:00 UTC - Xmin problem resolved after removing replication slots on secondaries
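The Xmin entries at the end of the timeline stem from how replication slots work: a slot with no connected consumer still advertises its xmin to the server, so vacuum cannot remove any tuple newer than that horizon, which is the "vacuuming isn't cleaning up tuples" symptom above. A minimal sketch of the triage logic, assuming slot rows have been fetched from `pg_replication_slots` (the in-memory rows and slot names below are illustrative, not the real fleet values):

```python
# Sketch: identify inactive replication slots that pin xmin and block vacuum.
# The rows would come from:  SELECT slot_name, active FROM pg_replication_slots;
# and each candidate would be dropped with:
#   SELECT pg_drop_replication_slot('<name>');

def stale_slot_names(slot_rows):
    """Return names of slots with no connected consumer.

    An inactive slot still holds its xmin horizon, so autovacuum cannot
    clean up dead tuples created after that point.
    """
    return [row["slot_name"] for row in slot_rows if not row["active"]]


rows = [
    {"slot_name": "postgres_03", "active": False},  # crashed secondary
    {"slot_name": "postgres_04", "active": True},   # healthy secondary
]
print(stale_slot_names(rows))  # -> ['postgres_03']
```

Dropping the stale slot releases the xmin horizon, after which vacuum can make progress again.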
This incident was really many small incidents happening in parallel and sequentially. They can be broken down broadly into several themes:
Planned maintenance that failed and led to data corruption on our Postgres secondaries
Could we have tested the bash script?
Could we have run the script on each secondary one at a time, isolating the fallout if the script malfunctioned?
Was the script reviewed by a second person before it was run?
Were execution steps written down?
Were the execution steps run on staging, before production?
The lead time in re-establishing a new secondary postgres instance
Is there a way to automate this and/or speed it up greatly?
We also learned a great deal about how to snapshot a primary to create a new secondary!
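For the snapshot-to-secondary flow, the restored data directory needs recovery configuration before it can start streaming from the primary again. A minimal sketch of rendering that file, assuming PostgreSQL 9.x-era `recovery.conf` conventions (the host, user, and slot names are illustrative, not the real fleet values):

```python
# Sketch: render the recovery.conf that a snapshot-restored data directory
# needs before it can begin streaming replication from the primary again.
# Assumes PostgreSQL 9.x-style recovery.conf; values are illustrative.

def render_recovery_conf(primary_host, slot_name, user="replication"):
    """Build the minimal recovery.conf contents for a new secondary."""
    return "\n".join([
        "standby_mode = 'on'",
        f"primary_conninfo = 'host={primary_host} user={user}'",
        f"primary_slot_name = '{slot_name}'",
    ]) + "\n"


print(render_recovery_conf("primary.example.internal", "postgres_03"))
```

Automating this rendering (plus the snapshot restore itself) is one concrete way to shrink the lead time for re-establishing a secondary.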
Pgbouncer connection pooling issues
Once the secondary instances had failed and all traffic was directed to the primary, I could not find evidence of the primary being overloaded. Transaction counts and load on the primary seemed to be down during that phase.
The latencies may have been the result of pgbouncer contention, rather than overloading on the machine
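The contention hypothesis is consistent with how pgbouncer pools connections: only `default_pool_size` server connections exist per database/user pair, and any clients beyond that queue inside pgbouncer, which shows up as latency even while the database itself is lightly loaded. A trivial sketch of that arithmetic (the numbers are illustrative):

```python
# Sketch: clients beyond the pgbouncer pool size queue inside pgbouncer
# rather than reaching Postgres, so latency rises while database load
# stays low. Numbers below are illustrative, not measured values.

def queued_clients(active_clients, pool_size):
    """Clients forced to wait for a free server connection."""
    return max(0, active_clients - pool_size)


print(queued_clients(active_clients=300, pool_size=100))  # -> 200
```

This matches the 16:09 UTC timeline entry, where the default pool size was raised to relieve exactly this kind of queueing.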
Unicorn queuing failing poorly
Instead of rejecting incoming requests, there's evidence that unicorn queued them up, in some cases for up to 20 minutes.
This means that clients would have disconnected, but their requests were still queued to be executed.
Once the pg/pgbouncer issues were resolved, the servers had to deal with a glut of zombie requests.
We should fail fast and fail predictably.
When a client disconnects, the queued request should be dropped
Any request queued for more than 1 minute should be rejected
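The "reject anything queued too long" rule can be implemented with the common pattern of stamping each request with its arrival time at the load balancer (the `X-Request-Start` convention) and dropping it at the app server if it has aged past a budget. A minimal sketch, where the 60-second threshold follows the rule above and the helper name is hypothetical:

```python
import time

# Sketch: fail fast on stale queued requests. The load balancer stamps each
# request with its arrival time; the app server rejects any request that has
# waited longer than the budget, since its client has likely disconnected.
# Threshold and function name are illustrative.

QUEUE_BUDGET_SECONDS = 60

def should_reject(request_start, now=None):
    """True if the request has been queued longer than the budget."""
    now = time.time() if now is None else now
    return (now - request_start) > QUEUE_BUDGET_SECONDS


# A request that sat in the queue for 20 minutes during the incident:
print(should_reject(request_start=0, now=1200))  # -> True
# A fresh request:
print(should_reject(request_start=0, now=5))     # -> False
```

Rejecting up front keeps workers free for live clients instead of serving a glut of zombie requests after recovery.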
PGBouncer client failing poorly
Postgres-03's pgbouncer service being down led clients to hang.
Instead of failing over away from the non-responsive pgbouncer, clients hung on it indefinitely.
This is exactly the opposite of what HA infrastructure should do. In this situation, more redundancy leads to a higher chance of outages (since the failure of any single service will hang everything). We want more redundancy to lead to a lower chance of failure.
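One general mitigation for the hang is a short connect timeout: an unresponsive endpoint then produces a quick error the client can react to (by trying the next endpoint) instead of blocking a worker indefinitely. A minimal sketch using a plain TCP connect check (host, port, and timeout are illustrative):

```python
import socket

# Sketch: fail fast when an endpoint (e.g. a pgbouncer) is unresponsive
# instead of hanging the whole worker. A short connect timeout turns
# "hang forever" into "error quickly, move on". Values are illustrative.

def try_connect(host, port, timeout=1.0):
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False  # refused, unreachable, or timed out: fail fast


# Nothing listens on this local port, so the call returns quickly:
print(try_connect("127.0.0.1", 1))  # -> False
```

With bounded failures like this, adding redundant endpoints lowers the chance of outage instead of raising it.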
Sentry could not handle the load
At various times questions were asked about the accuracy of dashboards. We need to be able to trust these during an incident
HAProxy was returning 500 errors when it had run out of backend services. This is the wrong status code; it should return 503 Service Unavailable.
Our external status page,
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post mortem.
For this it is necessary to start with the production incident and ask why it happened; once there is an explanation, keep iterating, asking why again, until we reach five whys.
It's not a hard rule that it has to be five times, but it helps to keep questioning in order to get deeper and find the actual root cause. Additionally, one why may produce more than one answer; consider following the different branches.
A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
- Issue labeled as corrective action