GitLab Outage 2018-06-12 during planned maintenance
WIP
Issue has been fully resolved and database replication is back to full operation.
- Executive Summary: https://docs.google.com/document/d/1Zxl7bNXgIA5PpHonY__4hOPz8Mhfs0P474zbNisQBCw/
- Working Document: https://docs.google.com/document/d/1WmOMKq63Rap2wfQ-yy7sHyocFvmbJgIt1UIj2xEXEco/
Incident Analysis and Root Cause Analysis scheduled for tomorrow, 2018-06-13 (as well as Improvements and Corrective Actions).
Context
While executing a scheduled change on the database servers (database/93), the automated script that performed the change experienced errors that it did not recover from. While these were being addressed manually, the secondary databases became corrupted, which stopped replication and sent all queries to the primary database; the resulting contention was visible as 500 errors on the site. Once the secondaries were fully recovered, traffic to GitLab.com started flowing normally (albeit more slowly than usual as replication caught up).
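For reference, a minimal sketch (not the tooling used during the incident) of how one can confirm from the primary whether any secondaries are still streaming, which is the state the fleet was in while the secondaries were corrupted. The host name and credentials are placeholders, and psycopg2 is assumed:

```python
# Minimal sketch: ask the primary which WAL senders (secondaries) are attached.
# Host name and credentials are placeholders; requires psycopg2 (pip install psycopg2-binary).
import psycopg2

PRIMARY_DSN = "host=primary.example.com dbname=postgres user=monitor connect_timeout=5"

def streaming_secondaries(dsn=PRIMARY_DSN):
    """Return (client_addr, state) for every WAL sender the primary knows about."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT client_addr, state FROM pg_stat_replication;")
        return cur.fetchall()

if __name__ == "__main__":
    rows = streaming_secondaries()
    if not rows:
        print("No secondaries attached -- all traffic is hitting the primary")
    for addr, state in rows:
        print(f"{addr}: {state}")
```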
Timeline
Date: 2018-06-12
- 14:30 UTC - change database/93 is executed, but it fails to run properly because a disk could not be unmounted; disks are unmounted manually to perform the work: the primary database restarts successfully, but the secondaries do not
- 14:40 UTC - more staff join to troubleshoot the incident
- 15:08 UTC - secondaries are taken out of rotation because of WAL segment corruption
- 15:12 UTC - attempted to restart secondaries, but they fail to do so; suspect disk unmounting caused corruption
- 15:28 UTC - some WAL files are found to be missing on the primary database server
- 15:30 UTC - MD5 checksums of WAL files differ between the primary and the secondaries (see the checksum sketch after this timeline)
- 15:31 UTC - MD5 checksums from the GCP secondaries match the primary, so those files are copied to the secondaries
- 15:32 UTC - the same errors are reported as the databases refuse to restart
- 15:40 UTC - start evaluating alternatives to restore the secondaries from the primary: place the primary in backup mode and rsync over, or snapshot the volumes as an alternate approach
- 16:08 UTC - notice load is down across the fleet
- 16:09 UTC - pgbouncer tuned to increase the default pool size
- 16:11 UTC - ready to create snapshots, so the primary is put in backup mode
- 16:15 UTC - snapshots complete; work on the secondaries progresses, introducing new disks into the database servers with the newly minted backups
- 16:56 UTC: replication is catching up
- 16:59 UTC - still observing 500 errors on the site
- 17:03 UTC - health checks failing across the board
- 17:08 UTC - postgres-04 finally starts after turning off backup mode
- 17:11 UTC - noticed Consul is down
- 17:19 UTC - pgbouncer looking healthy
- 17:24 UTC - after a restart of the unicorns, they seem to be queueing while workers are underutilized
- 17:31 UTC - unicorns trying to talk to postgres-03, which is still coming online
- 17:39 UTC - realized unicorn is not failing fast when it fails to connect to postgres-03, which causes pile-on
- 17:50 UTC - postgres-03 is back, which returns the site to normal
- 18:08 UTC - kick off pg_basebackup on postgres-02
- 18:34 UTC - postgres load spikes
- 18:37 UTC - vacuuming isn't cleaning up dead tuples, which is not an immediate threat to the health of the site but will eventually slow down queries
- 18:45 UTC - vacuuming continues to not work
- 18:57 UTC - start manually vacuuming
- 19:24 UTC - manual vacuuming is helping
- 20:22 UTC - hot_standby_feedback re-enabled, xmin goes back to its old value
- 20:30 UTC - pg_basebackup restarted
- 20:36 UTC - discussion of the current situation concludes that it is not an immediate threat: agree to finish the backup and recreate the secondaries tomorrow
- 22:00 UTC - xmin problem resolved after removing replication slots on the secondaries (see the replication-slot sketch after this timeline)
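The 15:28-15:32 WAL comparison was done by hand; a minimal sketch of the same kind of check, assuming both copies of the WAL directory are accessible on the local filesystem (paths are placeholders, not the ones used during the incident):

```python
# Compare MD5 checksums of WAL segments between two directories, e.g. the
# primary's pg_wal/pg_xlog and an rsync'ed copy from a secondary.
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_wal_dirs(primary_dir, secondary_dir):
    """Yield (segment, primary_md5, secondary_md5) for segments that differ or are missing."""
    primary = {p.name: md5sum(p) for p in Path(primary_dir).iterdir() if p.is_file()}
    secondary = {p.name: md5sum(p) for p in Path(secondary_dir).iterdir() if p.is_file()}
    for name in sorted(primary.keys() | secondary.keys()):
        a, b = primary.get(name), secondary.get(name)
        if a != b:
            yield name, a, b

if __name__ == "__main__":
    for name, a, b in compare_wal_dirs("/mnt/wal-primary", "/mnt/wal-secondary"):
        print(f"{name}: primary={a} secondary={b}")
```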
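The 18:37-22:00 entries revolve around vacuum not reclaiming dead tuples because xmin was pinned. A hedged sketch of how to spot replication slots holding xmin back (the DSN is a placeholder; requires psycopg2):

```python
# List replication slots and the xmin each one pins. A stale, inactive slot
# holding an old xmin is what prevents vacuum from removing dead tuples; once
# identified, it can be dropped with: SELECT pg_drop_replication_slot('<slot_name>');
import psycopg2

def slot_xmins(dsn="host=primary.example.com dbname=postgres user=monitor"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT slot_name, active, xmin, catalog_xmin FROM pg_replication_slots;")
        return cur.fetchall()

if __name__ == "__main__":
    for slot_name, active, xmin, catalog_xmin in slot_xmins():
        print(f"{slot_name}: active={active} xmin={xmin} catalog_xmin={catalog_xmin}")
```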
Themes
This incident was really many small incidents happening in parallel and sequentially. They can be broken down broadly into several themes:
(copied from https://gitlab.com/gitlab-com/infrastructure/issues/4380#note_81232931)
- Planned maintenance that failed and led to data corruption on our Postgres secondaries
  - Could we have tested the bash script?
  - Could we have run the script on each secondary one at a time and isolated the fallout if the script malfunctioned?
  - Was the script reviewed by a second person before it was run?
  - Were execution steps written down?
  - Were the execution steps run on staging before production?
- The lead time in re-establishing a new secondary postgres instance
  - Is there a way to automate this and/or speed it up greatly? (see the pg_basebackup sketch after this list)
  - We also learned a great deal about how to snapshot a primary to create a new secondary!
- Pgbouncer connection pooling issues
  - Once the secondary instances had failed and all traffic was directed to the primary, I could not find evidence of the primary being overloaded. Transaction counts and load on the primary seemed to be down during that phase.
  - The latencies may have been the result of pgbouncer contention rather than overload on the machine (see the pool-inspection sketch after this list)
- Unicorn queuing failing poorly
  - Instead of rejecting incoming requests, there is evidence that unicorn queued them up, in some cases for up to 20 minutes.
  - This means that clients would have disconnected, but their requests were still queued to be executed.
  - Once the pg/pgbouncer issues were resolved, the servers had to deal with a glut of zombie requests.
  - We should fail fast and fail predictably (see the queue-timeout sketch after this list).
  - When a client disconnects, the queued request should be dropped
  - Any request queued for more than 1 minute should be rejected
- PGBouncer client failing poorly
  - postgres-03's pgbouncer service being down led clients to hang
  - Instead of rejecting the non-responsive pgbouncer, clients hung on it (see the connection-timeout sketch after this list)
  - This is exactly the opposite of what HA infrastructure should do. In this situation, more redundancy leads to a higher chance of outage (since the failure of any single service will hang everything). We want more redundancy to lead to a lower chance of failure.
- Observability Failures
  - Sentry could not handle the load
  - At various times questions were asked about the accuracy of dashboards. We need to be able to trust these during an incident.
  - HAProxy was returning 500 errors when it had run out of backend services. This is the wrong status code (503 Service Unavailable would be correct).
  - Our external status page, status.gitlab.com, went down
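On the lead time for rebuilding a secondary: a sketch of what automating the pg_basebackup path could look like. This is not the procedure that was actually run; the host name, data directory, and replication user are placeholders.

```python
# Re-seed a secondary from the primary with pg_basebackup (PostgreSQL 9.6/10-era
# tooling, where -R writes recovery settings pointing back at the primary).
import subprocess

def reseed_secondary(primary_host,
                     data_dir="/var/opt/gitlab/postgresql/data",
                     user="gitlab_replicator"):
    """Stream a fresh base backup from the primary into an empty data directory."""
    subprocess.run(
        [
            "pg_basebackup",
            "-h", primary_host,   # primary to copy from
            "-U", user,           # replication user
            "-D", data_dir,       # must be empty before the copy starts
            "-X", "stream",       # stream WAL while the backup runs
            "-R",                 # write recovery settings for the new standby
            "-P",                 # show progress
        ],
        check=True,
    )

# Example: reseed_secondary("primary.example.com")
```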
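On pgbouncer contention: a minimal pool-inspection sketch, assuming access to pgbouncer's admin console on port 6432 (host and user are placeholders; requires psycopg2). A non-zero cl_waiting count is the signal that clients are queueing for a server connection rather than the machine being overloaded.

```python
# Query pgbouncer's admin console for per-pool client/server counts.
import psycopg2

def pool_stats(host="pgbouncer.example.com", user="pgbouncer"):
    conn = psycopg2.connect(host=host, port=6432, dbname="pgbouncer", user=user)
    conn.autocommit = True  # the admin console does not support transactions
    with conn.cursor() as cur:
        cur.execute("SHOW POOLS;")
        cols = [col.name for col in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

if __name__ == "__main__":
    for pool in pool_stats():
        print(pool["database"], "active:", pool["cl_active"], "waiting:", pool["cl_waiting"])
```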
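On the queue-timeout idea: unicorn is a Ruby server, so as a language-neutral illustration here is a Python/WSGI sketch of "reject anything that has been queued longer than a minute". The X-Request-Start header name, its format (seconds since the epoch stamped by the proxy), and the 60-second limit are assumptions, not GitLab's actual configuration.

```python
# WSGI middleware that rejects requests which sat in the queue too long,
# instead of doing the work for a client that has likely disconnected.
import time

QUEUE_LIMIT = 60  # seconds a request may wait before being rejected

def reject_stale_requests(app):
    def middleware(environ, start_response):
        started = environ.get("HTTP_X_REQUEST_START", "").lstrip("t=")
        try:
            queued_for = time.time() - float(started)
        except ValueError:
            queued_for = 0.0  # no usable timestamp: let the request through
        if queued_for > QUEUE_LIMIT:
            start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
            return [b"Request spent too long in queue\n"]
        return app(environ, start_response)
    return middleware
```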
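On clients hanging against a dead pgbouncer: a hedged connection-timeout sketch showing the fail-fast behaviour. The host list, port, database name, and failover order are illustrative, not GitLab's configuration; requires psycopg2.

```python
# Bound how long we wait for each pgbouncer node before moving on,
# instead of hanging indefinitely on an unresponsive one.
import psycopg2

PGBOUNCER_HOSTS = ["postgres-01.example.com", "postgres-02.example.com"]

def connect_with_failover(dbname="gitlabhq_production", user="gitlab", timeout=3):
    last_error = None
    for host in PGBOUNCER_HOSTS:
        try:
            return psycopg2.connect(
                host=host, port=6432, dbname=dbname, user=user,
                connect_timeout=timeout,  # give up quickly on a hung pgbouncer
            )
        except psycopg2.OperationalError as err:
            last_error = err  # try the next node instead of hanging
    raise last_error
```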
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
Yes: database/93
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start with the production incident and question why it happened; once there is an explanation, keep asking why until you reach five whys.
It is not a hard rule that it has to be exactly five, but it helps to keep questioning in order to get deeper toward the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- Issue labeled as corrective action