Database Outage on 2016/11/28 when `project_authorizations` had too much bloat
On Monday 2016/11/28 we had an outage that took GitLab.com down. The initial culprit was high database load; further investigation pointed to a large number of slow queries, which in turn pointed to a large amount of table bloat in the `project_authorizations` table.
The outage was resolved by issuing a `VACUUM FULL project_authorizations` command in the database, which immediately removed all the bloat and returned the DB to normal operational levels.
- 12:50 UTC - we got alerted by @jnijhof in Slack that the DB had a load of 400+
- 13:03 UTC - we SSH'd into the db1 server to try to figure out what the problem was. At this point we executed a set of queries to check for slow and/or blocked queries; there were a number of both all around.
- We then checked whether vacuum was running, and found that `VACUUM ANALYZE public.ci_runners` and `VACUUM ANALYZE public.user_activities` were executing, some of them for a while already. This turned out to be a red herring.
- We also saw that a large number of queries, circa 500 to 600, had been executing in the database for a long time. We decided to kill these slow queries to get the database into better shape and unblock vacuum so it could finish.
- After killing queries, the main offender among the slow queries turned out to be `SELECT "projects".* FROM "projects" INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"...`
- Still, load was not going down, so we kept killing queries without success. Our reasoning was that vacuum was blocking the database, as we had seen before, but even though we were killing queries constantly the situation was not improving.
- We left a `while true; do ./kill_blocking_queries; sleep 1; done` loop executing in a terminal to keep slow and blocking queries down while vacuum was doing its job.
- At some point all the vacuum processes finished; we cancelled the killing loop and load started climbing again.
- 13:40 UTC - Here we took a step back to try to understand what was actually going on, since vacuum was no longer to blame. Slow queries were still piling up and load was not going down.
- Looking at `htop` we realized that memory usage was actually low (32G out of the 100G assigned to shared buffers)
- We double checked this in the database itself by issuing `SHOW ALL;` and looking up the value: `shared_buffers | 112896MB`
- All the cores were at 100% usage in user space.
- We checked the PostgreSQL graphs and followed the lead of the `project_authorizations` table having spiked to 8M dead tuples at 12:40 UTC, even though the graph was showing a low number by then.
- We started taking a wider look at the graphs and realized that the database was using 220GB of storage space, notably higher than the ~180GB we were used to.
- 13:49 UTC - After some discussion we decided to issue a `VACUUM FULL` on `project_authorizations`, as it was the main offender and was not being vacuumed for some reason.
- 13:50 UTC
  - storage usage in the database drops by 20GB,
  - load normalizes at 20 and connections start to recover,
  - the slow query count goes away, with only the vacuum process showing as slow (normal behavior).
- We remove the killing scripts and monitor the database for a minute to check that it's responding correctly.
- Everything is calm; the outage is resolved.
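The slow-query triage and killing described above can be done with the standard `pg_stat_activity` view. A sketch of the kind of queries involved (the one-minute and five-minute thresholds are illustrative examples, not necessarily what our scripts used, and the column names assume PostgreSQL 9.2+):

```sql
-- List long-running, non-idle queries, longest first
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;

-- Terminate backends running for over five minutes,
-- leaving vacuum sessions alone so they can finish
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
  AND query NOT ILIKE '%vacuum%';
```

`pg_cancel_backend(pid)` is the gentler alternative: it cancels the running query without dropping the client connection.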
- [Graph: how storage dropped when the outage was resolved]
- [Graph: general view of the timeline]
- [Graph: DB1 host metrics]
- [Graph: wide view of DB1 storage usage - in line with what was going on with dead tuples, which makes the 8M dead tuples spike look like a glitch]
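Dead tuple counts like the ones graphed above can also be pulled directly from the statistics collector; a sketch:

```sql
-- Tables with the most dead tuples, plus when (auto)vacuum last touched them
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```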
What went wrong
- Between the point when the high load started (12:43), through the point when I was alerted by @jnijhof (12:50), to the point when we got a Pingdom page (13:08), there were 25 minutes of extremely high latency and downtime. Our paging sucked here.
- There were no helper scripts on the database servers; since we changed database servers recently we lost all the scripts we used to keep in `/root` on those hosts.
- We had no way to clearly measure how much table bloat there was; we need a way of getting visibility into this.
- We were originally not aware that `VACUUM FULL` performs an aggressive storage reclaim, rewriting the table and fixing fragmentation.
- We don't have `pg_repack` available in the new database to use in case it is necessary in a similar situation.
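On the bloat-visibility point above: the `pgstattuple` extension gives exact numbers, at the cost of a full table scan, so it is not free on a busy primary. A sketch, assuming the extension can be installed:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- dead_tuple_percent and free_percent together indicate how bloated the table is
SELECT table_len, tuple_percent, dead_tuple_percent, free_percent
FROM pgstattuple('public.project_authorizations');
```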
What can be better
- Add alerts for database high load
- Add a page for when GitLab.com stops responding completely
- Add alert for high number of slow queries in the database
- Add alert for high rate of dead tuples per minute (10k for more than 5 minutes is a lot and will bring up issues like this again)
- Add low level host metrics to the database dashboard
- Add availability graphs to the fleet overview dashboard for clarity
- Add the database helper scripts to the Chef recipe so we can always find them and don't need to build them in the middle of an outage.
- Add the `pg_repack` extension to the new databases to have it as an extra tool for this kind of issue.
- Update the "PG under heavy load" runbook with all the findings from this outage to keep it up to date.
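For reference, once `pg_repack` is installed it can rebuild a bloated table online, unlike `VACUUM FULL`, which holds an `ACCESS EXCLUSIVE` lock on the table for the duration. A sketch of the invocation (the database name here is a placeholder, not necessarily ours):

```
# Rebuild a single table online; heavy locks are only taken briefly at the start and end
pg_repack --dbname=gitlabhq_production --table=project_authorizations
```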