Database Outage on 2016/11/28 when project_authorizations had too much bloat
On Monday 2016/11/28 we had an outage that took GitLab.com down. The initial culprit was high database load, further investigation pointed in the direction of a large amount of slow queries which eventually pointed in the direction of a large amount of table bloat in the project_authorizations table.
The outage was resolved by issuing a
VACUUM FULL project_authorizations command in the database, which instantly removed all the bloat away and returned the DB to normal operational levels.
- 12:50 UTC - we got alerted by @jnijhof in slack that the DB had a load of 400+
- 13:03 UTC - we ssh'd into the db1 server to try to rule out what the problem was. At this point we executed a set of queries to understand if there were slow and/or blocked queries. There were a number of those all around.
- At this point we reviewed if Vacuum was executing to find that
VACUUM ANALYZE public.ci_runners,
VACUUM ANALYZE public.user_activitieswhere executing, some of them for a while already. This turned out to be a red herring.
- We also saw that there were a bunch of different queries executing for a long time in the database already, circa 500 to 600 slow queries were already in the database. So we decided to kill these slow queries to get the database in a better shape and unlock Vacuum so it could finish.
- After killing queries the main offender in slow queries turned to be
SELECT "projects".* FROM "projects" INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"...
- Still load was not going down so we kept killing queries with no form of success. Our reasoning was that vacuum was blocking the database as we have seen before, but even we were killing queries all the time the situation was not improving.
- We left a
while true; ./kill_blocking_queries; sleep 1; doneexecuting in a terminal to keep slow and blocking queries down while vacuum was doing it's job.
- At some point all the vacuum process finished, we cancelled the killing loop and load started going high again.
- 13:40 UTC - Here we took a step back trying to understand what was actually going on, as Vacuum was not to be blamed anymore. We were still piling slow queries and load was not going down.
htopwe realized that memory usage was actually low (32G out of 100G assigned to shared buffers)
- We double checked this in the database itself by issuing the command
show ALL;to then look up the value
shared_buffers | 112896MB
- All the cores where at 100% usage in user space.
- We checked the graphs for postgresql and followed the lead of the table
project_authorizationshaving a spike up to 8M dead tuples at 12:40 UTC, even though the graph was showing a low number now.
- We started taking a wider look at the graphs and realized that the database was using 220Gb of storage space, which was particularly high for what we were used to (180G)
- 13:49 UTC - After some discussion we decided to issue a
VACUUM FULLon project_authorizations as it was the main offender and was not being vacuumed for some reason.
- 13:50 UTC
- storage usage in the database drops 20G,
- load normalizes at 20 and connections start to recover.
- Slow queries count goes away only showing vacuum process as being slow (normal behavior)
- We remove killing scripts and wait a minute monitoring the database to check that it's responding correctly.
- Everything is calmed, the outage is resolved.
How storage dropped when the outage was resolved
General view of the timeline
DB1 host metrics
Wide view of DB1 storage usage
Inline with what was going on with dead tuples, which makes the 8M dead tuples look like a glitch
What went wrong
- Between the point where the high load started (12:43), to the point when I was alerted by @jnijhof (12:50) to the point when we got a pingdom page (13:08) there were 25 minutes of extremely high latency and downtime. Our paging sucked here.
- There where no helper scripts in the database servers, since we changed database servers recently we lost all those scripts that we kept throwing in
/rootin those hosts.
- We had no way to clearly understand what the table bloat is, we need to consider a way of understanding this.
- We were originally not aware of
VACUUM FULLperforming an aggressive storage reclaim fixing table fragmentation.
- We don't have pg_repack available in the new database to use it in the case it is necessary for a similar situation.
What can be better
- Add alerts for database high load
- Add page for when GitLab.com stops replying completely
- Add alert for high number of slow queries in the database
- Add alert for high rate of dead tuples per minute (10k for more than 5 minutes is a lot and will bring up issues like this again)
- Add low level host metrics to the database dashboard
- Add availability graphs to the fleet overview dashboard for clarity
- Add database helper scripts to the chef recipe so we can always find the scripts and we don't need to build them while in the middle of an outage.
- Add pg_repack extension in the new databases to have it as an extra tool for this kind of issues.
- Update the PG under heavy load runbook with all the findings from this outage to keep it up to date
Marked the task Add availability graphs to the fleet overview dashboard for clarity as completedToggle commit list
@pcarranza Prior to gitlab-org/gitlab-ce!7733 (merged) the code would use a serializable transaction. The idea was to prevent the data from the same user being updated concurrently as this could potentially result in an inconsistent state. This worked as follows:
loop until success start serializable transaction delete existing entries for user X insert new entries for user X if commit failed retry
While the removal worked as inspected the queries would block each other on the insert. This meant that if data had to be refreshed for user A, B, and C all of this would happen in serial. Because this code would try to run until success (much like your typical compare-and-swap loop) this could lead to a lot of queries being executed (and failing). I suspect that due to the
DELETEsucceeding this would create a lot of dead tuples, though I expect that not to happen until a transaction commits (but I'm not familiar enough with these internals to be certain). This would explain the VACUUM overhead, the space usage, the dead tuples, etc.
The new setup is basically:
lock entries for user X start regular transaction delete entries for user X insert new entries for user X commit unlock entries for user X
This ensures there's ever only 3 queries being executed (DELETE, SELECT, and an INSERT), and there's no busy loop needed anymore.
@yorickpeterse I was writing an answer and started wondering.
This graph is showing the spike in dead tuples:
But it doesn't explain why we had so much storage used:
Taking a second look at dead tuples in the same timespan look what I found:
So an 8M dead tuple peak is not a peak, it's barely an annoyance.
There you go... we need to alert whenever we reach 100M dead rows, or we need to have a way of alerting whenever we see a huge increase in dead rows and storage usage for a given time. I've no idea how to do this yet.
Support received a request (handled by @athar) from a GitLab.com user on 2016/11/23 regarding an issue "not seeing all created projects on the dashboard page". As @athar did not have staging/production access he was unable to debug the issue (#801 + improvements to SE onboarding). This was escalated and debugged on 2016/11/29. Steps taken:
- Impersonate the user and review dashboard - missing a number of created projects
- Load the rails console and review
current_user.authorized_projects- same result
- Check the users
falseand refreshed the users dashboard - issue resolved
We attempted to first debug this issue on staging.gitlab.com (before screwing with production), however it was borked - #802 (closed)
We should work to identify these occasional issues when first reported (via support). This will involve having cross-team communication from support to dev or infra if we receive a report from a GitLab.com user describing unusual behaviour.
I'll clean up the pile of dead bodies when the dust settles.
What do we have to do to prevent this from happening again?
@stanhu Stop generating so much bloat in the project_authorizations table, or start using pg_repack on this table as it seems to be the main bloat generator available.
@yorickpeterse also mentioned reviewing the way we are updating this table as it seems that we are being too aggressive by basically blowing all the rows and generating them again.
From the alerting standpoint, I just added an alert for db high load with a lot of information, and we should probably alert whenever we see that there is a high dead tuple generation rate to get us ready. It will not prevent this from happening, but it will at least warn with some time to be ready.
marked the task Add alerts for database high load as completedToggle commit list
marked the task Add alert for high number of slow queries in the database as completedToggle commit list
marked the task Add alert for high rate of dead tuples per minute (10k for more than 5 minutes is a lot and will bring up issues like this again) as completedToggle commit list
marked the task Add low level host metrics to the database dashboard as completedToggle commit list
marked the task Add database helper scripts to the chef recipe so we can always find the scripts and we don't need to build them while in the middle of an outage. as completedToggle commit list
I'm closing this issue for now as I'm not clear the next step is to use pg_repack. I think we are going in a different direction here.
closedToggle commit list