DB high load on May 25 - outage
What happened
- At 9:58 UTC we got an alert for high load on DB4 (over 300)
- We ssh'd into the DB server
- We opened a hangout; @jacobvosmaer-gitlab, @jnijhof, and I joined in.
- We gathered data and queries (lots of queries)
- We asked @yorickpeterse to join the hangout as he was also ssh'd into the server.
- After checking with @yorickpeterse through Slack whether he was already taking action or we could go ahead, we decided to restart the DB.
- The restart did not decrease the load; after a bit we saw high load again. (~10:14 UTC)
- We thought the issue could be that the issues table had not been vacuumed (all the queries were against the issues table)
- We started a manual vacuum on the issues table. (~10:22 UTC)
- After a while the vacuum was not making progress and the load was still high.
- We bounced the DB again and started the vacuum right away.
- After vacuuming just 2 indexes, the process again stopped making progress.
- We decided to bring up the deploy page to cut requests for a while (~10:49 UTC)
- The vacuuming progressed and finished fine.
- We took the deploy page down (~10:54 UTC)
- Since then the load has been normal; we monitored for a couple of minutes. (~10:59 UTC)
- We then re-enabled auto-vacuum along with auto-vacuum logging so we can monitor it and rule it out completely (~11:06 UTC); see the sketch after this list for roughly what the vacuum and auto-vacuum commands look like.
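
For reference, a minimal sketch of what the manual vacuum and the auto-vacuum re-enable look like on PostgreSQL. This is illustrative, not a transcript of the exact commands we ran; `ALTER SYSTEM` assumes PostgreSQL 9.4+, on older versions the same settings go in `postgresql.conf` followed by a reload:

```sql
-- Manual vacuum of the issues table; VERBOSE prints per-index progress,
-- which is how a stall after two indexes becomes visible.
VACUUM (VERBOSE, ANALYZE) issues;

-- Re-enable auto-vacuum and log every auto-vacuum run so we can
-- correlate it with load (ALTER SYSTEM requires PostgreSQL 9.4+).
ALTER SYSTEM SET autovacuum = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```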
Assumptions & Guesses
- We can't blame auto-vacuum as the source of the problem because it had been manually disabled.
- There was a spike in requests at the load balancers; after a brief discussion we are assuming it is a consequence of the contention happening in the database (requests are being queued)
- We suspect there may be a set of queries locking the database, but we don't have strong enough evidence of that yet; one way to gather that evidence is sketched below, after this list.
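
A hedged sketch of one way to gather that evidence: pair each session that is waiting on a lock with a session holding a conflicting lock on the same object. This is a simplified version of the usual `pg_locks` monitoring query and assumes PostgreSQL 9.2+ (where `pg_stat_activity` exposes `pid` and `query`):

```sql
-- Sessions waiting on a lock, paired with the sessions holding it.
SELECT waiting.pid          AS waiting_pid,
       waiting_act.query    AS waiting_query,
       holding.pid          AS holding_pid,
       holding.mode         AS holding_mode,
       holding_act.query    AS holding_query
FROM pg_locks waiting
JOIN pg_stat_activity waiting_act ON waiting_act.pid = waiting.pid
JOIN pg_locks holding
  ON  holding.locktype      = waiting.locktype
  AND holding.database      IS NOT DISTINCT FROM waiting.database
  AND holding.relation      IS NOT DISTINCT FROM waiting.relation
  AND holding.transactionid IS NOT DISTINCT FROM waiting.transactionid
  AND holding.pid          <> waiting.pid
  AND holding.granted
JOIN pg_stat_activity holding_act ON holding_act.pid = holding.pid
WHERE NOT waiting.granted;
```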
Things that worked well
- The deploy page cut the load right away and allowed the database to finish whatever was locking it. We should do this sooner.
- Having a script in the root partition to dump slow in-flight queries helped; we should have more scripts like this and document them (a sketch of the kind of query such a script runs follows this list).
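
A minimal sketch of the kind of query such a script can run (the actual script on the server may differ); assumes PostgreSQL 9.2+ for the `state` and `query` columns:

```sql
-- Dump non-idle queries that have been running for more than 5 seconds,
-- longest-running first.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY runtime DESC;
```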
Things that we can improve
- We didn't have a quick way of setting a rate limit; we discussed it but abandoned the idea.
- We don't have a good way of telling whether we are being hammered by a bot or whether all the requests are coming from the same IP. We need to improve our monitoring of the path between the load balancers and the application: right now we have to check in way too many places to understand what's going on.
Conversation in infrastructure channel
https://gitlab.slack.com/archives/infrastructure/p1464170095000697
Graphs
Systems metrics
- DB4 load spike
- LB requests sample
Performance metrics
- Global requests served
- Rails requests served
- Rails timings
- API requests served
- API timings
- Sidekiq transactions
cc/ @stanhu