DB high load on May 25 - outage

What happened

  • At 9:58 UTC we got an alert for high load on DB4 (load over 300)
  • We ssh'd into the DB server
  • We opened a hangout and @jacobvosmaer-gitlab @jnijhof and I joined in.
  • We gathered data and queries (lots of queries)
  • We asked @yorickpeterse to join the hangout as he was also ssh'd into the server.
  • After asking @yorickpeterse through Slack if we could take action or if he was doing something, we decided to restart the DB.
  • The restart did not decrease the load; after a bit we saw high load again. (~10:14 UTC)
  • We thought the issue could be that we hadn't vacuumed the issues table (all the queries we captured were about the issues table)
  • We started a manual vacuum on the issues table (~10:22 UTC; see the sketch after this list)
  • After a while the vacuum was not making progress and the load was still high.
  • We bounced the DB again and started the vacuum right away.
  • After vacuuming just 2 indexes, the process again stopped making progress.
  • We decided to bring up the deploy page to cut requests for a while (~10:49 UTC)
  • The vacuuming progressed and finished fine.
  • We took the deploy page down (~10:54 UTC)
  • Since then the load has been normal; we monitored for a couple of minutes. (~10:59 UTC)
  • We then re-enabled auto-vacuum and turned on logging for it so we can monitor it and rule it out completely (~11:06 UTC)
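
For reference, a minimal sketch of the manual vacuum and the auto-vacuum logging change, assuming psycopg2, a superuser connection, and PostgreSQL 9.4+ for ALTER SYSTEM; the `gitlabhq_production` DSN is an assumption, and the exact commands run during the incident are not recorded in this report:

```python
# Sketch only: assumes psycopg2 and a superuser connection to db4.
import psycopg2

conn = psycopg2.connect("dbname=gitlabhq_production")
conn.autocommit = True  # VACUUM and ALTER SYSTEM cannot run inside a transaction
cur = conn.cursor()

# Manual vacuum of the issues table (VERBOSE so per-index progress is reported).
cur.execute("VACUUM (VERBOSE, ANALYZE) issues;")
print("\n".join(conn.notices))  # psycopg2 collects the VERBOSE output here

# Re-enable auto-vacuum and log every auto-vacuum run so we can rule it out later.
cur.execute("ALTER SYSTEM SET autovacuum = on;")
cur.execute("ALTER SYSTEM SET log_autovacuum_min_duration = 0;")
cur.execute("SELECT pg_reload_conf();")

cur.close()
conn.close()
```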

Assumptions & Guesses

  • We can't blame auto-vacuuming as the source of the problem because it was manually disabled.
  • There was a spike in requests at the load balancers; after a brief discussion we are assuming it is a consequence of the contention in the database (requests are being queued)
  • We suspect there may be a set of queries locking the database, but we don't have strong enough evidence of that yet (see the sketch after this list)
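
To confirm or rule out the locking theory next time, a lock inspection along these lines would help. This is a sketch against the standard pg_locks / pg_stat_activity catalogs (simplified to relation/transaction-level matching), assuming psycopg2 and PostgreSQL 9.2+; it is not the exact query we ran during the incident:

```python
# Sketch: list blocked sessions and the sessions holding the conflicting locks.
import psycopg2

BLOCKING_SQL = """
SELECT blocked.pid         AS blocked_pid,
       blocked_act.query   AS blocked_query,
       blocking.pid        AS blocking_pid,
       blocking_act.query  AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
  ON  blocking.locktype = blocked.locktype
  AND blocking.database      IS NOT DISTINCT FROM blocked.database
  AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
  AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
  AND blocking.pid <> blocked.pid
JOIN pg_stat_activity blocked_act  ON blocked_act.pid  = blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted
  AND blocking.granted;
"""

conn = psycopg2.connect("dbname=gitlabhq_production")
with conn.cursor() as cur:
    cur.execute(BLOCKING_SQL)
    for blocked_pid, blocked_query, blocking_pid, blocking_query in cur:
        print(f"pid {blocked_pid} waits on pid {blocking_pid}:")
        print(f"  blocked : {blocked_query.strip()[:120]}")
        print(f"  blocking: {blocking_query.strip()[:120]}")
conn.close()
```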

Things that worked well

  • The deploy page cut the load right away and allowed the database to finish whatever was locking it. We should do this sooner.
  • Having a script on the root partition to dump slow in-flight queries helped; we should have more of these and document them (a rough equivalent is sketched after this list).
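
The script itself is not reproduced in this report; a rough equivalent of such a dump, assuming psycopg2 and the standard pg_stat_activity view, would look like this:

```python
# Sketch of a "dump slow in-flight queries" helper: print every active query
# that has been running longer than a threshold, slowest first.
import psycopg2

SLOW_SQL = """
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
"""

conn = psycopg2.connect("dbname=gitlabhq_production")
with conn.cursor() as cur:
    cur.execute(SLOW_SQL)
    for pid, duration, state, query in cur:
        print(f"{duration}  pid={pid}  state={state}")
        print(f"  {query.strip()[:200]}")
conn.close()
```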

Things that we can improve

  • We didn't have a quick way of setting a rate limit; we discussed it but abandoned the idea.
  • We don't have a good way of telling whether we are being hammered by a bot or whether all the requests are coming from the same IP. We need to improve our monitoring of the path between the load balancers and the application: today we have to check in way too many places to understand what's going on (a quick log-side check is sketched after this list).
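
As a stop-gap until that monitoring exists, a quick log-side check can at least surface a dominant IP or user agent. This is a sketch only: the log path and the combined-log layout (client IP as the first field, user agent as the last quoted field) are assumptions, not our actual load balancer configuration.

```python
# Sketch: count requests per client IP and per user agent in an LB access log.
import collections
import re
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/nginx/gitlab_access.log"

ips = collections.Counter()
agents = collections.Counter()

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        parts = line.split()
        if parts:
            ips[parts[0]] += 1                   # client IP is the first field
        quoted = re.findall(r'"([^"]*)"', line)  # "request", "referer", "user agent"
        if len(quoted) >= 3:
            agents[quoted[-1]] += 1

print("Top client IPs:")
for ip, count in ips.most_common(10):
    print(f"  {count:8d}  {ip}")

print("Top user agents:")
for agent, count in agents.most_common(10):
    print(f"  {count:8d}  {agent}")
```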

Conversation in infrastructure channel

https://gitlab.slack.com/archives/infrastructure/p1464170095000697

Graphs

Systems metrics

  • DB4 load spike [graph: db4_load]
  • LB requests sample [graph: lb-incoming-requests]

Performance metrics

  • Global requests served [graph: global-transactions]
  • Rails requests served [graph: rails-transactions]
  • Rails timings [graphs: rails-transaction-timings, rails-sql-transaction-timings]
  • API requests served [graph: api-transactions]
  • API timings [graphs: api-transaction-timings, api-sql-transaction-timings]
  • Sidekiq transactions [graph: sidekiq-transactions]

cc/ @stanhu