# High database load on primary database
## Context
A re-scheduled background migration caused high load on the primary database.
## Timeline
Date: 2017-12-24
- 09:00 UTC - Database load went up
- 11:05 UTC - Alert generated by PagerDuty; investigation started. Slack conversation: https://gitlab.slack.com/archives/C101F3796/p1514113546000018
- 11:17 UTC - @_stark identified the offending query:

  ```sql
  DELETE FROM "redirect_routes"
  WHERE (EXISTS (
    SELECT 1 FROM routes
    WHERE (
      LOWER(redirect_routes.path) = LOWER(routes.path)
      OR LOWER(redirect_routes.path) LIKE LOWER(CONCAT(routes.path, '/%'))
    )
    AND routes.id BETWEEN 3295386 AND 3295646
  ))
  ```
- 11:35 UTC - We're disabling the `background_migrations` queue to ease the pressure (MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1437)
- 11:37 UTC - While we do that, we try to pause the queue with `Sidekiq::Queue.new('background_migration').pause`, but that doesn't work (see the queue-pausing sketch after this list)
- 11:45 UTC - Change is live
- 11:55 UTC - `project_cache` queue starts to drop. Sidekiq stats: https://performance.gitlab.net/dashboard/db/sidekiq-stats?orgId=1&from=1514105230636&to=1514117920523
- 12:01 UTC - @ayufan finds out that the background migration that got picked up today is from `1508191326.1909287` => 10/16/2017 @ 10:02pm (UTC) (see the timestamp-decoding sketch after this list)
- 12:01 UTC - Alarm cleared for DB load
- 12:06 UTC - Queue sizes are back to zero; everything is normal (with respect to the new meanings of "zero" and "normal")
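A note on the 11:37 UTC pause attempt: `pause` is not part of core OSS Sidekiq's `Sidekiq::Queue` API (queue pausing comes from Sidekiq Pro or the `sidekiq-limit_fetch` gem), which is one plausible reason the call did nothing. Below is a minimal sketch of what a working pause would look like, assuming `sidekiq-limit_fetch` is available; this is an assumption, not a description of what was deployed that day.

```ruby
# Sketch only: assumes the sidekiq-limit_fetch gem is in the bundle and
# loaded in the Rails console; core OSS Sidekiq does not provide Queue#pause.
require 'sidekiq'
require 'sidekiq-limit_fetch'

queue = Sidekiq::Queue['background_migration']

queue.pause     # stop workers from fetching this queue; jobs stay enqueued
queue.paused?   # => true
# ... let the database recover, investigate ...
queue.unpause   # resume processing
```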
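On the 12:01 UTC finding: `1508191326.1909287` is a Unix-epoch timestamp of the kind Sidekiq keeps in job payloads (`created_at`, `enqueued_at`) and as scores in its scheduled and retry sets. A small sketch of decoding it, and of listing background-migration jobs still waiting in the scheduled set, from a Rails console; `BackgroundMigrationWorker` is GitLab's worker class, the rest is illustrative.

```ruby
require 'sidekiq/api'

# The raw value is seconds since the Unix epoch; decode it directly.
Time.at(1508191326.1909287).utc
# => 2017-10-16 22:02:06 UTC

# List background-migration jobs still waiting in the scheduled set,
# with the time each one is due to run.
Sidekiq::ScheduledSet.new.each do |job|
  next unless job.klass == 'BackgroundMigrationWorker'
  puts "#{job.at.utc} #{job.args.inspect}"
end
```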
Timeline graphs:
- sidekiq: https://performance.gitlab.net/dashboard/db/sidekiq-stats?orgId=1&from=1514105230636&to=1514117920523
- database: https://performance.gitlab.net/dashboard/db/postgres-stats?orgId=1&from=1514105230636&to=1514117920523
## Incident Analysis
- How was the incident detected? A PagerDuty alert was generated for the high database load, exactly 2 hours after the load started.
- Is there anything that could have been done to improve the time to detection? @ilyaf: No. However, I think that detecting this by database load is not the optimal approach, and we need more visibility into the `background_migrations` state.
- How was the root cause discovered? First from monitoring, then from querying the database and the Rails console. Some graphs
- Was this incident triggered by a change? In a sense, yes. One from two months ago.
- Was there an existing issue that would have either prevented this incident or reduced the impact? No.
## Root Cause Analysis
At 09:00 UTC, database load went to 100%.
- Why did database load go up? A background migration was removing a lot of rows, causing high load.
- Why was the background migration happening? It was rescheduled from 2 months ago.
- Why was it rescheduled? It failed originally and was set to be retried later.
...
## What went well
- We quickly identified the source of the problem and mitigated it.
## What can be improved
- We need visibility into postponed background migrations: how many there are right now and what impact they will have; some sort of audit (see the audit sketch after this list).
- We need a controlled way to execute them, not "some time in the future".
- We need a way to quickly pause an offending queue.
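As a starting point for the visibility item above, a rough sketch of an audit that could be run from the Rails console: it counts pending `BackgroundMigrationWorker` jobs in the live queue, the scheduled set, and the retry set, grouped by migration class. The grouping assumes the migration class name is the worker's first argument; the output format and names are illustrative.

```ruby
require 'sidekiq/api'

WORKER = 'BackgroundMigrationWorker'.freeze

sources = {
  queued:    Sidekiq::Queue.new('background_migration'),
  scheduled: Sidekiq::ScheduledSet.new,
  retrying:  Sidekiq::RetrySet.new
}

sources.each do |label, jobs|
  pending = jobs.select { |job| job.klass == WORKER }
  # Assumes the migration class name is passed as the first job argument.
  per_migration = pending.group_by { |job| job.args.first }

  puts "#{label}: #{pending.size} pending job(s)"
  per_migration.each { |name, list| puts "  #{name}: #{list.size}" }
end
```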
## Corrective actions
- TODO
- Apart from the corrective actions, before closing this one, we should revert the alert threshold bump: gitlab-com/runbooks!478 (merged)