# High database load on primary database
## Context
A re-scheduled background migration caused high load on the primary database.
## Timeline
Date: 2017-12-24
- 09:00 UTC - Database load went up
- 11:05 UTC - Alert generated by PagerDuty; investigation started. Slack conversation: https://gitlab.slack.com/archives/C101F3796/p1514113546000018
- 11:17 UTC - @_stark identified the offending query:

  ```sql
  DELETE FROM "redirect_routes"
  WHERE (EXISTS (
    SELECT 1 FROM routes
    WHERE (
      LOWER(redirect_routes.path) = LOWER(routes.path)
      OR LOWER(redirect_routes.path) LIKE LOWER(CONCAT(routes.path, '/%'))
    )
    AND routes.id BETWEEN 3295386 AND 3295646
  ))
  ```
- 11:35 UTC - We're disabling the `background_migrations` queue to ease the pressure (MR: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1437)
- 11:37 UTC - While we do that, we try to pause the queue with `Sidekiq::Queue.new('background_migration').pause`, but that doesn't work (see the queue-pausing sketch after this list)
- 11:45 UTC - Change is live
- 11:55 UTC - `project_cache` queue starts to drop. Sidekiq stats: https://performance.gitlab.net/dashboard/db/sidekiq-stats?orgId=1&from=1514105230636&to=1514117920523
- 12:01 UTC - @ayufan finds out that the background migration that got picked up today is from `1508191326.1909287` => 10/16/2017 @ 10:02pm (UTC) (see the timestamp-decoding sketch after this list)
- 12:01 UTC - Alarm cleared for DB load
- 12:06 UTC - Queue sizes are back to zero; everything is normal (with respect to the new meanings of "zero" and "normal")
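A note on the 11:37 UTC pause attempt: `pause` is not part of core OSS Sidekiq's `Sidekiq::Queue` API (queue pausing comes from Sidekiq Pro or the `sidekiq-limit_fetch` gem), which is one plausible reason the call did nothing. Below is a minimal sketch of what a working pause would look like, assuming `sidekiq-limit_fetch` is available; this is an assumption, not a description of what was deployed that day.

```ruby
# Sketch only: assumes the sidekiq-limit_fetch gem is in the bundle and
# loaded in the Rails console; core OSS Sidekiq does not provide Queue#pause.
require 'sidekiq'
require 'sidekiq-limit_fetch'

queue = Sidekiq::Queue['background_migration']

queue.pause     # stop workers from fetching this queue; jobs stay enqueued
queue.paused?   # => true
# ... let the database recover, investigate ...
queue.unpause   # resume processing
```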
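On the 12:01 UTC finding: `1508191326.1909287` is a Unix-epoch timestamp of the kind Sidekiq keeps in job payloads (`created_at`, `enqueued_at`) and as scores in its scheduled and retry sets. A small sketch of decoding it, and of listing background-migration jobs still waiting in the scheduled set, from a Rails console; `BackgroundMigrationWorker` is GitLab's worker class, the rest is illustrative.

```ruby
require 'sidekiq/api'

# The raw value is seconds since the Unix epoch; decode it directly.
Time.at(1508191326.1909287).utc
# => 2017-10-16 22:02:06 UTC

# List background-migration jobs still waiting in the scheduled set,
# with the time each one is due to run.
Sidekiq::ScheduledSet.new.each do |job|
  next unless job.klass == 'BackgroundMigrationWorker'
  puts "#{job.at.utc} #{job.args.inspect}"
end
```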
Timeline graphs:
- sidekiq: https://performance.gitlab.net/dashboard/db/sidekiq-stats?orgId=1&from=1514105230636&to=1514117920523
- database: https://performance.gitlab.net/dashboard/db/postgres-stats?orgId=1&from=1514105230636&to=1514117920523
## Incident Analysis
- How was the incident detected? A PagerDuty alert was generated for the high database load, exactly 2 hours after the load started.
- Is there anything that could have been done to improve the time to detection? @ilyaf: No. However, I think that detecting this by database load is not the optimal approach, and we need more visibility into the `background_migrations` state.
- How was the root cause discovered? First from monitoring, then from querying the database and the Rails console. Some graphs
- Was this incident triggered by a change? In a sense, yes. One from two months ago.
- Was there an existing issue that would have either prevented this incident or reduced the impact? No.
## Root Cause Analysis
At 09:00 UTC, database load went to 100%.
- Why did database load go up? A background migration was removing a lot of rows, causing high load.
- Why was the background migration happening? It was rescheduled from 2 months ago.
- Why was it rescheduled? It failed originally and was set to be retried later.
...
## What went well
- We quickly identified the source of the problem and mitigated it.
## What can be improved
- We need visibility into postponed background migrations: how many there are right now and what impact they will have; some sort of audit (see the audit sketch after this list).
- We need a controlled way to execute them, not "some time in the future".
- We need a way to quickly pause an offending queue.
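As a starting point for the visibility item above, a rough sketch of an audit that could be run from the Rails console: it counts pending `BackgroundMigrationWorker` jobs in the live queue, the scheduled set, and the retry set, grouped by migration class. The grouping assumes the migration class name is the worker's first argument; the output format and names are illustrative.

```ruby
require 'sidekiq/api'

WORKER = 'BackgroundMigrationWorker'.freeze

sources = {
  queued:    Sidekiq::Queue.new('background_migration'),
  scheduled: Sidekiq::ScheduledSet.new,
  retrying:  Sidekiq::RetrySet.new
}

sources.each do |label, jobs|
  pending = jobs.select { |job| job.klass == WORKER }
  # Assumes the migration class name is passed as the first job argument.
  per_migration = pending.group_by { |job| job.args.first }

  puts "#{label}: #{pending.size} pending job(s)"
  per_migration.each { |name, list| puts "  #{name}: #{list.size}" }
end
```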
## Corrective actions
- TODO
- Apart from the corrective actions, before closing this one, we should revert the alert threshold bump: gitlab-com/runbooks!478 (merged)