High database load on primary database

Context

Re-scheduled background migration caused high database load.

Timeline

On date: 2017-12-24

DELETE FROM "redirect_routes" WHERE (EXISTS (
  SELECT 1 FROM routes
  WHERE (
    LOWER(redirect_routes.path) = LOWER(routes.path)
    OR LOWER(redirect_routes.path) LIKE LOWER(CONCAT(routes.path, '/%'))  
  )
  AND routes.id BETWEEN 3295386 AND 3295646
)

Timeline graphs:

Incident Analysis

  • How was the incident detected? PagerDuty alert was generated for the high database load, exactly 2 hours after load started.

  • Is there anything that could have been done to improve the time to detection? @ilyaf: No. However, I think that detecting this by database load is not optimal approach, and we need more visibility into the background_migrations state.

  • How was the root cause discovered? First from monitoring, then from querying database and rails console. Some graphs

  • Was this incident triggered by a change? In a sense, yes. One from two months ago.

  • Was there an existing issue that would have either prevented this incident or reduced the impact? No.

Root Cause Analysis

At 09:00 UTC database load went 100%.

  • Why did database load went up? background migration was removing a lot of rows, causing high load.
  • Why was backround migraiton happening? It was rescheduled from 2 months ago
  • Why was it rescheduled? It failed originally, and was set to be retried later.

...

What went well

  • We quickly identified the source of a problem and mitigated it.

What can be improved

  • We need visibility into postponed background migrations: how many are there now, what impact they will have, some sort of audit.
  • We need a controlled way to execute them, not "some time in the future".
  • We need a way to quickly pause offending queue

Corrective actions

Edited by Ilya Frolov