Consider a notification when migrations are taking a long time to run
Problem Statement
Database migrations can sometimes run long for legitimate reasons. However, we have no protection or alerting for migrations that are simply stuck. In production#4879 (closed), the migration that led to the failure was stuck and could never proceed. Had we been notified before the 5-hour timeout, we could have started investigating sooner. That would have reduced stress on the release manager and, depending on the timing, allowed other team members to help with the investigation.
Solution
Determine a method for alerting ourselves when database migrations are taking longer than "normal". This requires first defining what "normal" is.
Milestones
- Define a normal operating time for database migration runs
- Implement a notification method to alert the current release managers when a database migration job has exceeded the above threshold
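
As a starting point for discussion, the two milestones could be sketched roughly as follows. This is only an illustration under stated assumptions: the issue does not yet define "normal", so the 95th percentile of past run times and the 1.5x safety margin are placeholders, and the function names `normal_threshold` and `is_overdue` are hypothetical. How past durations are collected and how the release managers are actually notified are left out.

```python
from datetime import datetime, timedelta, timezone
from statistics import quantiles


def normal_threshold(past_durations, margin=1.5):
    """Derive an alert threshold from past migration run times.

    Assumption: "normal" is the 95th percentile of past durations,
    multiplied by a safety margin. Both choices are placeholders.
    """
    if len(past_durations) < 2:
        raise ValueError("need at least two past durations")
    seconds = [d.total_seconds() for d in past_durations]
    # With n=20 the last cut point is the 95th percentile; the
    # "inclusive" method keeps the result within the observed range.
    p95 = quantiles(seconds, n=20, method="inclusive")[-1]
    return timedelta(seconds=p95 * margin)


def is_overdue(started_at, threshold, now=None):
    """Return True if a job started at `started_at` has run past `threshold`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - started_at > threshold
```

A periodic check could call `is_overdue` with the start time of the running migration job and notify the current release managers when it returns `True`, well before the 5-hour timeout is reached.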