Outage: 2016-09-27 00:45 UTC - GitLab Database Contention & Locking

At 0:45 UTC the database experienced a locking issue caused by contention between a running pg_dump and an outdated pg_repack script. This put the database into a locking scenario that adversely affected the web experience. The pg_repack was killed and the pg_dump was allowed to continue running. The pg_repack was not killed cleanly because the process had become unresponsive. This left behind a lingering trigger, unknown to us at the time, that kept the database from normal operations: it was copying data out of tables into unmanaged temporary repack tables while holding a shared access level lock. It took an additional 8 minutes to find and remove the trigger, after which we flushed the Sidekiq processes whose database connections had timed out.

We will be taking the following actions:

  • re-enable Chef management on the database servers so that updates to function scripts are deployed to the targeted hosts
  • expedite moving to using pg_repack on the entire database instead of one-off table calls
  • wrap pg_dump to not be invoked if a pg_repack process is running
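
The last action could be sketched roughly as follows. This is a minimal illustration, not the wrapper that was actually deployed: the function names, the use of `ps -e -o comm=` to detect pg_repack, and the exit code 75 are all assumptions made for the example.

```python
import os
import subprocess
import sys

# Hypothetical exit code used to signal "skipped, try again later".
SKIPPED_EXIT_CODE = 75


def repack_running(process_names):
    """Return True if any process name (or path) in the iterable is pg_repack."""
    return any(os.path.basename(name.strip()) == "pg_repack" for name in process_names)


def list_process_names():
    """Return the command names of all running processes, via `ps -e -o comm=`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def guarded_pg_dump(pg_dump_args, process_names=None, runner=subprocess.call):
    """Run pg_dump with the given arguments unless a pg_repack process is running.

    Returns pg_dump's exit code, or SKIPPED_EXIT_CODE if the dump was skipped.
    `process_names` and `runner` are injectable so the check can be exercised
    without touching a real database.
    """
    if process_names is None:
        process_names = list_process_names()
    if repack_running(process_names):
        print("pg_repack is running; skipping pg_dump", file=sys.stderr)
        return SKIPPED_EXIT_CODE
    return runner(["pg_dump", *pg_dump_args])

# Example call from a backup job (arguments are placeholders):
#   sys.exit(guarded_pg_dump(["-Fc", "-f", "backup.dump", "mydb"]))
```

A process-list check like this only covers a pg_repack started on the same host before the dump begins; it does not prevent a pg_repack from starting while the dump is already running, so a shared lock file or similar mechanism may still be needed.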