Outage 2018-04-17 22:29 - 22:38 UTC

Context

gitlab-ctl reconfigure corrupted the databases.ini file, leaving gitlab.com unable to access the database and serving 500 errors for 9 minutes, from 22:29 until 22:38 UTC.

Timeline

On date: 2018-04-17

  • 22:29 - @_stark runs gitlab-ctl reconfigure on the pgbouncer node
  • 22:30 - @ilyaf uhm... .com times out for me, wonder if its my vpn or not
  • 22:30 - Pingdom alerts
  • 22:32 - @_stark pastes pgbouncer error: 2018-04-17_22:32:05.88183 2018-04-17 22:32:05.881 109886 WARNING C-0x7f6a64400f20: gitlabhq_production_sidekiq/(nouser)@10.69.6.121:48654 Pooler Error: pgbouncer cannot connect to server
  • 22:32 - @ilyaf creates zoom
  • 22:35 - @ilyaf !tweet "We're investigating increased number of errors, and will followup with issue shortly"
  • 22:36 - @_stark - Argh The same problem we had before, the databases.ini file is empty
  • 22:36 - @_stark replaces the file with its old contents from terminal history and runs gitlab-ctl restart pgbouncer (a recovery sketch follows below)
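
For reference, a minimal sketch of the recovery performed at 22:36, assuming a typical Omnibus pgbouncer layout; the file path, the example [databases] stanza, and the .bak copy are illustrative assumptions, not taken from the incident:

```shell
# Recovery sketch (assumed paths/names; adjust for the actual node).

# 1. Confirm the symptom: the generated databases.ini is empty.
sudo cat /var/opt/gitlab/pgbouncer/databases.ini

# 2. Restore the previous contents -- from a backup copy, or as in this
#    incident from terminal history. A healthy file contains something like:
#
#      [databases]
#      gitlabhq_production = host=<primary-db-host> port=5432 auth_user=pgbouncer
#
sudo cp /var/opt/gitlab/pgbouncer/databases.ini.bak /var/opt/gitlab/pgbouncer/databases.ini

# 3. Restart pgbouncer so it reads the restored file.
sudo gitlab-ctl restart pgbouncer
```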

Incident Analysis

  • Pingdom detected the problem quickly (though @ilyaf was faster)
  • Pingdom was not very specific though. The new pgbouncer alerts may help: gitlab-com/runbooks!554 (merged)
  • This incident was a repeat of a previous outage: https://gitlab.com/gitlab-com/infrastructure/issues/3876
  • The gitlab-ctl reconfigure in this case was needed to change the pgbouncer max_clients parameter, which was required due to gitlab-com/database#70 (closed) (see the sketch below)
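
As a rough illustration of how that change would be applied (the gitlab.rb attribute name and the value shown are assumptions; the incident only records that the pgbouncer max_clients limit was being raised for gitlab-com/database#70):

```shell
# Hypothetical sketch: raise the pgbouncer client connection limit via
# Omnibus and regenerate the config. Attribute name and value are assumed.
#
#   # /etc/gitlab/gitlab.rb on the pgbouncer node
#   pgbouncer['max_client_conn'] = 2048
#
# This reconfigure is the step that left databases.ini empty in this incident.
sudo gitlab-ctl reconfigure
```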

Root Cause Analysis

  1. Why did we have downtime? We applied a change to pgbouncer that normally requires only a reload (see the reload sketch after this list)
  2. Why did this cause downtime? Details can be found in https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085, but it is not yet entirely clear what happened
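
For context on point 1: pgbouncer can normally pick up configuration changes via a RELOAD on its admin console without dropping client connections, so a change like this should not need a restart at all. A minimal sketch, with host, port, and user assumed for a default pgbouncer setup:

```shell
# Ask pgbouncer to re-read its configuration files without a restart.
# Connection details (host/port/user) are assumptions for a default setup.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'RELOAD;'
```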

What went well

The problem was detected quickly. A call was set up immediately by @ilyaf. Both @stanhu and @_stark identified the problem immediately.

What can be improved

  • When we have a known bug from a previous outage, we should have a runbook documenting the specific commands needed to recover; @_stark didn't know how it was fixed previously and improvised
  • Better monitoring of pgbouncer, including connection errors and configuration errors (see the sketch after this list)
  • Upgrade omnibus more frequently so the process is well exercised and knowledge is spread
  • Run multiple pgbouncer nodes with load balanced between them, using a mechanism that detects connection failures
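
A sketch of the kind of pgbouncer check the monitoring bullet has in mind (the file path and connection details are assumptions for a typical setup):

```shell
# Alert if the generated databases.ini is empty.
test -s /var/opt/gitlab/pgbouncer/databases.ini \
  || echo "ALERT: pgbouncer databases.ini is empty"

# Alert if pgbouncer no longer knows about the production database.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -t -c 'SHOW DATABASES;' \
  | grep -q gitlabhq_production \
  || echo "ALERT: gitlabhq_production missing from pgbouncer"
```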

Corrective actions

  • gitlab-com/database#12 (closed)
  • https://gitlab.com/gitlab-com/infrastructure/issues/3442
  • gitlab-com/migration#323 (moved)

Guidelines

  • Blameless Postmortems Guideline
  • 5 whys