Outage 2018-04-17 22:29 - 22:38 UTC

Context

gitlab-ctl reconfigure corrupted the databases.ini file, leaving gitlab.com unable to access the database and serving 500 errors for 9 minutes, from 22:29 until 22:38 UTC.

Timeline

On date: 2018-04-17

  • 22:29 - @_stark runs gitlab-ctl reconfigure on the pgbouncer node
  • 22:30 - @ilyaf uhm... .com times out for me, wonder if its my vpn or not
  • 22:30 - Pingdom alerts
  • 22:32 - @_stark pastes pgbouncer error: 2018-04-17_22:32:05.88183 2018-04-17 22:32:05.881 109886 WARNING C-0x7f6a64400f20: gitlabhq_production_sidekiq/(nouser)@10.69.6.121:48654 Pooler Error: pgbouncer cannot connect to server
  • 22:32 - @ilyaf creates zoom
  • 22:35 - @ilyaf !tweet "We're investigating increased number of errors, and will followup with issue shortly"
  • 22:36 - @_stark - Argh The same problem we had before, the databases.ini file is empty
  • 22:36 - @_stark replaces the file with its old contents from terminal history and runs gitlab-ctl restart pgbouncer (a recovery sketch follows below)
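
For reference, a minimal sketch of the recovery performed at 22:36, assuming a typical Omnibus pgbouncer layout; the file path, the example [databases] stanza, and the .bak copy are illustrative assumptions, not taken from the incident:

```shell
# Recovery sketch (assumed paths/names; adjust for the actual node).

# 1. Confirm the symptom: the generated databases.ini is empty.
sudo cat /var/opt/gitlab/pgbouncer/databases.ini

# 2. Restore the previous contents -- from a backup copy, or as in this
#    incident from terminal history. A healthy file contains something like:
#
#      [databases]
#      gitlabhq_production = host=<primary-db-host> port=5432 auth_user=pgbouncer
#
sudo cp /var/opt/gitlab/pgbouncer/databases.ini.bak /var/opt/gitlab/pgbouncer/databases.ini

# 3. Restart pgbouncer so it reads the restored file.
sudo gitlab-ctl restart pgbouncer
```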

Incident Analysis

  • Pingdom detected the problem quickly (though @ilyaf was faster)
  • Pingdom was not very specific though. The new pgbouncer alerts may help: gitlab-com/runbooks!554 (merged)
  • This incident was a repeat of a previous outage: https://gitlab.com/gitlab-com/infrastructure/issues/3876
  • The gitlab-ctl reconfigure in this case was needed to change the pgbouncer max_clients parameter, which was required due to gitlab-com/database#70 (closed) (see the sketch below)
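
As a rough illustration of how that change would be applied (the gitlab.rb attribute name and the value shown are assumptions; the incident only records that the pgbouncer max_clients limit was being raised for gitlab-com/database#70):

```shell
# Hypothetical sketch: raise the pgbouncer client connection limit via
# Omnibus and regenerate the config. Attribute name and value are assumed.
#
#   # /etc/gitlab/gitlab.rb on the pgbouncer node
#   pgbouncer['max_client_conn'] = 2048
#
# This reconfigure is the step that left databases.ini empty in this incident.
sudo gitlab-ctl reconfigure
```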

Root Cause Analysis

  1. Why did we have downtime? We applied a change to pgbouncer that normally requires only a reload (see the reload sketch after this list)
  2. Why did this cause downtime? Details can be found in https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085, but it is not yet entirely clear what happened
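
For context on point 1: pgbouncer can normally pick up configuration changes via a RELOAD on its admin console without dropping client connections, so a change like this should not need a restart at all. A minimal sketch, with host, port, and user assumed for a default pgbouncer setup:

```shell
# Ask pgbouncer to re-read its configuration files without a restart.
# Connection details (host/port/user) are assumptions for a default setup.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c 'RELOAD;'
```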

What went well

The problem was detected quickly. A call was set up immediately by @ilyaf. Both @stanhu and @_stark identified the problem immediately.

What can be improved

  • When we have a known bug from a previous outage, we should have a runbook documenting the specific commands needed to recover; @_stark didn't know how it was fixed previously and improvised
  • Better monitoring of pgbouncer, including connection errors and configuration errors (see the sketch after this list)
  • Upgrade omnibus more frequently so the process is well exercised and knowledge is spread
  • Run multiple pgbouncer nodes with load balanced between them, using a mechanism that detects connection failures
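
A sketch of the kind of pgbouncer check the monitoring bullet has in mind (the file path and connection details are assumptions for a typical setup):

```shell
# Alert if the generated databases.ini is empty.
test -s /var/opt/gitlab/pgbouncer/databases.ini \
  || echo "ALERT: pgbouncer databases.ini is empty"

# Alert if pgbouncer no longer knows about the production database.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -t -c 'SHOW DATABASES;' \
  | grep -q gitlabhq_production \
  || echo "ALERT: gitlabhq_production missing from pgbouncer"
```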

Corrective actions

  • gitlab-com/database#12 (closed)
  • https://gitlab.com/gitlab-com/infrastructure/issues/3442
  • gitlab-com/migration#323 (moved)

Guidelines

  • Blameless Postmortems Guideline
  • 5 whys