Outage 2018-04-17 22:29 - 22:38 UTC
Context
gitlab-ctl reconfigure corrupted the databases.ini file, leaving gitlab.com unable to access the database and serving 500 errors for 9 minutes, from 22:29 until 22:38 UTC.
Timeline
On date: 2018-04-17
- 22:29 - @_stark Running gitlab-ctl reconfigure on pgbouncer node
- 22:30 - @ilyaf uhm... .com times out for me, wonder if its my vpn or not
- 22:30 - Pingdom alerts
- 22:32 - @_stark pastes pgbouncer error:
2018-04-17_22:32:05.88183 2018-04-17 22:32:05.881 109886 WARNING C-0x7f6a64400f20: gitlabhq_production_sidekiq/(nouser)@10.69.6.121:48654 Pooler Error: pgbouncer cannot connect to server
- 22:32 - @ilyaf creates zoom
- 22:35 - @ilyaf !tweet "We're investigating increased number of errors, and will followup with issue shortly"
- 22:36 - @_stark - Argh The same problem we had before, the databases.ini file is empty
- 22:36 - @_stark replaces file with old contents from terminal history and runs
gitlab-ctl restart pgbouncer
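The pgbouncer error pasted at 22:32 has a regular shape, so it can be matched mechanically. The following is a minimal sketch, not the alerting actually deployed: the regular expression is derived only from the single log line quoted in this timeline, and the function names `parse_pooler_error` and `count_pooler_errors` are hypothetical.

```python
import re

# Assumption: other pgbouncer log lines share the layout of the one line
# quoted in the timeline: "<LEVEL> C-0x<hex>: <db>/<user>@<ip>:<port> Pooler Error: <msg>".
POOLER_ERROR = re.compile(
    r"(?P<level>LOG|WARNING|ERROR)\s+C-0x[0-9a-f]+:\s+"
    r"(?P<database>[^/\s]+)/(?P<user>[^@\s]+)@(?P<client>[\d.]+):(?P<port>\d+)\s+"
    r"Pooler Error:\s+(?P<message>.*)$"
)


def parse_pooler_error(line):
    """Return the fields of a pooler error line as a dict, or None if the line is not one."""
    match = POOLER_ERROR.search(line)
    return match.groupdict() if match else None


def count_pooler_errors(lines):
    """Count pooler errors per database, so a spike can be turned into an alert."""
    counts = {}
    for line in lines:
        parsed = parse_pooler_error(line)
        if parsed:
            counts[parsed["database"]] = counts.get(parsed["database"], 0) + 1
    return counts
```

Feeding the log through `count_pooler_errors` and alerting when any database crosses a threshold would point directly at pgbouncer. The threshold itself is a tuning decision this report does not specify.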
Incident Analysis
- Pingdom detected the problem quickly (though @ilyaf was faster)
- Pingdom was not very specific though. The new pgbouncer alerts may help: gitlab-com/runbooks!554 (merged)
- This incident was a repeat of a previous outage: https://gitlab.com/gitlab-com/infrastructure/issues/3876
- The gitlab-ctl reconfigure in this case was needed to make a change to the pgbouncer max_clients parameter, which was needed due to gitlab-com/database#70 (closed)
Root Cause Analysis
- Why did we have downtime: we applied a change to pgbouncer that normally requires a reload
- Why did this cause downtime?: details of this can be found in https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085, but it's not entirely clear yet what happened
What went well
The problem was detected quickly. A call was set up immediately by @ilyaf. Both @stanhu and @_stark identified the problem immediately.
What can be improved
- When we have a known bug from a previous outage we should have a runbook documenting the specific commands needed to recover. @_stark didn't know how it was fixed previously and improvised.
- Better monitoring of PGBouncer including connection errors and configuration errors
- Upgrade omnibus more frequently so the process is well exercised and knowledge is spread
- Run multiple PGBouncer nodes, with load balanced across them by a mechanism that detects connection failures
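One way to catch the configuration error behind this outage is to check databases.ini before pgbouncer is reloaded or restarted. The sketch below is an illustration under stated assumptions, not an existing GitLab tool: the function name `databases_ini_problems` is hypothetical, and it assumes the file is a plain ini file with a `[databases]` section, which is the standard pgbouncer layout.

```python
import configparser


def databases_ini_problems(text):
    """Return a list of problems found in a databases.ini body; an empty list means it looks sane."""
    # The failure seen in this incident: the file existed but had no content.
    if not text.strip():
        return ["file is empty"]

    # Split only on "=" because pgbouncer connection strings contain further
    # "key=value" pairs; disable interpolation so "%" in values is left alone.
    parser = configparser.ConfigParser(
        delimiters=("=",), interpolation=None, strict=False
    )
    try:
        parser.read_string(text)
    except configparser.Error as exc:
        return [f"file does not parse as ini: {exc}"]

    if not parser.has_section("databases"):
        return ["missing [databases] section"]
    if not parser.options("databases"):
        return ["[databases] section has no entries"]
    return []
```

A wrapper could read the file after gitlab-ctl reconfigure and refuse to proceed, or page someone, when the returned list is non-empty. The location of databases.ini on the node is not stated in this report, so the caller would have to supply it.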
Corrective actions
- gitlab-com/database#12 (closed)
- https://gitlab.com/gitlab-com/infrastructure/issues/3442
- gitlab-com/migration#323 (moved)
Guidelines
Edited by Gregory Stark