2020-06-24: Unusual logs from pgbouncer
/label incident IncidentActive
Summary
After a reboot, pgbouncer-01 in production could not resolve the Patroni master node via DNS.
Possible pgbouncer problems? Investigating.
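The DNS failure described in the summary can be checked from the affected host with a small script. The following is a minimal sketch, not the procedure used during this incident: the function name `resolve_addresses`, the hostname `patroni-master.invalid`, and port 5432 (the PostgreSQL default) are all assumptions for illustration. The resolver is injectable so the logic can be exercised without real DNS.

```python
import socket
from typing import Callable, List


def resolve_addresses(
    hostname: str,
    port: int = 5432,  # assumption: PostgreSQL default port
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if DNS resolution fails."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved (for example NXDOMAIN or no resolver available).
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    # Hypothetical hostname; replace with the name pgbouncer is configured to connect to.
    host = "patroni-master.invalid"
    addrs = resolve_addresses(host)
    print(f"{host}: {addrs if addrs else 'DNS resolution failed'}")
```

An empty result on the pgbouncer host, while the same name resolves elsewhere, would point at the host's resolver configuration rather than at pgbouncer itself.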
Timeline
All times UTC.
2020-06-24
- 21:54 - An SRE announces in the #production channel that they are seeing something weird that then went away:
$ sudo gitlab-rails console
Traceback (most recent call last):
85: from bin/rails:4:in `<main>'
<elided>
3: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/activerecord-6.0.3.1/lib/active_record/connection_adapters/postgresql_adapter.rb:46:in `postgresql_connection'
2: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/pg-1.2.2/lib/pg.rb:58:in `connect'
1: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/pg-1.2.2/lib/pg.rb:58:in `new'
/opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/pg-1.2.2/lib/pg.rb:58:in `initialize': ERROR: pgbouncer cannot connect to server (PG::ConnectionBad)
- 22:03 - Another SRE notes that they are seeing errors in the pgbouncer logs
- 22:16 - cmcfarland declares incident in Slack using the /incident declare command.
- 22:?? - pgbouncer service is identified to be doing little to no real work, so it is restarted
- 22:?? - the error messages went away and pgbouncer started to show signs of performing work
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Cameron McFarland