2020-11-01 postgres replication lag high; sidekiq saturation
Summary
Context will be added here as we investigate.
Timeline
All times UTC.
2020-11-01
- 00:00 - Monthly CI minute reset sidekiq job kicks off, scheduling 100 batch workers to perform the actual reset.
- 00:07 - "Postgres Replication lag (in bytes) is high" fires on patroni-07
- 00:12 - cmiskell declares incident in Slack.
- 00:12-01:15 - various sundry other alerts fire, including:
  - PostgreSQL_ReplicationLagBytesTooLarge
  - The shard_urgent_cpu_bound component of the sidekiq service, (main stage), has an apdex-score burn rate outside of SLO
  - Postgres seems to be processing very few transactions
  - PostgreSQL_CommitRateTooLow
  - Increased Error Rate Across Fleet
  - Large amount of Sidekiq Queued jobs
  - Postgres Replication lag is over 2 minutes
  - PostgreSQL dead tuples is too large
  - PostgreSQL_TooManyDeadTuples
- 01:05 - The batch workers generally start finishing
- 01:15 - All errors have cleared, incident is deemed to be over
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Craig Miskell