2020-12-30: WAL-G Backup failed
Summary
The WAL-G base backups, which run at midnight UTC, failed.
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-12-30
00:05 - ggillies gets paged (https://gitlab.pagerduty.com/incidents/P8Q141U) saying "Last successful WAL-G basebackup was seen 42.87s ago" for env gprd.
00:15 - ggillies determines that the full backup job is running on patroni-08; after discussion with @dawsmith, they conclude this is quite possibly a known false alert (based on discussions around the change request to move to WAL-G). ggillies tails the backup log file and continues to watch it.
03:07 - ggillies gets paged (https://gitlab.pagerduty.com/incidents/P4HKELQ) saying the backup job has failed.
03:12 - ggillies declares an incident in Slack.
03:16 - ggillies opens a tmux session on patroni-08 and manually kicks off another backup job to ensure we actually get a successful backup (see the sketch after the timeline).
08:25 - The manually triggered backup job finishes successfully.
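For reference, the manual re-run in the tmux session looks roughly like the sketch below using stock WAL-G tooling. This is a minimal sketch only: the envdir path, Postgres user, data directory, and log path are assumptions, and the scheduled job on the patroni hosts is driven by a wrapper script that may differ.

```shell
# Run inside tmux so the multi-hour backup survives an SSH disconnect.
tmux new-session -s walg-backup

# Push a fresh base backup to object storage as the database user,
# teeing the output to a local log we can tail while it runs.
# (envdir path, user, and data directory are assumptions.)
sudo -u gitlab-psql envdir /etc/wal-g.d/env \
  wal-g backup-push /var/opt/gitlab/postgresql/data 2>&1 \
  | tee /tmp/wal-g-manual-backup.log

# Confirm the new base backup appears in the remote listing; its
# timestamp is also what the freshness alert above is derived from.
sudo -u gitlab-psql envdir /etc/wal-g.d/env wal-g backup-list
```

`wal-g backup-push` and `wal-g backup-list` are the standard WAL-G commands for creating and listing base backups; running them under `envdir` is one common way to supply the storage credentials.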
Corrective Actions
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)