Restore a backup (PITR) and investigate the outage
The following action items were taken:
- Restored from a snapshot https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_526563449
- After a detailed investigation, the root cause has been established https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12801#the-root-cause-of-event
Original issue description is preserved below:
@ahachete @adescoms let's proceed with the idea of a point-in-time recovery from a backup.
Please consider the following points:
- Compare the plans
- Check the statistics from the main tables. I think it will be interesting to understand whether something got "corrupted" or we had a spike in resource usage.
- Please mark the main events of the incident: when we executed analyze, when we saw an improvement, and what that improvement was.
I think we could use database labs for this research.
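
As a rough illustration of the statistics check above, here is a minimal Python sketch that flags metrics whose values differ sharply between two snapshots (for example, one taken on the restored backup and one on the current database). The function name `diff_table_stats`, the 50% threshold, and the sample table and counter values are assumptions for illustration only; the metric names mirror columns of PostgreSQL's `pg_stat_user_tables` view.

```python
from typing import Dict, List, Tuple

# table name -> {metric name -> value}
Stats = Dict[str, Dict[str, float]]


def diff_table_stats(
    before: Stats, after: Stats, threshold: float = 0.5
) -> List[Tuple[str, str, float, float]]:
    """Return (table, metric, before, after) for every metric whose
    relative change exceeds `threshold` (0.5 means 50%)."""
    changes = []
    # Only compare tables and metrics present in both snapshots.
    for table in sorted(before.keys() & after.keys()):
        for metric in sorted(before[table].keys() & after[table].keys()):
            old = before[table][metric]
            new = after[table][metric]
            # Use at least 1.0 as the baseline to avoid dividing by zero.
            baseline = max(abs(old), 1.0)
            if abs(new - old) / baseline > threshold:
                changes.append((table, metric, old, new))
    return changes


if __name__ == "__main__":
    # Hypothetical sample values, not real incident data.
    restored = {"projects": {"n_live_tup": 1000, "n_dead_tup": 10}}
    current = {"projects": {"n_live_tup": 1050, "n_dead_tup": 400}}
    for table, metric, old, new in diff_table_stats(restored, current):
        print(f"{table}.{metric}: {old} -> {new}")
```

The same comparison could be repeated for each marked event in the timeline, so that the report shows which statistics moved and when.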
Acceptance criteria:
- Create a report with the timeline of events and data (access plans and statistics).
Edited by Marin Jankovski