
Restore a backup (PITR) and investigate the outage

The following actions were taken:

  1. Restored from a snapshot: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_526563449
  2. After a detailed investigation, the root cause was established: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12801#the-root-cause-of-event

The original issue description is preserved below:

@ahachete @adescoms let's proceed with the point-in-time recovery (PITR) from a backup.

Please consider the following points:

  • Compare the plans
  • Check the statistics for the main tables. I think it will be interesting to understand whether something got "corrupted" or whether we had a spike in resource usage.
  • Please mark the main events of the incident: when we executed ANALYZE, when we saw an improvement, and what that improvement was.
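
As a rough illustration of the statistics check, here is a minimal Python sketch that compares per-table counters from the restored PITR copy with the current ones and flags large relative changes. The function name `diff_table_stats`, the threshold, and all the numbers are hypothetical; the counter names `n_live_tup` and `n_dead_tup` are columns of PostgreSQL's `pg_stat_user_tables` view, and how the snapshots are collected is left open.

```python
def diff_table_stats(before, after, threshold=0.5):
    """Return counters whose relative change exceeds `threshold`.

    `before` and `after` map table name -> {counter name -> value}.
    The result maps table name -> {counter name -> relative change}.
    """
    flagged = {}
    for table, old in before.items():
        new = after.get(table)
        if new is None:
            continue  # table missing in the second snapshot, nothing to compare
        for key, old_value in old.items():
            new_value = new.get(key)
            if new_value is None:
                continue
            base = max(abs(old_value), 1)  # avoid division by zero
            change = (new_value - old_value) / base
            if abs(change) > threshold:
                flagged.setdefault(table, {})[key] = round(change, 2)
    return flagged


# Hypothetical values, not taken from the incident.
restored = {"projects": {"n_live_tup": 1000, "n_dead_tup": 10}}
current = {"projects": {"n_live_tup": 1050, "n_dead_tup": 400}}

print(diff_table_stats(restored, current))
```

The same comparison could be repeated at each marked event (for example before and after ANALYZE) to see where the statistics diverged.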

I think we could use database labs for this research.

Acceptance criteria:

  • Create a report with the timeline of events and the supporting data (access plans and statistics).
Edited by Marin Jankovski