RCA: Hashed Storage Migration left some repos unreachable
Summary
As part of planned maintenance, we migrated project repositories to a different on-disk file path format (hashed storage). The migration did not complete successfully for some repositories, which left them unreachable.
- Maintenance Change Issue: production#658 (closed)
- Overall Migration Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4869
- Project Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4772
- Reasoning for all of the above: https://gitlab.com/gitlab-org/gitlab-ee/issues/8690
- Service(s) affected: GitLab.com
- Team attribution: Geo
- Minutes downtime or degradation: varies
Impact & Metrics
- What was the impact of the incident? Perceived Data Loss
- Who was impacted by this incident? External Customers
- How did the incident impact customers? It prevented customers from reaching their data
- How many customers were affected? An estimated 4453 projects were affected
Detection & Response
- How was the incident detected? Via a customer report: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
- Did alarming work as expected? No alerting exists for this type of work
- How long did it take from the start of the incident to its detection? 2 days 2 hours
- How long did it take from detection to remediation? 32 hours
Timeline
- 2018-01-19 00:48 - Migration start
- 2018-01-20 14:52 - Migration end
- 2018-01-21 02:40 - First customer reported issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
- 2018-01-21 07:16 - First response from GitLab on customer issue
- 2018-01-21 13:01 - A script was run over all repositories detected as incorrectly configured, resolving a large number of them. 558 repos were reported as still needing repair
- 2018-01-22 10:36 - A script was run to fix the remaining repositories
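The detection and remediation figures quoted earlier can be cross-checked against the timeline. The sketch below is only an arithmetic check. It assumes the migration start marks the start of the incident and the final fix script marks remediation, and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline above.
migration_start = datetime.strptime("2018-01-19 00:48", FMT)
first_report = datetime.strptime("2018-01-21 02:40", FMT)
final_fix = datetime.strptime("2018-01-22 10:36", FMT)

# Assumption: the incident starts when the migration starts.
time_to_detect = first_report - migration_start
# Assumption: remediation is complete when the last fix script has run.
time_to_remediate = final_fix - first_report


def hours(delta):
    """Express a timedelta as fractional hours."""
    return delta.total_seconds() / 3600


print(f"time to detect:    {time_to_detect} ({hours(time_to_detect):.1f} h)")
print(f"time to remediate: {time_to_remediate} ({hours(time_to_remediate):.1f} h)")
```

Rounded, both values agree with the "2 days 2 hours" and "32 hours" reported under Detection & Response.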
Root Cause Analysis
The migration moves each project before validating it: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022#note_133198237
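One way to picture the opposite ordering (validate, copy, validate the copy, and only then remove the legacy path) is the minimal Python sketch below. It is not the GitLab implementation, which is Ruby. `looks_like_bare_repo` and `migrate` are hypothetical helpers, the structural check is a deliberately cheap stand-in for real validation, and the `@hashed/<2 chars>/<2 chars>/<sha256 of project id>.git` layout is assumed from the hashed storage documentation.

```python
import hashlib
import shutil
from pathlib import Path


def hashed_path(root: Path, project_id: int) -> Path:
    """Target path for a project (layout assumed, see lead-in)."""
    digest = hashlib.sha256(str(project_id).encode()).hexdigest()
    return root / "@hashed" / digest[:2] / digest[2:4] / f"{digest}.git"


def looks_like_bare_repo(path: Path) -> bool:
    """Hypothetical validation: a cheap structural check only."""
    return (
        (path / "HEAD").is_file()
        and (path / "objects").is_dir()
        and (path / "refs").is_dir()
    )


def migrate(legacy: Path, target: Path) -> bool:
    """Return True if the repository was moved, False if it was left alone."""
    # Validate BEFORE touching anything, and never overwrite a target.
    if not looks_like_bare_repo(legacy) or target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(legacy, target)
    # Validate the copy; on failure, roll back and keep the legacy path.
    if not looks_like_bare_repo(target):
        shutil.rmtree(target, ignore_errors=True)
        return False
    # Only now is it safe to remove the legacy location.
    shutil.rmtree(legacy)
    return True
```

With this ordering, a repository that fails either check stays reachable at its legacy path and can be retried later.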
What went well
- 7.6M projects are successfully migrated.
- 4.4k projects have a migrated repo, but the attachments aren't migrated yet.
- 3.7k projects are still on legacy storage.
- Error rate: 0.05%
What can be improved
- The same work was being done on the staging environment at the same time. Instead, the staging work should have been completed and validated before starting in the production environment.
- We should have known, or at least predicted, which errors are common during this type of migration. With that information we could have monitored Sentry for problems more effectively.
- Time to correct all repos could be improved. We are going to modify our internal procedures to request that developers be available to help when problems are detected.
- Not all rake tasks provide a dry-run mode that lets us assess risk. When we plan work, we need to evaluate the risk properly and know how to mitigate it before executing the task.
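The dry-run idea from the last point could look roughly like the Python sketch below. It does not describe any existing rake task. `run_migration`, `Report`, and the `check` and `migrate` callables are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Report:
    # Projects that passed the pre-check (migrated unless dry_run is set).
    eligible: List[int] = field(default_factory=list)
    # Projects that failed the pre-check, with the reason.
    skipped: List[Tuple[int, str]] = field(default_factory=list)


def run_migration(
    project_ids: Iterable[int],
    check: Callable[[int], Optional[str]],  # returns None if OK, else a reason
    migrate: Callable[[int], None],
    dry_run: bool = True,
) -> Report:
    """Run the pre-check for every project; only act when dry_run is False."""
    report = Report()
    for pid in project_ids:
        reason = check(pid)
        if reason is not None:
            report.skipped.append((pid, reason))
            continue
        if not dry_run:
            migrate(pid)
        report.eligible.append(pid)
    return report
```

Defaulting `dry_run` to `True` means an operator has to opt in to making changes, and the skipped list from a dry run gives an early view of which errors to expect and watch for.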
Corrective actions
- https://gitlab.com/gitlab-org/gitlab-ce/issues/56618
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6001
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6091
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6092
- https://gitlab.com/gitlab-org/gitlab-ee/issues/9414
Edited by John Skarbek