RCA: Hashed Storage Migration left some repos unreachable

Summary

During planned maintenance, the migration of our project repositories to a different on-disk file path format did not complete successfully for some repositories, leaving them unreachable.

Maintenance Change Issue: production#658 (closed)
Overall Migration Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4869
Project Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4772
Reasoning for all of the above: https://gitlab.com/gitlab-org/gitlab-ee/issues/8690
Service(s) affected: GitLab.com
Team attribution: Geo
Minutes downtime or degradation: varies

Impact & Metrics

What was the impact of the incident? Perceived Data Loss
Who was impacted by this incident? External Customers
How did the incident impact customers? Customers were prevented from reaching their data
How many customers were affected? An estimated 4453 projects were affected

Detection & Response

How was the incident detected? Via a customer report: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
Did alarming work as expected? No alerting exists for this type of work
How long did it take from the start of the incident to its detection? 2 days 2 hours
How long did it take from detection to remediation? 32 hours

Timeline

2018-01-19 00:48 - Migration start
2018-01-20 14:52 - Migration end
2018-01-21 02:40 - First customer-reported issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
2018-01-21 07:16 - First response from GitLab on customer issue
2018-01-21 13:01 - A script was run against all repositories detected as incorrectly configured, resolving a large number of them; 558 repos were reported as still needing repair
2018-01-22 10:36 - A script was run to fix the remaining repositories
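
As a cross-check, the detection and remediation intervals reported under Detection & Response can be recomputed from the timeline timestamps. The sketch below assumes the incident starts at migration start, is detected at the first customer report, and is remediated when the final repair script ran; the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def parse(timestamp: str) -> datetime:
    """Parse a timeline entry such as '2018-01-19 00:48'."""
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT)


def hours(delta: timedelta) -> float:
    """Express an interval in hours."""
    return delta.total_seconds() / 3600


# Timestamps copied from the timeline above.
migration_start = parse("2018-01-19 00:48")
first_customer_report = parse("2018-01-21 02:40")
final_repair_run = parse("2018-01-22 10:36")

time_to_detect = first_customer_report - migration_start
time_to_remediate = final_repair_run - first_customer_report

print(f"start to detection: {time_to_detect} ({hours(time_to_detect):.1f} h)")
print(f"detection to remediation: {time_to_remediate} ({hours(time_to_remediate):.1f} h)")
```

Rounded to the nearest hour, these intervals agree with the "2 days 2 hours" and "32 hours" figures reported above.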

Root Cause Analysis

The migration moves a project before validating it: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022#note_133198237
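
To illustrate the ordering problem, here is a minimal sketch of a validate-then-move step. It is not GitLab's implementation: the function names, the `ValidationError` type, and the "looks like a bare repository" check (a directory containing `HEAD` and `objects`) are all assumptions made for the example. The point is only that validation happens before anything is moved, and that a failed post-move check rolls the repository back to its original path.

```python
import shutil
from pathlib import Path


class ValidationError(Exception):
    """Raised when a repository fails a pre- or post-move check (hypothetical)."""


def looks_like_bare_repo(path: Path) -> bool:
    # Assumed, deliberately simple check: a bare Git repository
    # has a HEAD file and an objects directory at its top level.
    return path.is_dir() and (path / "HEAD").is_file() and (path / "objects").is_dir()


def migrate_repo(legacy_path: Path, target_path: Path) -> None:
    """Move a repository only after validating it; roll back on failure."""
    # 1. Validate BEFORE moving, so a broken source is never relocated.
    if not looks_like_bare_repo(legacy_path):
        raise ValidationError(f"source failed validation, not moved: {legacy_path}")
    if target_path.exists():
        raise ValidationError(f"target already exists, not moved: {target_path}")

    # 2. Move the repository to its new location.
    target_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(legacy_path), str(target_path))

    # 3. Validate again at the destination and undo the move if it fails,
    #    so the repository stays reachable at its old path.
    if not looks_like_bare_repo(target_path):
        shutil.move(str(target_path), str(legacy_path))
        raise ValidationError(f"target failed validation, rolled back: {target_path}")
```

With this ordering, a repository that fails validation stays at its legacy path and remains reachable, rather than being left half-migrated.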

What went well

7.6M projects were migrated successfully.
- 4.4k projects have a migrated repo, but the attachments aren't migrated yet.
- 3.7k projects are still on legacy storage.
0.05% error rate

What can be improved

The same work was being done on the staging environment at the same time. It should instead have been completed and validated on staging before starting in the production environment.
We should have known, or at least predicted, which errors are common during this type of migration. With that information we could better monitor Sentry for problems.
The time to correct all repos could be improved. To that end, we are going to modify our internal procedures to request that developers be available to help us when problems are detected.
Not all rake tasks provide a dry-run mode that lets us assess risk. When we plan work, we need to properly evaluate the risk, and know how to mitigate it, before executing the task.
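
As one possible shape for such a dry-run mode, the sketch below separates checking from acting. It is written in Python for illustration and does not reflect any existing rake task. `run_migration`, `MigrationReport`, and the `check` and `migrate` callables are hypothetical names. In dry-run mode (the default here) nothing is changed, and the report shows which projects would be migrated and which would be blocked, and why.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class MigrationReport:
    eligible: List[str] = field(default_factory=list)      # passed the check
    blocked: Dict[str, str] = field(default_factory=dict)  # project -> reason


def run_migration(
    projects: Iterable[str],
    check: Callable[[str], Optional[str]],
    migrate: Callable[[str], None],
    dry_run: bool = True,
) -> MigrationReport:
    """Check every project; only act on eligible ones when dry_run is False.

    `check` returns None when a project is safe to migrate, or a short
    description of the problem otherwise. Both callables are placeholders
    for whatever the real task would do.
    """
    report = MigrationReport()
    for project in projects:
        problem = check(project)
        if problem is not None:
            report.blocked[project] = problem
            continue
        report.eligible.append(project)
        if not dry_run:
            migrate(project)
    return report
```

Running the task once with `dry_run=True` and reviewing `report.blocked` gives an estimate of how many projects would fail, and for what reasons, before anything is moved in production.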