RCA: Hashed Storage Migration left some repos unreachable

Summary

During planned maintenance, the migration of our project repositories to a different on-disk file path format (hashed storage) did not complete successfully for some repositories, leaving them unreachable.

  • Maintenance Change Issue: production#658 (closed)
  • Overall Migration Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4869
  • Project Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4772
  • Reasoning for all of the above: https://gitlab.com/gitlab-org/gitlab-ee/issues/8690
  • Service(s) affected: GitLab.com
  • Team attribution: Geo
  • Minutes downtime or degradation: varies

Impact & Metrics

  • What was the impact of the incident? Perceived Data Loss
  • Who was impacted by this incident? External Customers
  • How did the incident impact customers? It prevented customers from reaching their data
  • How many customers were affected? Estimated to impact 4453 projects

Detection & Response

  • How was the incident detected? Via a customer report: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
  • Did alarming work as expected? No alerting exists for this type of work
  • How long did it take from the start of the incident to its detection? 2 days 2 hours
  • How long did it take from detection to remediation? 32 hours

Timeline

  • 2018-01-19 00:48 - Migration start
  • 2018-01-20 14:52 - Migration end
  • 2018-01-21 02:40 - First customer reported issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022
  • 2018-01-21 07:16 - First response from GitLab on customer issue
  • 2018-01-21 13:01 - A script was run against all repositories detected as incorrectly configured, resolving a large number of them; 558 repos were reported as still needing repair
  • 2018-01-22 10:36 - A script was run to fix the remaining repositories
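
The detection and remediation figures quoted under Detection & Response can be cross-checked against the timestamps above. The snippet below is only a minimal sketch of that arithmetic; the variable names are illustrative, and it assumes all timestamps share one time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline (assumed to share a single time zone).
migration_start = datetime.strptime("2018-01-19 00:48", FMT)
first_report = datetime.strptime("2018-01-21 02:40", FMT)
final_fix = datetime.strptime("2018-01-22 10:36", FMT)

time_to_detect = first_report - migration_start
time_to_remediate = final_fix - first_report

print(time_to_detect)     # 2 days, 1:52:00
print(time_to_remediate)  # 1 day, 7:56:00
```

Rounded, these match the reported "2 days 2 hours" and "32 hours".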

Root Cause Analysis

The migration moves a project before validating it: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6022#note_133198237
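
To illustrate the ordering problem, here is a minimal sketch of a "validate first, move second, verify third" flow. It is not the GitLab implementation (which is Ruby); `looks_like_bare_repo` and `migrate` are hypothetical helpers, and the `@hashed/<aa>/<bb>/<sha256>.git` layout reflects our understanding of the hashed storage format rather than anything stated in this RCA.

```python
import hashlib
import shutil
from pathlib import Path


def hashed_relative_path(project_id: int) -> Path:
    """Assumed hashed-storage layout: @hashed/<aa>/<bb>/<sha256 of project id>.git"""
    digest = hashlib.sha256(str(project_id).encode()).hexdigest()
    return Path("@hashed") / digest[:2] / digest[2:4] / f"{digest}.git"


def looks_like_bare_repo(path: Path) -> bool:
    """Hypothetical validation: a bare Git repo has HEAD, objects/ and refs/."""
    return (
        (path / "HEAD").is_file()
        and (path / "objects").is_dir()
        and (path / "refs").is_dir()
    )


def migrate(root: Path, legacy_rel: str, project_id: int) -> bool:
    """Return True if the repository was moved, False if it was left in place."""
    source = root / legacy_rel
    target = root / hashed_relative_path(project_id)

    # Validate BEFORE touching anything on disk.
    if not looks_like_bare_repo(source) or target.exists():
        return False  # project stays reachable on legacy storage

    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))

    # Verify the result and roll back if the move produced a broken repo.
    if not looks_like_bare_repo(target):
        shutil.move(str(target), str(source))
        return False
    return True
```

With this ordering, a project that fails validation is never moved, so it remains reachable at its legacy path instead of ending up half-migrated.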

What went well

  • 7.6M projects were successfully migrated.
    • 4.4k projects have a migrated repo, but the attachments aren't migrated yet.
    • 3.7k projects are still on legacy storage.
  • 0.05% error rate

What can be improved

  • The same work was being done on the staging environment at the same time. The staging work should instead have been completed and validated before starting in the production environment.
  • We should have known, or at least predicted, which errors are common during this type of migration. With this information we could have monitored Sentry for problems more effectively.
  • The time taken to correct all repos could be improved. We are going to modify our internal procedures to request that developers be available to help us when problems are detected.
  • Not all rake tasks provide a dry-run mode that lets us determine risk. When we plan work, we need to ensure that we properly evaluate the risk, and know how to mitigate it, before executing the task.
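
As an illustration of the dry-run point, the sketch below separates planning from execution so that the risk can be inspected before anything changes. It is a hypothetical outline, not an existing rake task: `MigrationPlan`, `plan_migration`, `run`, and the `validate` / `migrate_one` callbacks are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationPlan:
    to_migrate: list = field(default_factory=list)
    skipped: dict = field(default_factory=dict)  # project id -> reason


def plan_migration(projects, validate):
    """Classify every project without changing anything."""
    plan = MigrationPlan()
    for project in projects:
        reason = validate(project)  # None means "safe to migrate"
        if reason is None:
            plan.to_migrate.append(project["id"])
        else:
            plan.skipped[project["id"]] = reason
    return plan


def run(projects, validate, migrate_one, dry_run=True):
    """Always build the plan; only execute it when dry_run is False."""
    plan = plan_migration(projects, validate)
    if not dry_run:
        for project_id in plan.to_migrate:
            migrate_one(project_id)
    return plan
```

Defaulting `dry_run` to `True` means an operator has to opt in to the destructive path, and the returned plan gives a count of projects that would be skipped, and why, before production is touched.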

Corrective actions

  • https://gitlab.com/gitlab-org/gitlab-ce/issues/56618
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6001
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6091
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6092
  • https://gitlab.com/gitlab-org/gitlab-ee/issues/9414

Thanks

  • Big thanks to @stanhu @brodock and @toon for hopping on this

Guidelines

  • Blameless RCA Guideline
  • 5 whys
Edited Jan 25, 2019 by John Skarbek