Hashed Storage rollout on GitLab.com

Following https://gitlab.com/gitlab-com/infrastructure/issues/4174, and as a step towards gitlab-org&75 (closed), I discussed some additional steps with @stanhu for an initial rollout on GitLab.com:

  • First week: Start migrating some of our own projects (gitlab-com and/or gitlab-org) #4785 (closed)
  • Second week #4866 (closed):
    • Flip the feature toggle to disable the migration when renaming/moving a project
    • Enable Hashed Storage for new projects only, for a short period of time (~2 hours)
    • Disable Hashed Storage for new projects
    • Flip the feature toggle back to the initial state
  • Third week #4867 (closed):
    • Enable Hashed Storage for new projects and the automatic migration when renaming/moving a project
    • Monitor for any issues with the renaming/moving
    • If anything pops up with the renaming/moving migration, flip the feature toggle back, but keep Hashed Storage for new projects enabled
  • Fourth week #4868 (closed): Migrate all our gitlab-com and gitlab-org projects
  • Fifth week and beyond #4869 (closed): Start migrating users' repositories (batches of 1K? per day, or all repositories from the same storage, etc.)
  • Resolve any failures #6001
  • Resolve any failures from #935

Because we've introduced a feature to also migrate hashed storage when renaming/moving projects in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/19747 we may want to reduce the risk and introduce a feature toggle to disable the behavior first (this will be done in a separate issue, I will discuss this with @vsizov - https://gitlab.com/gitlab-org/gitlab-ce/issues/50345).
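For reference, a minimal sketch of what the week-two/three switches could look like from a Rails console. The toggle name and its polarity are assumptions (the actual flag is being defined in https://gitlab.com/gitlab-org/gitlab-ce/issues/50345), and this assumes Hashed Storage for new projects is controlled by the `hashed_storage_enabled` application setting:

```ruby
# Run from a Rails console (`gitlab-rails console`).

# Assumed flag name/polarity: enabling it skips the hashed storage
# migration when a project is renamed/moved.
Feature.enable(:skip_hashed_storage_upgrade)   # pause migration on rename/move
Feature.disable(:skip_hashed_storage_upgrade)  # restore the default behaviour

# Assumed setting controlling Hashed Storage for newly created projects.
settings = Gitlab::CurrentSettings.current_application_settings
settings.update!(hashed_storage_enabled: true)   # new projects use hashed storage
settings.update!(hashed_storage_enabled: false)  # new projects back on legacy storage
```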


After finalizing the whole migration on GitLab.com, this is a list of what we learned and the corner cases we found:

  • When a precondition failed, we had no visibility into what failed or why
    • That required extra debugging and allowed us to improve the code.
    • The failures were either due to corner cases or bugs in other parts of the codebase we were relying on.
  • The rake task did not include all the pagination params SRE wanted in order to throttle the execution
    • Initially SRE wrote custom scripts to trigger a throttled number of projects, to keep GitLab.com load under control
    • We improved our rake task by adding the extra params it was lacking, but most of the initial batch was still triggered via custom bash scripts calling our rake tasks and paginating (see the scheduling sketch after this list)
  • Files in object storage are already hashed, but we were still listing them as candidates
    • The code did not corrupt or mess up anything in object storage, so the only side effect was some noise
    • The queries were fixed so those files were removed from the listing and from the candidates being triggered
    • We also added a precondition check that simply ignores any such attempt to schedule a migration
  • The initial goal was to make it fast (that is, be optimistic about the state of the environment, but stop whenever we were not 100% confident, to prevent data loss)
    • This was necessary at GitLab.com's scale. The conservative approach helped us achieve zero data loss
    • Because of the conservative approach, it required multiple iterations to find and fix corner cases
  • We found that some previous attempts left behind an empty folder (with the CarrierWave folder structure, but no files inside)
    • We now consider that OK and still migrate over it
    • On GitLab.com, we found a few cases where, due to an early attempt, the tmp folder was not empty, so those required manual intervention before the migration could be retried
  • We lacked instrumentation; logging was not good from the beginning
    • We've improved the amount of information we log
    • That allowed us to follow the migration by watching the specific worker class in our Logstash instance
  • Because of the speed concerns, migrations happened in Sidekiq, and scheduling them was also triggered by Sidekiq (see the fan-out sketch after this list). This is hard to debug; ideally we should have a background/foreground mode
    • I've created a proposal to build an internal framework, based on the lessons we learned here, that could help us with future data migrations: gitlab-org/gitlab#34427 (closed)
  • Permission problems (incorrect permissions on disk, in a few of the last remaining projects)
    • This was due to previous SRE support work, which had nothing to do with hashed storage but impacted the script
    • Permissions had to be fixed manually
  • SQL timeouts (during the first attempts we had enough database performance, but that degraded as GitLab.com grew and we reduced the timeout limit...)
    • We had to add partial indexes to get the speed back (see the index sketch after this list)
    • The indexes only track the remaining legacy storage projects, so after migrating them all the indexes will have zero cost
  • Repository reference counters were inconsistent in some repositories (probably due to an earlier bug)
    • We had to manually reset the counter for a few projects
    • There was no code available to do that, so that was added in another MR
  • We found projects in legacy storage that were pending delete but never actually removed (created: gitlab-org/gitlab#210031)
    • Until we have a more permanent fix, the solution is to re-schedule their removal manually (see the sketch after this list)
    • There is documentation on how to do it here: https://gitlab.com/gitlab-org/gitlab/blob/a67ad6249dc784f328ce23d77bd7ae1e8ebe57b5/doc/administration/troubleshooting/gitlab_rails_cheat_sheet.md#L193-232
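Below are a few hedged sketches, written as Ruby for a Rails console, illustrating some of the points above. First, the throttled scheduling: the batch size, pause, and worker class/signature are assumptions, not the actual bash scripts SRE ran:

```ruby
# Sketch only: schedule legacy projects (storage_version 0 or NULL) in
# small batches and pause between them to keep load under control.
batch_size = 1_000

Project.where("storage_version IS NULL OR storage_version = 0")
       .in_batches(of: batch_size) do |batch|
  ids = batch.pluck(:id)
  # Assumed scheduling worker; the real rollout wrapped rake tasks in bash.
  HashedStorage::MigratorWorker.perform_async(ids.min, ids.max)
  sleep 60 # throttle between batches
end
```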
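The Sidekiq-in-Sidekiq fan-out mentioned above has roughly this shape; both class names are illustrative placeholders (assuming the GitLab codebase's ApplicationWorker concern), not the actual workers:

```ruby
# Placeholder classes showing the shape of the scheduling: a scheduler job
# enqueues one migration job per project, so both the scheduling and the
# migration itself run asynchronously in Sidekiq, which is hard to debug.
class StorageMigrationSchedulerWorker
  include ApplicationWorker

  def perform(start_id, end_id)
    Project.where(id: start_id..end_id)
           .where("storage_version IS NULL OR storage_version = 0")
           .find_each { |project| StorageMigrationWorker.perform_async(project.id) }
  end
end

class StorageMigrationWorker
  include ApplicationWorker

  def perform(project_id)
    # Performs the actual legacy -> hashed migration for a single project.
  end
end
```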
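The partial indexes were added through a regular database migration; a sketch of the shape follows (the class name is hypothetical, and the exact predicate and index name may differ from what was deployed):

```ruby
# Sketch of a migration adding a partial index covering only projects that
# are still on legacy storage (storage_version 0 or NULL), so it shrinks to
# nearly nothing once everything is migrated.
class AddPartialIndexForLegacyStorageProjects < ActiveRecord::Migration[5.2]
  include Gitlab::Database::MigrationHelpers

  INDEX_NAME = 'index_projects_on_id_partial_for_legacy_storage'

  disable_ddl_transaction!

  def up
    add_concurrent_index :projects, :id,
      where: 'storage_version < 2 OR storage_version IS NULL',
      name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :projects, INDEX_NAME
  end
end
```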
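And for the pending-delete projects, the manual re-scheduling is roughly the following (broadly in line with the linked cheat sheet; the admin username is a placeholder, and each project should be double-checked before destroying anything):

```ruby
# Re-run destruction for projects stuck in pending_delete.
admin = User.find_by_username('an-admin-username') # placeholder

Project.where(pending_delete: true).find_each do |project|
  puts "Re-scheduling removal of #{project.full_path} (id: #{project.id})"
  ::Projects::DestroyService.new(project, admin, {}).execute
end
```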