Migrate to Hashed Storage

Production Change - Criticality 4 C4

Migrate all repositories to hashed storage (aka storage_version: 2): https://docs.gitlab.com/ee/administration/raketasks/storage.html

Change Type: Repository Storage
Services Impacted: Storage
Change Team Members: @skarbek
Change Severity: C4
Buddy check or tested in staging: Nominate @dsylva && @aamarsanaa
Schedule of the change: 2019-01-18
Duration of the change: Unknown

Overview

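  • SSH into the console node
  • Set the environment variables that bound the project ID range (ID_FROM and ID_TO in the linked rake task documentation) as appropriate
  • Execute the rake task gitlab:storage:migrate_to_hashed
  • Validate that the migration is complete: gitlab:storage:legacy_projects should report 0 legacy projects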
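
The overview steps can be sketched as a small helper that assembles the two rake invocations for one project ID range. This is only an illustration under assumptions: the `gitlab-rake` wrapper name and the ID_FROM / ID_TO variable names follow the linked rake task documentation and should be checked against the console node, and the function names are hypothetical.

```python
def build_rake_command(task, env=None, rake_bin="gitlab-rake"):
    """Return an argv list for one rake invocation.

    Rake treats KEY=VALUE arguments as environment variables for the task.
    rake_bin is an assumption; adjust it to match the console node.
    """
    argv = [rake_bin, task]
    for key, value in (env or {}).items():
        argv.append(f"{key}={value}")
    return argv


def migration_commands(id_from, id_to):
    """Build the migrate command for one ID range, followed by the check."""
    migrate = build_rake_command(
        "gitlab:storage:migrate_to_hashed",
        {"ID_FROM": id_from, "ID_TO": id_to},  # names assumed from the docs
    )
    verify = build_rake_command("gitlab:storage:legacy_projects")
    return [migrate, verify]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    for argv in migration_commands(0, 1000):
        print(" ".join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; on the console node they would be run by hand or passed to `subprocess.run`.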

What to Monitor

  • HashedStorageWorker: https://dashboards.gitlab.net/d/000000124/sidekiq-workers?orgId=1&refresh=5s&var-worker=ProjectMigrateHashedStorageWorker%23perform&var-database=influxdb-01-inf-gprd
  • Sidekiq General: https://dashboards.gitlab.net/d/9GOIu9Siz/sidekiq-stats?orgId=1
  • Redis General: https://dashboards.gitlab.net/d/wccEP9Imk/redis?refresh=5m&orgId=1
  • PGBouncer Connections: https://dashboards.gitlab.net/d/000000285/pgbouncer-detail?orgId=1&var-environment=gprd&var-fqdn=patroni-04-db-gprd.c.gitlab-production.internal&var-user=gitlab&var-database=gitlabhq_production&var-prometheus=prometheus-01-inf-gprd
  • project_migrate_hashed_storage queue: https://gitlab.com/admin/background_jobs

Finer Detailed Plan of Action

  • During the steps below, if we are impacting production, throttle where appropriate by:
    • adjusting the number of workers on sidekiq
    • batching up fewer jobs
  • Set the range to 0 through 1000, then monitor the queue and the dashboards listed above to ensure production is not suffering
  • Set the range to 1000 through 4000, repeat
  • Set the range to 4000 through 16000, repeat
  • Set the range to 16000 through 256000, repeat
  • Keep increasing the range and batch size until no legacy storage items are left
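
The batching plan above can be sketched as a loop that enqueues one range at a time and waits for the queue to drain before moving on. This is a minimal sketch, not the procedure actually used: `enqueue` and `queue_size` are hypothetical callables standing in for the rake invocation and for a reading of the project_migrate_hashed_storage queue, and the threshold and polling interval are arbitrary.

```python
import time


def plan_ranges(boundaries):
    """Yield consecutive (start, end) pairs from a list of boundaries.

    [0, 1000, 4000] yields (0, 1000) and then (1000, 4000), mirroring the
    shared endpoints used in the plan above.
    """
    for start, end in zip(boundaries, boundaries[1:]):
        yield start, end


def run_plan(boundaries, enqueue, queue_size,
             max_queue=0, poll_seconds=30, sleep=time.sleep):
    """Enqueue each range, then wait until the queue is at or below max_queue.

    enqueue(start, end) and queue_size() are supplied by the caller
    (hypothetical hooks); sleep is injectable so the loop can be tested.
    """
    completed = []
    for start, end in plan_ranges(boundaries):
        enqueue(start, end)
        while queue_size() > max_queue:
            sleep(poll_seconds)
        completed.append((start, end))
    return completed
```

Throttling then amounts to passing a shorter list of boundaries (smaller batches) or stopping the loop while the sidekiq worker count is adjusted.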

Mitigation

  • A dedicated sidekiq node with 4 workers has been spun up to prevent overloading our existing sidekiq fleet: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4869#note_132212272
  • Code changes have been made to set the read_only flag as late as possible: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24128

Tested?

  • Staging is currently undergoing this migration
    • when batching 1 million projects at a time, the service appears to be running fine
    • we do not have the above-mentioned metrics available for viewing there

What is this

  • The on-disk paths of our repositories are changing from .../skarbek/test0 to ...<some_hash_value_based_on_our_code>
  • This is being done because hashed storage will be the only supported layout for file storage of repos in upcoming versions of GitLab: https://gitlab.com/gitlab-org/gitlab-ee/issues/8690
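
As a rough illustration of the new layout: GitLab's repository storage documentation describes the hashed path as built from the SHA-256 digest of the project ID, with the first two pairs of hex characters used as directory levels under `@hashed`. The sketch below follows that description; the function name is hypothetical, and the storage root prefix is left out because it depends on the deployment.

```python
import hashlib


def hashed_repo_path(project_id):
    """Return the relative hashed-storage path for a project ID.

    Follows the scheme described in the GitLab docs:
    @hashed/<first 2 hex chars>/<next 2 hex chars>/<full digest>.git
    """
    digest = hashlib.sha256(str(project_id).encode("utf-8")).hexdigest()
    return f"@hashed/{digest[0:2]}/{digest[2:4]}/{digest}.git"


if __name__ == "__main__":
    print(hashed_repo_path(1))
```

Because the path depends only on the project ID, renaming or moving a project no longer changes where its repository lives on disk.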

Something to be careful of

  • UI Issue: https://gitlab.com/gitlab-org/gitlab-ee/issues/9268
Edited Jan 18, 2019 by John Skarbek