
Migrate hdd_migration-tagged project to HDD shards

Production Change

Change Summary

Setting up and enabling a CI schedule (in ops) that migrates tagged projects (#3393 (closed)) to Gitaly HDD shards. This is a slow-running migration that's expected to run for at least 3 weeks non-stop.

Related to &369 (closed).

Change Details

  1. Services Impacted - ServiceAPI ServiceSidekiq ServiceGitaly
  2. Change Technician - @ahmadsherif
  3. Change Criticality - C2
  4. Change Type - changescheduled
  5. Change Reviewer - @alejandro
  6. Due Date - 2020-02-02 14:00 UTC
  7. Time tracking - 21 minutes for manual actions, at least 3 weeks for the actual migration
  8. Downtime Component - N/A

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1 minute

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10 minutes

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Deactivate the HDD migration [production] schedule from https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules
  • Cancel any running pipelines in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines
  • Run ssh console-01-sv-gprd.c.gitlab-production.internal on your workstation
  • Launch a Rails console by running sudo gitlab-rails c
  • In the Rails console, run:
    # Write the IDs of tagged projects that sit on one of the eight HDD shards, one ID per line.
    File.open('/tmp/migrated-to-hdd-projects', 'w') do |f|
      (1..8).each do |i|
        Project.joins(:custom_attributes).
          where(repository_storage: "nfs-file-hdd0#{i}",
                project_custom_attributes: { key: 'hdd_migration', value: 'scheduled' }).
          pluck(:id).
          each { |id| f.puts(id) }
      end
    end
  • Close the Rails console and enter a tmux session by running tmux
  • Export your admin token by running export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN=token
  • Run:
    for id in $(cat /tmp/migrated-to-hdd-projects); do
      curl --verbose --silent --compressed --request POST \
        "https://gitlab.com/api/v4/projects/${id}/repository_storage_moves" \
        --header "content-type: application/json" \
        --header "private-token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
        --data '{}'
    done
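
Before running the curl loop above, it may help to do a dry run over the ID file. The following is a minimal sketch in plain Ruby (no Rails needed), not part of the original change plan: the helper names storage_move_url, parse_project_ids, and dry_run_urls are hypothetical, and the sample IDs 101 and 102 are made up for illustration. It only validates the file contents and lists the endpoints the loop would POST to; it sends no requests.

```ruby
# Dry-run sketch: validate the ID file and list the requests the curl loop would send.
# All helper names here are hypothetical; only the endpoint shape comes from the steps above.

API_BASE = 'https://gitlab.com/api/v4'

# Builds the storage-move endpoint for one project ID, mirroring the curl command.
def storage_move_url(project_id)
  "#{API_BASE}/projects/#{project_id}/repository_storage_moves"
end

# Parses file contents (one ID per line), skipping blank lines.
# Raises ArgumentError on any line that is not a positive integer.
def parse_project_ids(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    unless line.match?(/\A[1-9]\d*\z/)
      raise ArgumentError, "unexpected line in ID file: #{line.inspect}"
    end
    Integer(line, 10)
  end
end

# Returns the list of endpoints, dropping repeated IDs in case the file contains any.
def dry_run_urls(text)
  parse_project_ids(text).uniq.map { |id| storage_move_url(id) }
end

# Demo with made-up IDs; on the console host, pass
# File.read('/tmp/migrated-to-hdd-projects') instead of the sample string.
puts dry_run_urls("101\n102\n")
```

If the file contains anything other than numeric IDs, parse_project_ids raises before any URL is printed, which is the point at which to stop and inspect the file rather than start the loop.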

Monitoring

Key metrics to observe

  • Log: Rails logs
    • Location: URL
    • What changes to this metric should prompt a rollback: A high rate of 5xx or 4xx responses. This shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed
  • Log: Gitaly logs
    • Location: URL
    • What changes to this metric should prompt a rollback: A high rate of non-OK codes. This shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed
  • Metric: Filesystem available ratio
    • Location: URL
    • What changes to this metric should prompt a rollback: N/A
    • Graphs should trend down over time
  • Metric: Gitaly service error ratio
    • Location: URL
    • What changes to this metric should prompt a rollback: Graph trending upwards or constant spikes
  • Metric: Gitaly service apdex
    • Location: URL
    • What changes to this metric should prompt a rollback: Graph trending downwards or constant spikes
  • Dashboard: sidekiq: Shard Detail
    • Location: URL
    • What changes to this metric should prompt a rollback: N/A

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Ahmad Sherif