Migrate hdd_migration-tagged projects to HDD shards
Production Change
## Change Summary
Setting up and enabling a CI schedule (in ops) that migrates tagged projects (#3393 (closed)) to Gitaly HDD shards. This is a slow-running migration that's expected to run for at least 3 weeks non-stop.
Related to &369 (closed).
## Change Details
- Services Impacted - ServiceAPI ServiceSidekiq ServiceGitaly
- Change Technician - @ahmadsherif
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @alejandro
- Due Date - 2020-02-02 14:00 UTC
- Time tracking - 21 minutes for manual actions, at least 3 weeks for the actual migration
- Downtime Component - N/A
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
1. Merge gitaly-shard-allocator!1 (merged).
2. Make sure https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator has been mirrored successfully.
3. With an admin account, create a private token with the `api` scope (a quick sanity check for the token is sketched after this list).
4. Create a CI schedule in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules/new with the following fields:
   - Description: `HDD migration [production]`
   - Interval Pattern: `*/5 * * * *`
   - Cron Timezone: `UTC`
   - Target Branch: `master`
   - Variables:
     - `CUSTOM_ATTR_KEY`: `hdd_migration`
     - `CUSTOM_ATTR_VALUE`: `pending`
     - `PROJECTS_PER_PAGE`: `100`
     - `TOTAL_PROJECTS`: `1500`
     - `DEST_SHARD_CUTOFF`: `0.20`
     - `ENVIRONMENT`: `gprd`
     - `GITLAB_HOST`: `gitlab.com`
     - `GITLAB_ADMIN_TOKEN`: fill in the value obtained from the previous step
     - `HDD_MIGRATION`: `1`
   - Active: unchecked
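Before creating the schedule, it can help to confirm the token works and that the custom-attribute filter described by `CUSTOM_ATTR_KEY`/`CUSTOM_ATTR_VALUE` actually matches projects. This is a minimal sketch, assuming the token is exported locally and that admin-only custom-attribute filtering on `/projects` is what the allocator relies on; the `jq` projections are illustrative only:

```shell
# Hypothetical sanity check for the admin token created in step 3.
export GITLAB_ADMIN_TOKEN=<token>

# The token should authenticate as an admin user.
curl --silent --header "private-token: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/user" | jq '{username, is_admin}'

# Roughly what CUSTOM_ATTR_KEY / CUSTOM_ATTR_VALUE / PROJECTS_PER_PAGE describe:
# projects tagged hdd_migration=pending. --globoff keeps curl from treating the
# brackets in the query string as URL globs.
curl --silent --globoff --header "private-token: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/projects?custom_attributes[hdd_migration]=pending&per_page=100" \
  | jq 'length'
```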
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 minute
1. Activate the `HDD migration [production]` schedule from https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules (an API-based alternative is sketched below).
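Activation is a one-click toggle in the UI; if a terminal-only path is preferred, the pipeline schedules API can flip the same flag. This is only a sketch: `OPS_API_TOKEN` and `SCHEDULE_ID` are placeholders for a token on the ops instance and the ID shown on the schedules page.

```shell
# Hypothetical alternative to toggling "Active" in the UI.
curl --silent --request PUT \
  --header "private-token: ${OPS_API_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fgitaly-shard-allocator/pipeline_schedules/${SCHEDULE_ID}?active=true"
```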
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
1. Make sure all the jobs from the `prepare` stage are passing in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
2. Make sure no errors are logged in the jobs from the `migrate` stage in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
3. Log in to the production PG archive replica: `ssh postgres-dr-archive-01-db-gprd.c.gitlab-production.internal`
4. Open a psql console: `sudo gitlab-psql`
5. Execute the query below, replacing `XX` with each of `02`..`08`; one or more of the results should be greater than zero (a single-query variant covering all shards is sketched after this list).

   ```sql
   SELECT COUNT("projects".*)
   FROM "projects"
   INNER JOIN "project_custom_attributes" ca ON ca.project_id = projects.id
   WHERE "projects"."repository_storage" = 'nfs-file-hddXX'
     AND ca.key = 'hdd_migration'
     AND ca.value = 'scheduled';
   ```
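The per-shard counts can also be collected in one pass, grouped by storage. A sketch, assuming `gitlab-psql` passes `-c` through to `psql` on the archive replica:

```shell
# Hypothetical one-shot variant of the verification query: counts scheduled
# hdd_migration projects per HDD shard in a single statement.
sudo gitlab-psql -c "
SELECT projects.repository_storage, COUNT(*)
FROM projects
INNER JOIN project_custom_attributes ca ON ca.project_id = projects.id
WHERE projects.repository_storage LIKE 'nfs-file-hdd0%'
  AND ca.key = 'hdd_migration'
  AND ca.value = 'scheduled'
GROUP BY projects.repository_storage
ORDER BY projects.repository_storage;
"
```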
## Rollback
### Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
1. Deactivate the `HDD migration [production]` schedule from https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules.
2. Cancel any running pipelines in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
3. Run `ssh console-01-sv-gprd.c.gitlab-production.internal` on your workstation.
4. Launch a Rails console by running `sudo gitlab-rails c`.
5. In the Rails console, run:

   ```ruby
   # Dump the IDs of all projects that were scheduled for migration to the HDD shards.
   File.open('/tmp/migrated-to-hdd-projects', 'w') do |f|
     (1..8).each do |i|
       Project.joins(:custom_attributes)
              .where(repository_storage: "nfs-file-hdd0#{i}",
                     project_custom_attributes: { key: 'hdd_migration', value: 'scheduled' })
              .pluck(:id)
              .each { |id| f.write("#{id}\n") }
     end
   end
   ```

6. Close the Rails console and enter a tmux session by running `tmux`.
7. Export your admin token by running `export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN=token`.
8. Schedule a storage move for each affected project (an optional status check is sketched after this list):

   ```shell
   for id in `cat /tmp/migrated-to-hdd-projects`; do
     curl --verbose --silent --compressed --request POST \
       "https://gitlab.com/api/v4/projects/${id}/repository_storage_moves" \
       --header "content-type: application/json" \
       --header "private-token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
       --data '{}'
   done
   ```
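To confirm the move requests were accepted before closing out the rollback, the same API exposes a per-project listing endpoint. A minimal sketch; the response fields and the `jq` projection are assumptions, not verified here:

```shell
# Hypothetical status check for the moves scheduled above.
for id in `cat /tmp/migrated-to-hdd-projects`; do
  curl --silent \
    --header "private-token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
    "https://gitlab.com/api/v4/projects/${id}/repository_storage_moves" \
    | jq -r --arg id "$id" '.[] | "\($id) \(.state) \(.destination_storage_name // "n/a")"'
done
```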
## Monitoring

### Key metrics to observe
- Log: Rails logs
  - Location: URL
  - What changes to this metric should prompt a rollback: A high rate of 50x or 40x responses shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed.
- Log: Gitaly logs
  - Location: URL
  - What changes to this metric should prompt a rollback: A high rate of non-`OK` codes shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed.
- Metric: Filesystem available ratio
  - Location: URL
  - What changes to this metric should prompt a rollback: N/A; graphs should trend down over time.
- Metric: Gitaly service error ratio
  - Location: URL
  - What changes to this metric should prompt a rollback: Graph trending upwards or constant spikes.
- Metric: Gitaly service apdex
  - Location: URL
  - What changes to this metric should prompt a rollback: Graph trending downwards or constant spikes.
- Dashboard: sidekiq: Shard Detail
  - Location: URL
  - What changes to this metric should prompt a rollback: N/A
## Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
## Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- There are currently no active incidents.