Migrate hdd_migration-tagged projects to HDD shards
Production Change
## Change Summary
Setting up and enabling a CI schedule (in ops) that migrates tagged projects (#3393 (closed)) to Gitaly HDD shards. This is a slow-running migration that's expected to run for at least 3 weeks non-stop.
Related to &369 (closed).
## Change Details
- Services Impacted - ServiceAPI ServiceSidekiq ServiceGitaly
- Change Technician - @ahmadsherif
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @alejandro
- Due Date - 2020-02-02 14:00 UTC
- Time tracking - 21 minutes for manual actions, at least 3 weeks for the actual migration
- Downtime Component - N/A
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
1. Merge gitaly-shard-allocator!1 (merged).
2. Make sure https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator has been mirrored successfully.
3. With an admin account, create a private token with the `api` scope (a quick sanity check for the token is sketched after this list).
4. Create a CI schedule in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules/new with the following fields:
   - Description: `HDD migration [production]`
   - Interval Pattern: `*/5 * * * *`
   - Cron Timezone: `UTC`
   - Target Branch: `master`
   - Variables:
     - `CUSTOM_ATTR_KEY`: `hdd_migration`
     - `CUSTOM_ATTR_VALUE`: `pending`
     - `PROJECTS_PER_PAGE`: `100`
     - `TOTAL_PROJECTS`: `1500`
     - `DEST_SHARD_CUTOFF`: `0.20`
     - `ENVIRONMENT`: `gprd`
     - `GITLAB_HOST`: `gitlab.com`
     - `GITLAB_ADMIN_TOKEN`: fill in the value obtained from the previous step
     - `HDD_MIGRATION`: `1`
   - Active: unchecked
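Before creating the schedule, it can help to confirm the token works and that the custom-attribute filter described by `CUSTOM_ATTR_KEY`/`CUSTOM_ATTR_VALUE` actually matches projects. This is a minimal sketch, assuming the token is exported locally and that admin-only custom-attribute filtering on `/projects` is what the allocator relies on; the `jq` projections are illustrative only:

```shell
# Hypothetical sanity check for the admin token created in step 3.
export GITLAB_ADMIN_TOKEN=<token>

# The token should authenticate as an admin user.
curl --silent --header "private-token: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/user" | jq '{username, is_admin}'

# Roughly what CUSTOM_ATTR_KEY / CUSTOM_ATTR_VALUE / PROJECTS_PER_PAGE describe:
# projects tagged hdd_migration=pending. --globoff keeps curl from treating the
# brackets in the query string as URL globs.
curl --silent --globoff --header "private-token: ${GITLAB_ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/projects?custom_attributes[hdd_migration]=pending&per_page=100" \
  | jq 'length'
```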
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 minute
1. Activate the `HDD migration [production]` schedule from https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules (an API-based alternative is sketched below).
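Activation is a one-click toggle in the UI; if a terminal-only path is preferred, the pipeline schedules API can flip the same flag. This is only a sketch: `OPS_API_TOKEN` and `SCHEDULE_ID` are placeholders for a token on the ops instance and the ID shown on the schedules page.

```shell
# Hypothetical alternative to toggling "Active" in the UI.
curl --silent --request PUT \
  --header "private-token: ${OPS_API_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/gitlab-com%2Fgl-infra%2Fgitaly-shard-allocator/pipeline_schedules/${SCHEDULE_ID}?active=true"
```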
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
1. Make sure all the jobs from the `prepare` stage are passing in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
2. Make sure no errors are logged in the jobs from the `migrate` stage in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
3. Log in to the production PG archive replica: `ssh postgres-dr-archive-01-db-gprd.c.gitlab-production.internal`
4. Open a psql console: `sudo gitlab-psql`
5. Execute the query below, replacing `XX` with each of `02`..`08`; one or more of the results should be greater than zero (a single-query variant covering all shards is sketched after this list).

   ```sql
   SELECT COUNT("projects".*)
   FROM "projects"
   INNER JOIN "project_custom_attributes" ca ON ca.project_id = projects.id
   WHERE "projects"."repository_storage" = 'nfs-file-hddXX'
     AND ca.key = 'hdd_migration'
     AND ca.value = 'scheduled';
   ```
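The per-shard counts can also be collected in one pass, grouped by storage. A sketch, assuming `gitlab-psql` passes `-c` through to `psql` on the archive replica:

```shell
# Hypothetical one-shot variant of the verification query: counts scheduled
# hdd_migration projects per HDD shard in a single statement.
sudo gitlab-psql -c "
SELECT projects.repository_storage, COUNT(*)
FROM projects
INNER JOIN project_custom_attributes ca ON ca.project_id = projects.id
WHERE projects.repository_storage LIKE 'nfs-file-hdd0%'
  AND ca.key = 'hdd_migration'
  AND ca.value = 'scheduled'
GROUP BY projects.repository_storage
ORDER BY projects.repository_storage;
"
```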
## Rollback
### Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
1. Deactivate the `HDD migration [production]` schedule from https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipeline_schedules.
2. Cancel any running pipelines in https://ops.gitlab.net/gitlab-com/gl-infra/gitaly-shard-allocator/-/pipelines.
3. Run `ssh console-01-sv-gprd.c.gitlab-production.internal` on your workstation.
4. Launch a Rails console by running `sudo gitlab-rails c`.
5. In the Rails console, run:

   ```ruby
   # Dump the IDs of all projects that were scheduled for migration to the HDD shards.
   File.open('/tmp/migrated-to-hdd-projects', 'w') do |f|
     (1..8).each do |i|
       Project.joins(:custom_attributes)
              .where(repository_storage: "nfs-file-hdd0#{i}",
                     project_custom_attributes: { key: 'hdd_migration', value: 'scheduled' })
              .pluck(:id)
              .each { |id| f.write("#{id}\n") }
     end
   end
   ```

6. Close the Rails console and enter a tmux session by running `tmux`.
7. Export your admin token by running `export GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN=token`.
8. Schedule a storage move for each affected project (an optional status check is sketched after this list):

   ```shell
   for id in `cat /tmp/migrated-to-hdd-projects`; do
     curl --verbose --silent --compressed --request POST \
       "https://gitlab.com/api/v4/projects/${id}/repository_storage_moves" \
       --header "content-type: application/json" \
       --header "private-token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
       --data '{}'
   done
   ```
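To confirm the move requests were accepted before closing out the rollback, the same API exposes a per-project listing endpoint. A minimal sketch; the response fields and the `jq` projection are assumptions, not verified here:

```shell
# Hypothetical status check for the moves scheduled above.
for id in `cat /tmp/migrated-to-hdd-projects`; do
  curl --silent \
    --header "private-token: ${GITLAB_GPRD_ADMIN_API_PRIVATE_TOKEN}" \
    "https://gitlab.com/api/v4/projects/${id}/repository_storage_moves" \
    | jq -r --arg id "$id" '.[] | "\($id) \(.state) \(.destination_storage_name // "n/a")"'
done
```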
## Monitoring

### Key metrics to observe
- Log: Rails logs
  - Location: URL
  - What changes to this metric should prompt a rollback: A high rate of 50x or 40x responses shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed.
- Log: Gitaly logs
  - Location: URL
  - What changes to this metric should prompt a rollback: A high rate of non-`OK` codes shouldn't trigger a full rollback, but deactivating the schedule is preferable until an investigation is performed.
- Metric: Filesystem available ratio
  - Location: URL
  - What changes to this metric should prompt a rollback: N/A; graphs should trend down over time.
- Metric: Gitaly service error ratio
  - Location: URL
  - What changes to this metric should prompt a rollback: Graph trending upwards or constant spikes.
- Metric: Gitaly service apdex
  - Location: URL
  - What changes to this metric should prompt a rollback: Graph trending downwards or constant spikes.
- Dashboard: sidekiq: Shard Detail
  - Location: URL
  - What changes to this metric should prompt a rollback: N/A
## Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
## Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- There are currently no active incidents.