Enable Gitaly scheduled maintenance on all production storages
Production Change
Change Summary
Now that we have run daily maintenance in production on a single storage without issue (#2661 (closed)), it is time to enable it by default across the fleet. This will allow us to better understand how the maintenance strategy affects performance across a diverse set of storages.
Follow-up to #2661 (closed)
Change Details
- Services Impacted - Gitaly
- Change Technician - @pokstad1
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @alejandro
- Due Date - 2021-05-21 7:30 UTC
- Time tracking - 40 minutes
- Downtime Component - N/A; no downtime should be needed, since Gitaly restarts gracefully
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- Pick a low-traffic time of day to run the maintenance task (e.g. Saturday at 10:30pm for 4 hours); see the sketch below for how this maps to Gitaly's configuration.
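For reference, a minimal sketch of how to inspect the window Gitaly ends up with, assuming the Omnibus default config path (the actual values are rendered by the chef-repo cookbooks; the storage name and times below are illustrative):

```shell
# Inspect the rendered maintenance window on a node (Omnibus default path):
sudo grep -A 4 '\[daily_maintenance\]' /var/opt/gitlab/gitaly/config.toml

# Illustrative output for a 10:30pm, 4-hour window. Note the stanza has no
# day-of-week field, so the window recurs daily at this time:
#   [daily_maintenance]
#   start_hour = 22
#   start_minute = 30
#   duration = "4h"
#   storages = ["default"]
```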
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Enable maintenance on cny: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/65
- Run chef-client on the target nodes: `knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"`
- Enable maintenance on the whole fleet: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/84
- Run chef-client on the target nodes: `knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"` (a spot-check sketch follows this list)
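After each convergence, a hedged spot-check that the configuration landed, reusing the same knife targeting as above (the config path assumes Omnibus defaults):

```shell
# Count [daily_maintenance] stanzas per node; nodes printing 0 (or failing)
# have not picked up the change yet:
knife ssh roles:gprd-base-stor-gitaly-common \
  "sudo grep -c '\[daily_maintenance\]' /var/opt/gitlab/gitaly/config.toml"
```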
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Verify that the non-canary production Gitaly servers print a log message with the correct maintenance window (e.g. `level=info msg="maintenance: daily scheduled" scheduled="2020-07-29 23:04:00 -0700 PDT"`); one way to check is sketched below.
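One way to run this check across the fleet, assuming Gitaly's runit log location under Omnibus (adjust the path if logs are shipped elsewhere):

```shell
# Print the most recent scheduling message from each node:
knife ssh roles:gprd-base-stor-gitaly-common \
  "sudo grep 'maintenance: daily scheduled' /var/log/gitlab/gitaly/current | tail -n 1"
```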
Rollback
Rollback steps - steps to be taken in the event this change needs to be rolled back
Estimated Time to Complete (mins) - 10 minutes
- Revert the MR with the common production storage configuration change
- Apply the reverted cookbook on the affected storages (see the sketch below)
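A minimal sketch of the rollback flow, assuming a plain revert of the chef-repo change (`<merge-commit-sha>` is a placeholder):

```shell
# 1. Revert the configuration change in chef-repo and merge it per the
#    usual review flow:
git revert <merge-commit-sha>

# 2. Re-run chef-client on the affected storages so the reverted
#    attributes are applied:
knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"
```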
Monitoring
Key metrics to observe
- Metric: gitaly_daily_maintenance_repo_optimization_seconds (see the query sketch after this list)
  - Location: https://dashboards.gitlab.net/d/9pLfuovGz/gitaly-background-maintenance?orgId=1
  - What changes to this metric should prompt a rollback: unknown
- Metric: all performance-related metrics
  - Location: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly-host-detail?orgId=1
  - What changes to this metric should prompt a rollback: any degradation of service during the specified maintenance window
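For ad-hoc checks outside the dashboards, a sample query sketch, assuming the metric is exported as a Prometheus histogram and that you have query access (the endpoint below is a placeholder):

```shell
# p95 per-repository optimization time by node over the last 5 minutes:
curl -sG 'https://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le, fqdn) (rate(gitaly_daily_maintenance_repo_optimization_seconds_bucket[5m])))'
```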
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
- Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.