Enable Gitaly scheduled maintenance on all production storages
Production Change
Change Summary
Now that we have run daily maintenance in production on a single storage without issue (#2661 (closed)), it is time to enable it by default across the fleet. This will allow us to better understand how the maintenance strategy affects performance across a diverse set of storages.
Follow-up to #2661 (closed)
Change Details
- Services Impacted - Gitaly
- Change Technician - @pokstad1
- Change Criticality - C2
- Change Type - changescheduled
- Change Reviewer - @alejandro
- Due Date - 2021-05-21 7:30 UTC
- Time tracking - 40 minutes
- Downtime Component - N/A; no downtime should be needed, since Gitaly restarts gracefully
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- Pick a low-traffic time of day to run the maintenance task (e.g. Saturday at 10:30pm for 4 hours); see the sketch below for how this maps to Gitaly's configuration.
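For reference, a minimal sketch of how to inspect the window Gitaly ends up with, assuming the Omnibus default config path (the actual values are rendered by the chef-repo cookbooks; the storage name and times below are illustrative):

```shell
# Inspect the rendered maintenance window on a node (Omnibus default path):
sudo grep -A 4 '\[daily_maintenance\]' /var/opt/gitlab/gitaly/config.toml

# Illustrative output for a 10:30pm, 4-hour window. Note the stanza has no
# day-of-week field, so the window recurs daily at this time:
#   [daily_maintenance]
#   start_hour = 22
#   start_minute = 30
#   duration = "4h"
#   storages = ["default"]
```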
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Enable maintenance on cny: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/65
- Run chef-client on the target nodes: `knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"`
- Enable maintenance on the whole fleet: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/84
- Run chef-client on the target nodes: `knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"` (a spot-check sketch follows this list)
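After each convergence, a hedged spot-check that the configuration landed, reusing the same knife targeting as above (the config path assumes Omnibus defaults):

```shell
# Count [daily_maintenance] stanzas per node; nodes printing 0 (or failing)
# have not picked up the change yet:
knife ssh roles:gprd-base-stor-gitaly-common \
  "sudo grep -c '\[daily_maintenance\]' /var/opt/gitlab/gitaly/config.toml"
```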
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Verify that the non-canary production Gitaly servers print a log message with the correct maintenance window (e.g. `level=info msg="maintenance: daily scheduled" scheduled="2020-07-29 23:04:00 -0700 PDT"`); one way to check is sketched below.
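One way to run this check across the fleet, assuming Gitaly's runit log location under Omnibus (adjust the path if logs are shipped elsewhere):

```shell
# Print the most recent scheduling message from each node:
knife ssh roles:gprd-base-stor-gitaly-common \
  "sudo grep 'maintenance: daily scheduled' /var/log/gitlab/gitaly/current | tail -n 1"
```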
Rollback
Rollback steps - steps to be taken in the event this change needs to be rolled back
Estimated Time to Complete (mins) - 10 minutes
- Revert the MR with the common production storage configuration change
- Apply the reverted cookbook on the affected storages (see the sketch below)
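A minimal sketch of the rollback flow, assuming a plain revert of the chef-repo change (`<merge-commit-sha>` is a placeholder):

```shell
# 1. Revert the configuration change in chef-repo and merge it per the
#    usual review flow:
git revert <merge-commit-sha>

# 2. Re-run chef-client on the affected storages so the reverted
#    attributes are applied:
knife ssh roles:gprd-base-stor-gitaly-common "sudo chef-client"
```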
Monitoring
Key metrics to observe
- Metric: gitaly_daily_maintenance_repo_optimization_seconds (see the query sketch after this list)
  - Location: https://dashboards.gitlab.net/d/9pLfuovGz/gitaly-background-maintenance?orgId=1
  - What changes to this metric should prompt a rollback: unknown
- Metric: all performance-related metrics
  - Location: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly-host-detail?orgId=1
  - What changes to this metric should prompt a rollback: any degradation of service during the specified maintenance window
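For ad-hoc checks outside the dashboards, a sample query sketch, assuming the metric is exported as a Prometheus histogram and that you have query access (the endpoint below is a placeholder):

```shell
# p95 per-repository optimization time by node over the last 5 minutes:
curl -sG 'https://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le, fqdn) (rate(gitaly_daily_maintenance_repo_optimization_seconds_bucket[5m])))'
```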
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
- Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.