Enable Gitaly scheduled maintenance on all production storages

Production Change

Change Summary

Daily maintenance has run in production on a single storage without issue (#2661 (closed)), so it is time to turn it on by default across the fleet. This will allow us to better understand how a maintenance strategy affects performance across a diverse set of storages.

Follow up from #2661 (closed)

Change Details

  1. Services Impacted - Gitaly
  2. Change Technician - @pokstad1
  3. Change Criticality - C2
  4. Change Type - changescheduled
  5. Change Reviewer - @alejandro
  6. Due Date - 2021-05-21 7:30 UTC
  7. Time tracking - 40 minutes
  8. Downtime Component - N/A; a graceful restart of Gitaly should not require downtime

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 minutes

  • Pick a low traffic time of day to run the maintenance task (e.g. Saturday at 10:30pm for 4 hours)
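When choosing the window, it can help to write out exactly when it starts and ends, since a late-evening window of several hours runs into the next calendar day. The following is a minimal sketch of that arithmetic using the example above (Saturday at 10:30pm for 4 hours). The function name `maintenance_window` and the specific date are illustrative assumptions, not part of Gitaly or of this change.

```python
from datetime import datetime, timedelta


def maintenance_window(start, hours):
    """Return (start, end, crosses_midnight) for a window of `hours` hours.

    `start` is a naive datetime in whatever time zone the window is
    planned in; this sketch does no time zone conversion.
    """
    end = start + timedelta(hours=hours)
    crosses_midnight = end.date() != start.date()
    return start, end, crosses_midnight


# Hypothetical example: Saturday 2021-05-22 at 22:30, for 4 hours.
start, end, crosses = maintenance_window(datetime(2021, 5, 22, 22, 30), 4)
print(start.strftime("%A %H:%M"), "->", end.strftime("%A %H:%M"), crosses)
```

Here the window ends on Sunday at 02:30, so any traffic check should cover early Sunday as well as late Saturday.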

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

  • Verify that the non-canary production Gitaly servers print a log message with the correct maintenance window (e.g. level=info msg="maintenance: daily scheduled" scheduled="2020-07-29 23:04:00 -0700 PDT")
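The check above could be scripted roughly as follows. This is a sketch only: it assumes the log line has the key="value" shape shown in the example and that the scheduled field always has the form "date time offset zone-name". The helper names (`parse_fields`, `scheduled_start`, `matches_window`) are hypothetical and not part of Gitaly.

```python
import re
from datetime import datetime, timedelta

# Matches key=value and key="quoted value" pairs, as in the example line.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_fields(line):
    """Split a log line into a dict of its key/value pairs."""
    return {key: quoted or bare for key, quoted, bare in PAIR_RE.findall(line)}


def scheduled_start(line):
    """Return the scheduled start as an aware datetime, or None if the
    line is not a 'maintenance: daily scheduled' message."""
    fields = parse_fields(line)
    if fields.get("msg") != "maintenance: daily scheduled":
        return None
    if "scheduled" not in fields:
        return None
    # Keep date, time and numeric offset; drop the trailing zone name (PDT).
    date, clock, offset = fields["scheduled"].split()[:3]
    return datetime.strptime(f"{date} {clock} {offset}", "%Y-%m-%d %H:%M:%S %z")


def matches_window(line, hour, minute):
    """True if the line schedules maintenance at hour:minute (in the
    offset printed in the log line itself)."""
    start = scheduled_start(line)
    return start is not None and (start.hour, start.minute) == (hour, minute)


example = ('level=info msg="maintenance: daily scheduled" '
           'scheduled="2020-07-29 23:04:00 -0700 PDT"')
print(matches_window(example, 23, 4))
```

The comparison uses the hour and minute as printed in the log line, so the expected values should be expressed in the same time zone as the server reports.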

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 minutes

  • Revert MR with common production storage configuration change
  • Apply reverted cookbook on affected storages

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Alejandro Rodríguez