Execute Gitaly OS upgrade on gprd-cny (failed)
Production Change
Change Summary
Perform OS upgrade on Gitaly nodes by rebuilding the VMs on gprd
Nodes:
- file-cny-01-stor-gprd.c.gitlab-production.internal
Closes scalability#1424
Change Details
- Services Impacted - Service::Gitaly
- Change Technician - @alejandro
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Time tracking - 60 minutes
- Downtime Component - 5 minutes
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 45
- Coordinate an execution time with Delivery that doesn't coincide with a deployment
- Determine which shards will be included in the batch and add a list of the target shards in a comment below
- Clone and set up https://gitlab.com/gitlab-com/gl-infra/ansible-workloads/gitaly-os-upgrade
- Start the script for this batch: bin/rebuild gprd 0 (see the sketch after this list). When the packer image is built, the runbook will present a prompt like the following: Target base image: 'packer-gitaly-gprd-XXXX'. Create an MR to update the base images of packertest and the target modules to it
- Prepare an MR for https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt that sets the os_boot_image of the appropriate file modules, plus the file-packertest module, to the image name from the previous step: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3519
- Merge the image update MR
- On the playbook, confirm that the MR is merged, and continue execution. The runbook will verify the image on packertest
- When prompted Proceed to rebuild (y/n)?, wait for the maintenance window before continuing
- Set the change::in-progress label on this issue
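A minimal sketch of starting the rebuild runbook, assuming the gitaly-os-upgrade repository can be run directly after cloning (any additional setup from its README is not shown here):

```shell
# Clone the rebuild tooling referenced above.
git clone https://gitlab.com/gitlab-com/gl-infra/ansible-workloads/gitaly-os-upgrade.git
cd gitaly-os-upgrade

# Kick off batch 0 against gprd. The runbook first builds the packer image and
# then pauses with the "Target base image: 'packer-gitaly-gprd-XXXX'" prompt
# described above; keep the session open until the batch completes.
bin/rebuild gprd 0
```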
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 for every 4 servers in the batch
- Confirm Proceed to rebuild on the playbook, and continue execution
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Notify @release-managers of completion so deployments can resume
- Confirm that the number of shards remaining for upgrade matches expectations
- If this is the final batch, check that all shards are running the expected version (see the sketch after this list)
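One possible spot-check for the final batch is an Ansible ad-hoc command against the rebuild inventory. This is only a sketch: it reuses the inventory/gprd path and the batch0 group from the rollback step below, which may not match how the real inventory names the upgraded shards.

```shell
# Print the OS release on every host in the batch and compare it with the
# release the new packer image is expected to ship.
ansible batch0 -i inventory/gprd -m command -a 'cat /etc/os-release'
```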
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5 per server in the batch
- Create a revert MR for the https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt change
- Merge and apply the revert MR
- Run the rebuild playbook: ansible-playbook run.yml -i inventory/gprd --limit batch0 (a dry-run variant is sketched below)
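A hedged sketch of the rollback playbook run; the optional dry run assumes run.yml behaves sensibly in Ansible's check mode, which has not been verified here.

```shell
# Optional dry run first (assumes run.yml tolerates check mode).
ansible-playbook run.yml -i inventory/gprd --limit batch0 --check --diff

# Rebuild the batch against the reverted os_boot_image.
ansible-playbook run.yml -i inventory/gprd --limit batch0
```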
Monitoring
Key metrics to observe
- Metric: Gitaly Per-Node Service Aggregated SLIs: Apdex, Error ratio, RPS
  - Location: https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly-host-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-fqdn=file-cny-01-stor-gprd.c.gitlab-production.internal
  - What changes to this metric should prompt a rollback: Degradation violating our SLOs after the instance rebuild has been completed.
- Metric: Chef client errors (the underlying query is sketched after this list)
  - Location: https://prometheus.gprd.gitlab.net/graph?g0.expr=chef_client_error%7Btype%3D%22gitaly%22%7D%20%3D%3D%201&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
  - What changes to this metric should prompt a rollback: Failures due to misconfiguration of the chef client.
- Metric: git Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1&viewPanel=1941771315&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: Degradation violating our SLOs; also watch for smaller degradation, as the effect of a single server may not be dramatic on the overall service error rate.
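For reference, the Chef client error graph linked above evaluates the Prometheus expression chef_client_error{type="gitaly"} == 1. A minimal sketch of running the same instant query against the Prometheus HTTP API, assuming the endpoint is reachable from your workstation:

```shell
# Instant query: any returned series is a Gitaly node whose last Chef run failed.
curl -sG 'https://prometheus.gprd.gitlab.net/api/v1/query' \
  --data-urlencode 'query=chef_client_error{type="gitaly"} == 1'
```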
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.? No
Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.