[gstg] Update cgroups configuration

Production Change

Change Summary

In gitlab-org/omnibus-gitlab!6076 (merged), we updated the cgroups configuration to a newer format. See https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1878 for details of how the new format works. This change updates the Omnibus configuration on the gstg Gitaly fleet to use the new format.

Change Details

Services Impacted - Service::Gitaly
Change Technician - @alejandro
Change Reviewer - @f_santos
Time tracking - 30
Downtime Component - 0

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1

Set label change::in-progress on this issue
Ensure the feature flag is disabled
- /chatops run feature get gitaly_run_cmds_in_cgroup

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20

Create a silence for Chef client failures
Disable chef-client across the Gitaly fleet
- knife ssh roles:gstg-base-stor-gitaly 'sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7144'
Merge and apply https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1878
Ensure cgroups mountpoint is created:
- knife ssh roles:gstg-base-stor-gitaly 'sudo mkdir -m 0700 -p /sys/fs/cgroup/cpu/gitaly && sudo chown git:git /sys/fs/cgroup/cpu/gitaly && sudo mkdir -m 0700 -p /sys/fs/cgroup/memory/gitaly && sudo chown git:git /sys/fs/cgroup/memory/gitaly'
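Before moving on, it may help to confirm on a node that both directories exist with the expected owner and mode. The following is a minimal sketch, not part of the original runbook: `check_dir` is a hypothetical helper, and it assumes GNU `stat` (the `-c` format option) is available on the Gitaly hosts.

```shell
# Hypothetical helper: succeed only if DIR exists and has the given
# owner ("user:group") and octal mode (for example 700).
check_dir() (
  dir=$1
  want_owner=$2
  want_mode=$3
  [ -d "$dir" ] || { echo "missing: $dir" >&2; exit 1; }
  # GNU stat: %U = owner name, %G = group name, %a = octal permissions.
  got=$(stat -c '%U:%G %a' "$dir") || exit 1
  if [ "$got" != "$want_owner $want_mode" ]; then
    echo "unexpected '$got' on $dir" >&2
    exit 1
  fi
)

# Example, run as root on a Gitaly node:
# check_dir /sys/fs/cgroup/cpu/gitaly git:git 700
# check_dir /sys/fs/cgroup/memory/gitaly git:git 700
```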
Take a backup of /opt/gitlab/sv/gitaly/run:
- knife ssh roles:gstg-base-stor-gitaly 'sudo cp /opt/gitlab/sv/gitaly/run /tmp/sv-gitaly-run'
Patch /opt/gitlab/sv/gitaly/run to prevent Omnibus from restarting Gitaly:
- knife ssh roles:gstg-base-stor-gitaly 'sudo curl -sSf -o /tmp/run.patch https://gitlab.com/-/snippets/2294523/raw/main/run.patch && sudo patch -p0 -N /opt/gitlab/sv/gitaly/run </tmp/run.patch'
Verify that the patch applied correctly. The MD5 hash should be 5bc9c70570c7744e6993798ede93cd96:
- knife ssh roles:gstg-base-stor-gitaly 'sudo md5sum /opt/gitlab/sv/gitaly/run'
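Comparing hashes by eye across the whole fleet is error-prone, so the check can instead be made to fail on a mismatch. The sketch below relies on `md5sum -c`, which reads `<hash>  <path>` lines and exits non-zero if any file does not match. `verify_md5` is a hypothetical helper name, not existing tooling.

```shell
# Hypothetical helper: exit 0 only if FILE has the EXPECTED MD5 hash.
verify_md5() (
  expected=$1
  file=$2
  # md5sum -c expects "<hash><two spaces><path>" on each input line.
  printf '%s  %s\n' "$expected" "$file" | md5sum -c -
)

# Example, run on a Gitaly node with read access to the file:
# verify_md5 5bc9c70570c7744e6993798ede93cd96 /opt/gitlab/sv/gitaly/run
```

Under knife ssh, the same pipeline could be inlined in the quoted command, with sudo in front of md5sum.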
Re-enable and run chef-client on a single node first, file-01-stor-gstg.c.gitlab-staging-1.internal:
- sudo chef-client-enable && sudo chef-client
Ensure Gitaly wasn't hard-restarted on that node. This command should return no output:
- sudo grep "received signal" /var/log/gitlab/gitaly/current | grep terminated
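An empty result is easy to misread, so the same check can be turned into an explicit pass or fail. This is a sketch only: `gitaly_hard_restarted` is a hypothetical helper that wraps the two grep calls from the step above and reports the result through its exit status.

```shell
# Hypothetical helper: exit 0 if LOG contains a line matching both
# "received signal" and "terminated", the same test as the pipeline above.
gitaly_hard_restarted() (
  log=$1
  # grep -q prints nothing and exits 0 as soon as a match is found.
  grep "received signal" "$log" | grep -q terminated
)

# Example, run as root on the node:
# if gitaly_hard_restarted /var/log/gitlab/gitaly/current; then
#   echo "hard restart detected, stop the rollout" >&2
# else
#   echo "no hard restart found"
# fi
```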
Re-enable chef-client across the fleet gradually:
- knife ssh -C 10 roles:gstg-base-stor-gitaly 'sudo chef-client-enable && sudo chef-client'

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10

Verify that all Gitaly shards have the new cgroup config:
- knife ssh -C 10 roles:gstg-base-stor-gitaly 'sudo grep mountpoint /var/opt/gitlab/gitaly/config.toml' | sort
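With many hosts, the sorted output can be condensed into one line per distinct value with a host count. The sketch below assumes that each line of knife ssh output starts with the host name followed by whitespace. That format is an assumption here, so check it against real output before relying on the counts. `summarize_by_value` is a hypothetical helper.

```shell
# Hypothetical helper: read "<host> <text>" lines on stdin and print how
# many hosts reported each distinct <text>.
summarize_by_value() {
  # Strip the first whitespace-delimited field (assumed to be the host name),
  # then count identical remaining lines.
  sed 's/^[^[:space:]]*[[:space:]]*//' | sort | uniq -c
}

# Example:
# knife ssh -C 10 roles:gstg-base-stor-gitaly \
#   'sudo grep mountpoint /var/opt/gitlab/gitaly/config.toml' | summarize_by_value
```

A single output line whose count equals the number of Gitaly nodes would indicate that every node reports the same setting.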

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1878
Run chef-client across the fleet gradually:
- knife ssh -C 10 roles:gstg-base-stor-gitaly 'sudo chef-client'

Monitoring

Key metrics to observe

Metric: Gitaly Service Apdex and Error Ratio
- Location: https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: degradation violating our SLOs
Metric: Gitaly Fleet Overview Cgroup Processes
- Location: https://dashboards.gitlab.net/d/000000214/gitaly-fleet-overview?orgId=1&refresh=5m

Summary of infrastructure changes

Does this change introduce new compute instances? No
Does this change re-size any existing compute instances? No
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

Summary of the above

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
This Change Issue is linked to the appropriate Issue and/or Epic
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
There are currently no active incidents.

Edited Jun 01, 2022 by Alejandro Rodríguez
