2021-10-21: [gprd] Remove `gcs-snapshot.sh` cron job from the `root` crontab on Patroni backup replica nodes (again)

Production Change

Change Summary

See issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14447

See also:

This CR installs gcloud via APT instead of Snap and deletes the /usr/local/bin/gcs-snapshot.sh cron job from the root crontab on the Patroni backup replica nodes patroni-v12-10-db-gprd.c.gitlab-production.internal and patroni-v12-registry-03-db-gprd.c.gitlab-production.internal, as it should only exist in the gitlab-psql crontab.
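
As a quick pre-check of the current state, both crontabs and the gcloud installation can be inspected directly on a node (a minimal sketch, not part of the formal steps below; it assumes shell access as described in the change steps):

    sudo crontab -l | grep -F 'gcs-snapshot.sh' || echo 'not in root crontab'
    sudo crontab -u gitlab-psql -l | grep -F 'gcs-snapshot.sh' || echo 'not in gitlab-psql crontab'
    type gcloud  # should report /usr/bin/gcloud once the APT package is in place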

Change Details

  1. Services Impacted - Service::Patroni
  2. Change Technician - @pguinoiseau
  3. Change Reviewer - @cmiskell
  4. Time tracking - 45 minutes
  5. Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 15 minutes per node

For both nodes, one after the other:

  • patroni-v12-10-db-gprd.c.gitlab-production.internal

  • patroni-v12-registry-03-db-gprd.c.gitlab-production.internal

  • Establish a secure shell session to the trafficless replica node:

    ssh $host
  • Trigger chef-client:

    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
  • Verify that gcloud now resolves to /usr/bin/gcloud:

    type gcloud
  • Edit the root crontab:

    sudo crontab -e
  • Delete the entry for /usr/local/bin/gcs-snapshot.sh and save (a non-interactive alternative is sketched after this list)

  • Remove the temporary pipes:

    sudo rm -f /tmp/snapshot-start-backup /tmp/snapshot-stop-backup
  • Fix permissions on the log file:

    sudo chown gitlab-psql:gitlab-psql /var/log/gitlab/postgresql/gcs-snapshot.log
  • Find all running /usr/local/bin/gcs-snapshot.sh processes, review them, and post the output in this issue:

    ps aux | grep -v 'grep' | grep -F '/usr/local/bin/gcs-snapshot.sh' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    cat /tmp/to_kill.txt | awk '{ print $2 }' | xargs sudo kill
    rm /tmp/to_kill.txt
  • Find all running orphan psql processes, review them, and post the output in this issue:

    pgrep -afx '/usr/lib/postgresql/13/bin/psql -p 5432 -h localhost -U gitlab-superuser -d gitlabhq_production -f /tmp/snapshot-start-backup -f /tmp/snapshot-stop-backup' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    cat /tmp/to_kill.txt | awk '{ print $1 }' | xargs sudo kill
    rm /tmp/to_kill.txt
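
If editing the root crontab interactively is awkward, one possible non-interactive way to drop the entry is to filter it out and reinstall the result. This is only a sketch, not the reviewed procedure, and it assumes no other root crontab line mentions gcs-snapshot.sh:

    # Remove the gcs-snapshot.sh line from the root crontab and reinstall what remains
    sudo crontab -l | grep -vF '/usr/local/bin/gcs-snapshot.sh' | sudo crontab -
    # Confirm the entry is gone
    sudo crontab -l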

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

  • Run gcs-snapshot.sh manually, verify that it works as expected, and post the output in this issue:
    sudo -H -u gitlab-psql /usr/local/bin/gcs-snapshot.sh
  • Run chef-client and verify that the cron job entry has not been re-added to root crontab:
    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
    sudo crontab -l
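  • Optionally, confirm that no leftover gcs-snapshot.sh or orphan psql processes remain (a minimal check reusing the patterns from the change steps above; pgrep exiting non-zero simply means nothing matched):
    pgrep -af '/usr/local/bin/gcs-snapshot.sh' || echo 'no gcs-snapshot.sh processes'
    pgrep -af 'snapshot-start-backup' || echo 'no orphan psql processes'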

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

No rollback should be needed: if any of the removed items turn out to be required, Chef will restore them on a subsequent chef-client run.
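
Should any removed item need to come back sooner, a chef-client run can be forced and the root crontab re-checked (a minimal sketch, reusing the same commands as the change and post-change steps):

    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
    sudo crontab -l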

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.