[gprd] Delete the `/usr/local/bin/gcs-snapshot.sh` crontab entry in the `root` crontab on `patroni-v12-10-db-gprd`
Production Change
Change Summary
[gprd] Delete the /usr/local/bin/gcs-snapshot.sh crontab entry in the root crontab on patroni-v12-10-db-gprd.
Fulfills: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14444
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @nnelson
- Change Reviewer - tbd
- Time tracking - 15 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 minutes
- Establish a secure shell session to the trafficless replica node: `ssh patroni-v12-10-db-gprd.c.gitlab-production.internal`
- Switch user to root, and edit the root crontab: `sudo su - root`, then `crontab -e`
- Delete the entry for `/usr/local/bin/gcs-snapshot.sh`.
- Ensure that the current session is using the root user and delete the FIFO files: `sudo su - root`, then `rm -f /tmp/snapshot-start-backup` and `rm -f /tmp/snapshot-stop-backup`
- Ensure that the current session is using the root user, find all running `/usr/local/bin/gcs-snapshot.sh` processes, review them, and record the output as a comment on this issue: `sudo su - root`, then `ps -aux | grep -v 'grep' | grep '/usr/local/bin/gcs-snapshot.sh' > /tmp/to_kill.txt` and `cat /tmp/to_kill.txt`
- Terminate them all: `cat /tmp/to_kill.txt | awk '{ print $2 }' | xargs kill -9`
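The crontab edit and the process lookup above could also be done without an interactive editor. The following is a minimal sketch, not the procedure that was run: the helper names `strip_snapshot_entry` and `snapshot_pids` are hypothetical, and both only filter text on standard input, so they change nothing on their own.

```shell
# Sketch only: the helper names below are hypothetical, not part of this change.
SCRIPT='/usr/local/bin/gcs-snapshot.sh'

# Read crontab text on stdin and print it without any line mentioning $SCRIPT.
strip_snapshot_entry() {
  # grep exits 1 when every line was filtered out; treat that as success.
  grep -vF -- "$SCRIPT" || true
}

# Read `ps aux`-style output on stdin and print the PID (column 2) of every
# line mentioning $SCRIPT, skipping any grep process that matched itself.
snapshot_pids() {
  grep -F -- "$SCRIPT" | grep -v 'grep' | awk '{ print $2 }'
}
```

As root, `crontab -l | strip_snapshot_entry | crontab -` would reinstall the crontab without the entry, assuming a cron implementation that accepts `-` for standard input. `ps aux | snapshot_pids | xargs -r kill` would send SIGTERM first (the `-r` flag assumes GNU `xargs`), leaving `kill -9` for anything that survives.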
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 2 minutes
- Switch user to the gitlab-psql user and invoke the `gcs-snapshot.sh` script to verify that things are working correctly once again: `sudo su - gitlab-psql`, then `/usr/local/bin/gcs-snapshot.sh`
- Confirm that there are no errors in the log output from the script.
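In addition to running the script, it may be worth confirming that the root crontab no longer references it. This is a small sketch under that assumption; `count_snapshot_entries` is a hypothetical helper that only counts matching lines in whatever crontab text it is given.

```shell
# Sketch only: count_snapshot_entries is a hypothetical helper name.
SCRIPT='/usr/local/bin/gcs-snapshot.sh'

# Print how many lines of the crontab text on stdin still mention $SCRIPT.
count_snapshot_entries() {
  # grep -c prints 0 but exits 1 when nothing matches; treat that as success.
  grep -cF -- "$SCRIPT" || true
}
```

Run as root, `crontab -l | count_snapshot_entries` should print `0` once the entry is gone. A non-zero count after a later chef-client run would suggest the entry is managed by a cookbook.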
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 0 minutes
This appears to be a misconfiguration, most likely left over from an earlier attempt that configured the crontab for the root user.
There should be no reason to roll back. If this configuration is canonical according to chef, the chef-client convergence process will restore the entry on its own, and a separate change will have to be made in the cookbook recipe.
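Should a manual rollback be wanted anyway, saving the root crontab before editing it makes that a single command. The sketch below is an assumption about how that could be done, not part of the planned steps: `backup_crontab`, `restore_crontab`, and the `CRONTAB_CMD` override are hypothetical names.

```shell
# Sketch only: helper names and CRONTAB_CMD are hypothetical.
# CRONTAB_CMD can be overridden (for example in a dry run); it defaults to crontab.
CRONTAB_CMD=${CRONTAB_CMD:-crontab}

# Save the current user's crontab to the file given as $1.
# If the user has no crontab, the command fails and the file is left empty.
backup_crontab() {
  "$CRONTAB_CMD" -l > "$1"
}

# Reinstall the crontab from the file given as $1, replacing the current one.
restore_crontab() {
  "$CRONTAB_CMD" "$1"
}
```

For example, as root, `backup_crontab /root/crontab.before-change` before the edit and `restore_crontab /root/crontab.before-change` to undo it; the backup path is only an illustration.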
Monitoring
Key metrics to observe
- Metric: patroni Service Apdex
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1
- What changes to this metric should prompt a rollback: Any sustained (more than 2-5 minutes) reduction in SLI below the 1 hour SLO.
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Nels Nelson