[gprd] Remove `gcs-snapshot.sh` cron job from the `gitlab-psql` crontab on Patroni backup replica nodes

Production Change

Change Summary

See issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14447 and change request #5761 (closed), which did the reverse.

Delete the /usr/local/bin/gcs-snapshot.sh cron job from the gitlab-psql crontab on the Patroni backup replica nodes patroni-v12-10-db-gprd.c.gitlab-production.internal and patroni-v12-registry-03-db-gprd.c.gitlab-production.internal, as it should only exist in the root crontab.

Change Details

  1. Services Impacted - Service::Patroni
  2. Change Technician - @pguinoiseau
  3. Change Reviewer - TBD
  4. Time tracking - 45 minutes
  5. Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 15 minutes per node

For each of the following nodes, one after the other:

  • patroni-v12-10-db-gprd.c.gitlab-production.internal

  • patroni-v12-registry-03-db-gprd.c.gitlab-production.internal

run the following steps:

  • Establish a secure shell session to the trafficless replica node:

    ssh $host
  • Edit the gitlab-psql crontab:

    sudo crontab -e -u gitlab-psql
  • Delete the entry for /usr/local/bin/gcs-snapshot.sh and save

  • Find all running /usr/local/bin/gcs-snapshot.sh processes, review them, and record the output as a comment on this issue:

    ps aux | grep -v 'grep' | grep -F '/usr/local/bin/gcs-snapshot.sh' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    awk '{ print $2 }' /tmp/to_kill.txt | xargs -r sudo kill
    rm /tmp/to_kill.txt
  • Find all running orphan psql processes, review them, and record the output as a comment on this issue:

    ps aux | grep -v 'grep' | grep -F '/usr/lib/postgresql/13/bin/psql' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    awk '{ print $2 }' /tmp/to_kill.txt | xargs -r sudo kill
    rm /tmp/to_kill.txt
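The crontab edit and the find-and-kill steps above can also be sketched non-interactively. The two helpers below, `strip_cron_entry` and `pids_matching`, are hypothetical names introduced only for illustration and are not part of this change; they assume `ps aux` output with the PID in the second column and a `crontab` implementation that accepts `-` to read the new crontab from standard input. Both read from stdin, so they can be dry-run against captured text before anything is changed on a node.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching the change steps; not part of the original
# procedure. Both read text on stdin, so they can be dry-run on captured output.

# Print stdin minus every line that contains the literal string "$1".
# awk's index() is a plain substring match (no regex), and awk exits 0 even
# when every line is dropped, so an empty result is still a valid crontab.
strip_cron_entry() {
  NEEDLE="$1" awk 'index($0, ENVIRON["NEEDLE"]) == 0'
}

# Print the PID column ($2) of `ps aux` output on stdin for every line that
# contains the literal string "$1", skipping lines that mention grep, like
# the original `grep -v 'grep'` filter. The needle is passed through the
# environment so it never appears in awk's own command line.
pids_matching() {
  NEEDLE="$1" awk 'index($0, ENVIRON["NEEDLE"]) > 0 && index($0, "grep") == 0 { print $2 }'
}

# Example usage on a node (review the PID list before killing anything):
#   sudo crontab -l -u gitlab-psql \
#     | strip_cron_entry /usr/local/bin/gcs-snapshot.sh \
#     | sudo crontab -u gitlab-psql -
#   ps aux | pids_matching /usr/local/bin/gcs-snapshot.sh
#   ps aux | pids_matching /usr/local/bin/gcs-snapshot.sh | xargs -r sudo kill
```

The review step still matters with this sketch: like the original `ps aux | grep` pipeline, `pids_matching` matches any process whose command line contains the path, regardless of which user runs it, so the PID list should be checked before it is piped to `kill`.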

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

  • Run chef-client and verify that the cron job entry has not been re-added to gitlab-psql's crontab:

    sudo chef-client
    sudo crontab -l -u gitlab-psql
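The post-change check can be sketched as a small shell function that fails when the entry is still present. `assert_no_cron_entry` is a hypothetical helper, not part of the original procedure; it only inspects the crontab text piped into it, so it assumes the `sudo crontab -l -u gitlab-psql` listing above as its input.

```shell
#!/usr/bin/env bash
# Hypothetical post-change check, not part of the original procedure.
# Reads crontab text on stdin; returns 0 if no line contains the literal
# string "$1", and 1 (with a message on stderr) if one does.
assert_no_cron_entry() {
  if NEEDLE="$1" awk 'index($0, ENVIRON["NEEDLE"]) > 0 { found = 1 } END { exit (found ? 0 : 1) }'; then
    echo "FAIL: $1 is still in the crontab" >&2
    return 1
  fi
  echo "OK: $1 is not in the crontab"
}

# Example usage on a node, after `sudo chef-client` has finished:
#   sudo crontab -l -u gitlab-psql | assert_no_cron_entry /usr/local/bin/gcs-snapshot.sh
```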

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

There is no reason to roll back: this entry is a configuration remnant from the upgrade to 16.04. If it does need to be there, Chef will restore it on a later chef-client run.

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Pierre Guinoiseau