2021-10-21: [gprd] Remove `gcs-snapshot.sh` cron job from the `root` crontab on Patroni backup replica nodes (again)

Production Change

Change Summary

See issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14447

See also:

This CR installs gcloud via APT instead of Snap and deletes the /usr/local/bin/gcs-snapshot.sh cron job from the root crontab on the Patroni backup replica nodes patroni-v12-10-db-gprd.c.gitlab-production.internal and patroni-v12-registry-03-db-gprd.c.gitlab-production.internal, as it should only exist in the gitlab-psql crontab.
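
As a quick pre-check of the current state, both crontabs and the gcloud installation can be inspected directly on a node (a minimal sketch, not part of the formal steps below; it assumes shell access as described in the change steps):

    sudo crontab -l | grep -F 'gcs-snapshot.sh' || echo 'not in root crontab'
    sudo crontab -u gitlab-psql -l | grep -F 'gcs-snapshot.sh' || echo 'not in gitlab-psql crontab'
    type gcloud  # should report /usr/bin/gcloud once the APT package is in place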

Change Details

  1. Services Impacted - Service::Patroni
  2. Change Technician - @pguinoiseau
  3. Change Reviewer - @cmiskell
  4. Time tracking - 45 minutes
  5. Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 15 minutes per node

For both nodes, one after the other:

  • patroni-v12-10-db-gprd.c.gitlab-production.internal

  • patroni-v12-registry-03-db-gprd.c.gitlab-production.internal

  • Establish a secure shell session to the trafficless replica node:

    ssh $host
  • Trigger chef-client:

    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
  • Verify that gcloud now resolves to /usr/bin/gcloud:

    type gcloud
  • Edit the root crontab:

    sudo crontab -e
  • Delete the entry for /usr/local/bin/gcs-snapshot.sh and save (a non-interactive alternative is sketched after this list)

  • Remove the temporary pipes:

    sudo rm -f /tmp/snapshot-start-backup /tmp/snapshot-stop-backup
  • Fix permissions on the log file:

    sudo chown gitlab-psql:gitlab-psql /var/log/gitlab/postgresql/gcs-snapshot.log
  • Find all running /usr/local/bin/gcs-snapshot.sh processes, review them, and post the output in this issue:

    ps aux | grep -v 'grep' | grep -F '/usr/local/bin/gcs-snapshot.sh' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    cat /tmp/to_kill.txt | awk '{ print $2 }' | xargs sudo kill
    rm /tmp/to_kill.txt
  • Find all running orphan psql processes, review them, and post the output in this issue:

    pgrep -afx '/usr/lib/postgresql/13/bin/psql -p 5432 -h localhost -U gitlab-superuser -d gitlabhq_production -f /tmp/snapshot-start-backup -f /tmp/snapshot-stop-backup' > /tmp/to_kill.txt
    cat /tmp/to_kill.txt
  • Terminate them all:

    cat /tmp/to_kill.txt | awk '{ print $1 }' | xargs sudo kill
    rm /tmp/to_kill.txt
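
If editing the root crontab interactively is awkward, one possible non-interactive way to drop the entry is to filter it out and reinstall the result. This is only a sketch, not the reviewed procedure, and it assumes no other root crontab line mentions gcs-snapshot.sh:

    # Remove the gcs-snapshot.sh line from the root crontab and reinstall what remains
    sudo crontab -l | grep -vF '/usr/local/bin/gcs-snapshot.sh' | sudo crontab -
    # Confirm the entry is gone
    sudo crontab -l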

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

  • Run gcs-snapshot.sh manually, verify that it works as expected, and post the output in this issue:
    sudo -H -u gitlab-psql /usr/local/bin/gcs-snapshot.sh
  • Run chef-client and verify that the cron job entry has not been re-added to root crontab:
    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
    sudo crontab -l
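  • Optionally, confirm that no leftover gcs-snapshot.sh or orphan psql processes remain (a minimal check reusing the patterns from the change steps above; pgrep exiting non-zero simply means nothing matched):
    pgrep -af '/usr/local/bin/gcs-snapshot.sh' || echo 'no gcs-snapshot.sh processes'
    pgrep -af 'snapshot-start-backup' || echo 'no orphan psql processes'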

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

No rollback should be needed: if any of the removed items turn out to be required, Chef will restore them on a subsequent chef-client run.
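
Should any removed item need to come back sooner, a chef-client run can be forced and the root crontab re-checked (a minimal sketch, reusing the same commands as the change and post-change steps):

    sudo pkill -USR1 chef-client
    sudo journalctl -f -u chef-client.service
    sudo crontab -l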

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.