[gprd] Restart patroni-ci-01-db-gprd
Production
Change
Change Summary
[gprd] Restart the patroni-ci-01-db-gprd node.
Once the node has finished running the startup-script and the initial chef-client convergence, the patroni service is expected to begin replicating from the main cluster's trafficless replica, patroni-v12-10-db-gprd. Because this interacts with a member of the production database cluster, a C2 change is required.
Fulfills issue: Create chef roles and add required terraform modules: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14564
Part of epic: [GPRD] Provision the patroni-ci cluster in the production environment: &620 (closed)
Change Details
- Services Impacted - ServicePatroni
- Change Technician - @nnelson
- Change Reviewer - @Finotto
- Time tracking - ~15 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete - 1 minute
- Add a comment to this issue with the following content: /label changein-progress
Change Steps - steps to take to execute the change
Estimated Time to Complete - 4 minutes
- Navigate to the new patroni-ci-01-db-gprd node console in GCP: https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-c/instances/patroni-ci-01-db-gprd?q=search&referrer=search&project=gitlab-production
- Stop the node: gcloud --project="gitlab-production" compute instances stop --zone="us-east1-c" "patroni-ci-01-db-gprd"
  - Execute: gcloud --project="gitlab-production" compute instances describe --zone="us-east1-c" "patroni-ci-01-db-gprd" --format=json | jq --raw-output .status
  - Verify that the output is no longer "RUNNING".
- Start the node: gcloud --project="gitlab-production" compute instances start --zone="us-east1-c" "patroni-ci-01-db-gprd"
  - Execute: gcloud --project="gitlab-production" compute instances describe --zone="us-east1-c" "patroni-ci-01-db-gprd" --format=json | jq --raw-output .status
  - Verify that the output is now "RUNNING". If not, repeat the previous step until the status is "RUNNING".
- Add a comment to this issue with the following content: /unlabel changein-progress
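The stop, verify, start, verify sequence above could be wrapped in a small polling helper so that the status checks do not have to be repeated by hand. The following is a minimal sketch, not part of the approved plan. The function names, the attempt count, and the polling interval are illustrative assumptions. It uses `--format='value(status)'` in place of the `jq` pipeline, and it assumes a stopped instance reports `TERMINATED`, which is the status GCE gives a stopped VM.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the documented stop/describe/start commands in a polling loop.
set -euo pipefail

PROJECT="gitlab-production"
ZONE="us-east1-c"
NODE="patroni-ci-01-db-gprd"

# Print the current instance status (for example RUNNING, STOPPING, TERMINATED).
instance_status() {
  gcloud --project="${PROJECT}" compute instances describe \
    --zone="${ZONE}" "${NODE}" --format='value(status)'
}

# Poll until the status equals $1. Gives up after $2 attempts, $3 seconds apart.
# The defaults (30 attempts, 10 seconds) are illustrative, not taken from the plan.
wait_for_status() {
  local want="$1" attempts="${2:-30}" interval="${3:-10}" status="" i
  for ((i = 1; i <= attempts; i++)); do
    status="$(instance_status)"
    echo "attempt ${i}: status=${status}"
    if [[ "${status}" == "${want}" ]]; then
      return 0
    fi
    sleep "${interval}"
  done
  echo "timed out waiting for ${want}; last status was ${status}" >&2
  return 1
}

# Hypothetical wrapper around the change steps: stop, confirm, start, confirm.
restart_node() {
  gcloud --project="${PROJECT}" compute instances stop --zone="${ZONE}" "${NODE}"
  wait_for_status "TERMINATED"
  gcloud --project="${PROJECT}" compute instances start --zone="${ZONE}" "${NODE}"
  wait_for_status "RUNNING"
}
```

Sourcing the file and calling `restart_node` would run the whole sequence; `wait_for_status "RUNNING"` can also be called on its own to repeat the final verification.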
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete - 10 minutes
- Monitor the node console output logs: https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-c/instances/patroni-ci-01-db-gprd/console?port=1&project=gitlab-production
  - Verify that the startup-script chef-client converges successfully.
- Monitor the node's replication activity.
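Both verification steps can also be approximated from a terminal. The sketch below is an assumption-laden illustration rather than the team's procedure. The function names are hypothetical. The pattern matched in `chef_converged` assumes chef-client logs a line containing "Chef Client finished" or "Chef Infra Client finished" on success. The SQL is a generic PostgreSQL 12 replica check that would need to be run on the node through whatever `psql` wrapper is available there.

```shell
#!/usr/bin/env bash
# Sketch only: terminal equivalents of the two post-change checks.
set -euo pipefail

PROJECT="gitlab-production"
ZONE="us-east1-c"
NODE="patroni-ci-01-db-gprd"

# Dump the serial console output (port 1, matching the console URL above).
console_output() {
  gcloud --project="${PROJECT}" compute instances get-serial-port-output \
    --zone="${ZONE}" --port=1 "${NODE}"
}

# Succeeds if stdin contains a chef-client completion line.
# Assumption: the success message contains "Chef Client finished" or
# "Chef Infra Client finished"; adjust the pattern to the actual log wording.
chef_converged() {
  grep -Ei 'chef( infra)? client finished' >/dev/null
}

# Print a query that reports replica state. Run it on the node itself.
replication_query() {
  cat <<'SQL'
SELECT pg_is_in_recovery()                   AS is_replica,
       pg_last_wal_receive_lsn()             AS received_lsn,
       pg_last_wal_replay_lsn()              AS replayed_lsn,
       now() - pg_last_xact_replay_timestamp() AS replay_delay;
SQL
}
```

For example, `console_output | chef_converged && echo "chef-client converged"` checks the console log, and the output of `replication_query` can be piped into `psql` on the node. A replica that is receiving WAL should show `is_replica` as true and advancing LSN values across repeated runs.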
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
A rollback is unlikely to help with any failure condition here, given that the node is currently malfunctioning.
Estimated Time to Complete - 0 minutes
- Nothing to do.
Monitoring
Key metrics to observe
- Metric: patroni Service Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&from=now-3h&to=now&refresh=1m&viewPanel=3543037459
  - What changes to this metric should prompt a rollback: Any reduction in apdex SLI below 99.4% for longer than two minutes.
- Metric: pgbouncer active client connections
  - Location: https://dashboards.gitlab.net/d/PwlB97Jmk/pgbouncer-overview?orgId=1&viewPanel=1
  - What changes to this metric should prompt a rollback: Any sustained reduction.
- Metric: patroni-v12-10-db-gprd CPU usage
  - Location: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=gprd&var-node=patroni-v12-10-db-gprd.c.gitlab-production.internal
  - What changes to this metric should prompt a rollback: Any sustained elevation.
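The apdex condition above ("below 99.4% for longer than two minutes") is the only threshold stated precisely enough to express as a check. The following sketch shows one way to evaluate it against a series of samples. The function name and the input format (one `epoch_seconds apdex_percent` pair per line, oldest first) are assumptions made for illustration; how the samples are exported from the dashboard is not covered here.

```shell
#!/usr/bin/env bash
# Sketch only: evaluates the apdex rollback condition on sampled data.

# Reads "epoch_seconds apdex_percent" pairs on stdin, oldest first.
# Returns 0 if apdex stayed below the threshold for longer than the window,
# which would indicate the rollback condition; returns 1 otherwise.
# $1 = threshold in percent (default 99.4), $2 = window in seconds (default 120).
apdex_breached() {
  local threshold="${1:-99.4}" window="${2:-120}"
  awk -v threshold="${threshold}" -v window="${window}" '
    $2 < threshold {
      # Remember when the current run of low samples began.
      if (!in_breach) { in_breach = 1; start = $1 }
      if ($1 - start > window) breached = 1
      next
    }
    # A sample at or above the threshold ends the current run.
    { in_breach = 0 }
    END { exit (breached ? 0 : 1) }
  '
}
```

For example, `printf '0 99.9\n60 99.0\n120 99.1\n200 99.2\n' | apdex_breached` returns success, because the samples stay below 99.4 from second 60 through second 200, a span of 140 seconds.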
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.