[GSTG] Re-create the database CI cluster

Staging Change

Change Summary

This change is needed to re-create the Pgbouncer CI and Patroni CI clusters, together with the Patroni ZFS CI, Postgres CI DR Archive, and Postgres CI DR Delayed clusters, on Ubuntu 16.04.

Related issue: Infrastructure#14917

Change Details

  1. Services Impacted - Service::Patroni, Database
  2. Change Technician - @nhoppe1
  3. Change Reviewers - @Finotto, @mchacon3
  4. Time tracking - 145 minutes
  5. Downtime Component - N/A

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 80 minutes

  • Merge the MR config-mgmt!3277
  • Merge the MR config-mgmt!3292
  • Merge the MR config-mgmt!3317
  • Merge the MR chef-repo!1170
  • Execute a git pull of the config-mgmt repository to update the local copy
  • Remove the existing clusters by manually executing in the (updated) config-mgmt repository: cd environments/gstg/ && tf destroy -target="module.patroni-ci" -target="module.pgbouncer-ci" -target="module.postgres-ci-dr-archive" -target="module.postgres-ci-dr-delayed" -target="module.patroni-zfs-ci"
  • Ensure the hosts from the Pgbouncer CI cluster have been deregistered from Chef: knife node list | grep gstg | grep pgbouncer-ci
  • Ensure the hosts from the Patroni CI cluster have been deregistered from Chef: knife node list | grep gstg | grep patroni-ci
  • Ensure the hosts from the Patroni ZFS CI cluster have been deregistered from Chef: knife node list | grep gstg | grep patroni-zfs-ci
  • Ensure the hosts from the Postgres CI DR Archive cluster have been deregistered from Chef: knife node list | grep gstg | grep postgres-ci-dr-archive
  • Ensure the hosts from the Postgres CI DR Delayed cluster have been deregistered from Chef: knife node list | grep gstg | grep postgres-ci-dr-delayed (if any hosts are still registered, see the cleanup sketch after this list)
  • Re-create the clusters by manually executing in the (updated) config-mgmt repository: cd environments/gstg/ && tf apply -target="module.patroni-ci" -target="module.pgbouncer-ci" -target="module.postgres-ci-dr-archive" -target="module.postgres-ci-dr-delayed" -target="module.patroni-zfs-ci"
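If any of the deregistration checks above still return hosts once the destroy has completed, the stale Chef node and client objects can be removed by hand before the clusters are re-created. A minimal sketch using standard knife commands; the host name is illustrative and should be replaced with whatever the checks actually return:

    # List any CI-cluster nodes still registered in Chef after the destroy
    knife node list | grep gstg | grep -E 'patroni-ci|pgbouncer-ci|patroni-zfs-ci|postgres-ci-dr'

    # Remove a stale node and its client object (host name is illustrative)
    knife node delete patroni-ci-01-db-gstg.c.gitlab-staging-1.internal -y
    knife client delete patroni-ci-01-db-gstg.c.gitlab-staging-1.internal -y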

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 60 minutes

  • Confirm Consul service exists: consul catalog services | grep patroni-ci
  • Confirm the new hosts are part of the Consul service: consul catalog nodes | grep patroni-ci
  • Connect to any of the replaced Patroni CI hosts and verify they are part of the cluster with the command gitlab-patronictl list
  • Connect to PostgreSQL and list the database sizes: gitlab-psql -c "\l+"
  • Confirm the Patroni ZFS CI cluster is replicating data: gitlab-psql -c "\l+"
  • Confirm the Postgres CI DR Archive cluster is working and replication lag is low: gitlab-psql -c "\l+"
  • Confirm the Postgres CI DR Delayed cluster is working and replication lag is low: gitlab-psql -c "\l+" (see the replication-lag sketch after this list)
  • Confirm the node console-01-sv-gstg.c.gitlab-staging-1.internal can connect to the new Patroni CI hosts by following the "console access" and "granting rails or db access" runbooks
  • Confirm the node console-ro-01-sv-gstg.c.gitlab-staging-1.internal can connect to the new Patroni CI hosts by following the "console access" and "granting rails or db access" runbooks
  • Verify the backup is working and present in Google Cloud Storage (GCS)
  • Verify that disk snapshots for the Patroni CI cluster are being taken (see the backup and snapshot sketch after this list)
  • Review connectivity to the Patroni CI cluster through the Pgbouncer CI and the Pgbouncer CI Sidekiq clusters
  • Verify monitoring: confirm the new hosts appear in the relevant dashboards and that alerting covers them
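The \l+ queries above compare database sizes, which is only a coarse signal. Replication state and lag can also be inspected directly; a minimal sketch, assuming PostgreSQL 10 or later (on 9.6 the pg_last_xlog_* equivalents apply) and that gitlab-psql connects with sufficient privileges:

    # On the Patroni CI leader: every replica should be streaming with low lag
    gitlab-psql -c "SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;"

    # On an archive or delayed replica: time since the last replayed transaction
    gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"

For the Postgres CI DR Delayed cluster the lag is intentional, so the observed delay should be compared against its configured apply delay rather than expected to be near zero.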
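For the backup and snapshot checks, the relevant objects can be listed directly in GCP. This is a sketch only: the bucket name and prefix are illustrative and must be replaced with the actual backup destination, while the project name is taken from the node FQDNs above:

    # Recent backup objects in GCS (bucket and prefix are illustrative)
    gsutil ls -l gs://gitlab-gstg-postgres-backups/patroni-ci/ | tail

    # Most recent disk snapshots for the Patroni CI instances
    gcloud compute snapshots list --project gitlab-staging-1 \
      --filter="sourceDisk ~ patroni-ci" --sort-by=~creationTimestamp --limit=5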

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - N/A

  • No rollback is available/possible for this change.

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.