Upgrading Postgres GCP Resources
Reference gl-infra/infrastructure#7609
C1
Production Change - Criticality 1

Change Objective | Describe the objective of the change |
---|---|
Change Type | Upgrading Postgres infrastructure to n1-highmem-96 |
Services Impacted | ~"Service:Postgres" |
Change Team Members | SRE - Ongres |
Change Severity | C1 |
Buddy check or tested in staging | @emanuel_ongres |
Schedule of the change | - |
Duration of the change | |
Status
- 2019-09-04 15:00: Apply changes to all replicas (02, 03, 04, 05, 06, 07, 09, 12, 13, 14)
- Apply change to the leader (11)
Pre-checks
- Get the list of replicas by running `gitlab-patronictl list` or `dig +short @127.0.0.1 -p8600 replica.patroni.service.consul.`
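To cross-check the replica checklist below against the live cluster, the `gitlab-patronictl list` table can be filtered down to the members that are not the leader. This is a minimal sketch, assuming the usual Patroni table layout with `Member` and `Role` header cells; columns are located by header name, and replicas may show either an empty Role or a value such as "Replica" depending on the Patroni version. `replica_members` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print every member whose Role is not "Leader", reading
# `gitlab-patronictl list` output on stdin. replica_members is hypothetical.
replica_members() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # First header row: remember which columns hold Member and Role.
    !m && /Member/ && /Role/ {
      for (i = 1; i <= NF; i++) {
        h = trim($i)
        if (h == "Member") m = i
        if (h == "Role") r = i
      }
      next
    }
    # Data rows: skip separators (empty member) and the leader.
    m && trim($m) != "" && trim($r) != "Leader" { print trim($m) }'
}
```

Usage: `gitlab-patronictl list | replica_members`.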
Plan
On Replicas
Pick a replica:
Please update the checklist if anything changes before the plan is executed.
- patroni-02-db-gprd.c.gitlab-production.internal
- patroni-03-db-gprd.c.gitlab-production.internal
- patroni-04-db-gprd.c.gitlab-production.internal
- patroni-05-db-gprd.c.gitlab-production.internal
- patroni-06-db-gprd.c.gitlab-production.internal
- patroni-07-db-gprd.c.gitlab-production.internal
- patroni-09-db-gprd.c.gitlab-production.internal
- patroni-12-db-gprd.c.gitlab-production.internal
- patroni-13-db-gprd.c.gitlab-production.internal
- patroni-14-db-gprd.c.gitlab-production.internal
- patroni-11-db-gprd.c.gitlab-production.internal (leader)
- (OnGres) Drain the replica from database traffic (see runbook):
  - `consul maint -enable -service=patroni-replica -reason="Resource Upgrade #"`
  - `consul maint -enable -service=db-replica -reason="Resource Upgrade #"`
  - `consul maint -enable -service=db-replica-1 -reason="Resource Upgrade #"`
- Check that traffic was drained (wait until this returns): `while [ $(sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l) -gt 0 ]; do echo "."; sleep 1; done`
- Stop the PgBouncers and Patroni:
  - `systemctl stop pgbouncer.service`
  - `systemctl stop pgbouncer-1.service`
  - `systemctl stop patroni.service`
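The replica steps above can be sketched as a few bash functions. The Consul service names, reason string, systemd units and database name are copied from the plan; everything else is an assumption for illustration: the function names, the `DRY_RUN=1` switch (prints commands instead of running them), the 600-second default timeout on the drain wait, and the `consul maint -disable` branch, which the plan does not list but which is the documented counterpart for putting a service back into rotation. Like the plan's own `systemctl` commands, these need a root shell.

```shell
#!/usr/bin/env bash
# Sketch of the replica drain/stop steps. Names from the plan: the Consul
# services, REASON, the units, gitlabhq_production. Hypothetical: DRY_RUN,
# the default timeout, and all function names.
CONSUL_SERVICES=(patroni-replica db-replica db-replica-1)
REASON='Resource Upgrade #'
UNITS=(pgbouncer.service pgbouncer-1.service patroni.service)

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: consul_maint enable|disable
consul_maint() {
  local mode=$1 svc
  case $mode in
    enable | disable) ;;
    *) echo "usage: consul_maint enable|disable" >&2; return 2 ;;
  esac
  for svc in "${CONSUL_SERVICES[@]}"; do
    if [ "$mode" = enable ]; then
      run consul maint -enable "-service=$svc" "-reason=$REASON" || return 1
    else
      run consul maint -disable "-service=$svc" || return 1
    fi
  done
}

# Count gitlabhq_production clients in `SHOW CLIENTS;` output on stdin,
# ignoring gitlab-monitor (same user column as the one-liner in the plan).
count_app_clients() {
  awk -F'|' '
    /gitlabhq_production/ {
      user = $2
      gsub(/^[ \t]+|[ \t]+$/, "", user)
      if (user !~ /gitlab-monitor/) n++
    }
    END { print n + 0 }'
}

show_clients() { sudo pgb-console -c 'SHOW CLIENTS;'; }

# Usage: wait_until_drained [TIMEOUT_SECONDS]  (returns 1 on timeout)
wait_until_drained() {
  local timeout=${1:-600} waited=0 n
  while :; do
    n=$(show_clients | count_app_clients)
    [ "$n" -eq 0 ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "still $n client(s) after ${timeout}s" >&2
      return 1
    fi
    echo "."
    sleep 1
    waited=$((waited + 1))
  done
}

# Stop the PgBouncers, then Patroni, and check each unit is really down.
stop_db_units() {
  local unit
  for unit in "${UNITS[@]}"; do
    run systemctl stop "$unit" || return 1
    # is-active exits 0 only while the unit is still active.
    if [ "${DRY_RUN:-0}" != 1 ] && systemctl is-active --quiet "$unit"; then
      echo "$unit is still active" >&2
      return 1
    fi
  done
}
```

A typical sequence would be `consul_maint enable && wait_until_drained && stop_db_units`; running it once with `DRY_RUN=1` shows the exact commands first. The timeout is there so a stuck client surfaces as an error instead of an endless row of dots.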
Chef/Terraform changes for each instance:
Update the Chef/Terraform configuration to change the machine type of the Patroni instances to n1-highmem-96.
- This step needs the team that works on the Terraform/Chef side (if necessary): @Finotto
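GCE only allows the machine type of a stopped instance to be changed, so each host is down during this step whichever tool applies it. Once Chef/Terraform have been applied, GCE itself can confirm the result. This sketch assumes the instance name is the first label of the hostname (e.g. `patroni-02-db-gprd`) and the project is `gitlab-production` (taken from the `.c.gitlab-production.internal` DNS suffix); the zone is not stated in the plan, so it is passed in. `check_machine_type` and `short_machine_type` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: compare the machine type GCE reports with the expected one.
# Assumed: project gitlab-production, instance name = hostname's first label,
# zone supplied by the caller.

# Reduce a machineType URL (.../machineTypes/n1-highmem-96) to its last segment.
short_machine_type() {
  printf '%s\n' "${1##*/}"
}

# Usage: check_machine_type INSTANCE ZONE [EXPECTED]
check_machine_type() {
  local instance=$1 zone=$2 expected=${3:-n1-highmem-96} url actual
  url=$(gcloud compute instances describe "$instance" \
    --project=gitlab-production --zone="$zone" \
    --format='value(machineType)') || return 1
  actual=$(short_machine_type "$url")
  if [ "$actual" != "$expected" ]; then
    echo "$instance: $actual (expected $expected)" >&2
    return 1
  fi
}
```

For example, `check_machine_type patroni-02-db-gprd ZONE` for each host in the checklist, with `ZONE` replaced by the instance's actual zone.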
On Leader
To upgrade the leader, we first need to switch over to one of the replicas. To avoid a hiccup for clients, the PgBouncers have to be PAUSEd during the switchover.
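`PAUSE` and `RESUME` are PgBouncer admin-console commands: `PAUSE db` does not return until PgBouncer has released all server connections for that database, and new client queries wait until `RESUME`. A minimal sketch of wrapping a step between the two, assuming `pgb-console -c` (as used in the replica drain check) reaches the PgBouncer instance to pause; the plan does not list which PgBouncer hosts or instances that is, so this would be repeated for each one. `pgb_admin` and `with_pgbouncer_paused` are hypothetical names, and pausing only `gitlabhq_production` rather than everything is a choice made for this example.

```shell
#!/usr/bin/env bash
# Sketch: run a command while PgBouncer holds traffic for gitlabhq_production.
# pgb_admin and with_pgbouncer_paused are hypothetical helpers.

pgb_admin() { sudo pgb-console -c "$1"; }

# Usage: with_pgbouncer_paused COMMAND [ARGS...]
with_pgbouncer_paused() {
  local db=gitlabhq_production rc=0
  # Blocks until PgBouncer has released its server connections for $db.
  pgb_admin "PAUSE $db;" || return 1
  "$@" || rc=$?
  # Resume even if the wrapped command failed, so clients are not left waiting.
  if ! pgb_admin "RESUME $db;"; then
    echo "RESUME failed; resume PgBouncer manually" >&2
    [ "$rc" -eq 0 ] && rc=1
  fi
  return "$rc"
}
```

Since the switchover itself is run from the chef repo on another machine, one way to use this is to hold the pause while the operator acts: `with_pgbouncer_paused read -rp 'Run the switchover, then press Enter '`.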
- Perform a graceful switchover (run locally from the chef repo): `bin/graceful-failover gprd patroni-01-db-gprd.c.gitlab-production.internal`. See the runbook for additional information.
- Apply the same procedure as for the replicas (after the switchover, this node is a replica).
- patroni-11-db-gprd.c.gitlab-production.internal: `gitlab-patronictl switchover`
Solving possible issues on the leader
- patroni-11-db-gprd.c.gitlab-production.internal: `gitlab-patronictl switchover`. See the runbook:
- Have Patroni restart the database: `gitlab-patronictl restart pg-ha-cluster $(hostname -f) --force`
- Verify that the database has recovered: `gitlab-psql -c 'select 1'` should not return an error.
- Verify that the replication lag is below 100 MB (wait until this returns): `while [ $(gitlab-patronictl list | grep $(hostname -f) | cut -d '|' -f 7 | awk '{$1=$1};1') -gt 100 ]; do echo "."; sleep 1; done`
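The lag check can also locate the `Member` and `Lag in MB` columns by header name instead of by a fixed field number, since column positions can shift between Patroni versions. It also treats an empty or non-numeric lag as "not yet", where the one-liner's `[ -gt 100 ]` would error out and stop waiting. A minimal sketch, assuming the usual `gitlab-patronictl list` table and that member names match `hostname -f` (as the restart command implies); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait until this member's "Lag in MB" is a number below LIMIT.
# patroni_list, member_lag_mb, lag_ok, wait_for_lag_below are hypothetical.

patroni_list() { gitlab-patronictl list; }

# Print the "Lag in MB" cell for MEMBER from `patronictl list` output on stdin.
member_lag_mb() {
  awk -F'|' -v member="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /Member/ && /Lag in MB/ {
      for (i = 1; i <= NF; i++) {
        h = trim($i)
        if (h == "Member") m = i
        if (h == "Lag in MB") l = i
      }
      next
    }
    m && trim($m) == member { print trim($l) }'
}

# True when LAG is a number strictly below LIMIT; empty or "unknown" fails.
lag_ok() {
  awk -v lag="$1" -v limit="$2" \
    'BEGIN { exit !(lag ~ /^[0-9]+(\.[0-9]+)?$/ && lag + 0 < limit + 0) }'
}

# Usage: wait_for_lag_below [LIMIT_MB]
wait_for_lag_below() {
  local limit=${1:-100} host lag
  host=$(hostname -f)
  until lag=$(patroni_list | member_lag_mb "$host"); lag_ok "$lag" "$limit"; do
    echo "."
    sleep 1
  done
}
```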
Finalize
- Follow up on the overall resource usage of the instance.
- Open an issue to work out the configuration tweaks needed to make use of the new resources (#7609 has valid points and other considerations).
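A first, quick data point for the follow-up is what the guest OS actually sees after the upgrade. n1-highmem-96 provides 96 vCPUs and 624 GB of memory; `MemTotal` in `/proc/meminfo` reads somewhat lower because the kernel reserves part of it. A minimal sketch; `mem_total_gib` and `report_resources` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: print the vCPU count and memory visible to the guest OS.
# mem_total_gib and report_resources are hypothetical helpers.

# MemTotal (reported in kB) from a meminfo-format file, rounded to GiB.
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

report_resources() {
  echo "vCPUs:  $(nproc)"
  echo "memory: $(mem_total_gib) GiB"
}
```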