[GPRD] [2023-05-27 03:00 UTC] CR - Hardware upgrade of Patroni Primary nodes on CI and Main databases (Switchover)
Production Change
Proposed time: 2023-05-27 (Saturday) 03:00 AM UTC == 2023-05-26 (Friday) 08:00 PM PDT
Change Summary
As part of the rollout plan (see: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/18934#steps-to-perform-in-gprd) we have already replaced all Replicas, moving from the old n1-highmem-96 VMs to the new generation n2-highmem-128 VMs in the patroni-main cluster and n2-highmem-96 VMs in the patroni-ci cluster.
The last nodes still running on the old N1 hardware are the current Primary/Writer Patroni nodes, so we need to perform a Switchover operation: a quick change of roles from the current N1 Primary VM to a new N2 VM. This operation will block SQL statements and therefore cause 50x errors on GitLab.com while the N2 nodes are promoted and the master endpoints are reconfigured. We expect this operation to take between 30 seconds and 5 minutes.
This operation carries a very low risk of failure and no risk of data loss: data replication between the nodes will be synchronised before the new node is promoted to Leader, and we can switch the Leader role back to the previous VM at any point.
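As an illustrative pre-check (not a step from this runbook; the host name is simply reused from the steps below), the "replication synchronised before promotion" claim can be verified on the current primary with a standard pg_stat_replication query:

```shell
# Hypothetical helper: the SQL is standard PostgreSQL. Run it on the current
# primary before the switchover and expect the candidate replica to show
# state = 'streaming', sync_state = 'sync', and near-zero replay_lag.
SYNC_CHECK_SQL="SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;"

# On the current Main primary this would be executed as:
#   ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal \
#     "sudo gitlab-psql -c \"$SYNC_CHECK_SQL\""
echo "$SYNC_CHECK_SQL"
```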
We have already performed this process in GSTG during CR #8757 (closed).
In this same CR we will also increase PostgreSQL max_connections to 670 in patroni-main. This parameter is set at the DCS cluster level and requires a restart of each instance to take effect, so we are taking the opportunity to avoid a further maintenance window. We need to increase max_connections because we plan to reduce the number of nodes for better cost efficiency (see &851 (comment 1286123341)); therefore more workload will be routed to each node. The value of 670 is approximately 1/3 higher than the current value of 500, matching the 1/3 increase in CPU count of patroni-main nodes (from 96 to 128 vCPUs per node).
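The sizing arithmetic can be sketched as follows (the proportional-scaling rule is our reading of the rationale above, made explicit here):

```shell
# Scale max_connections in proportion to the vCPU increase (96 -> 128 per node).
# Shell integer arithmetic truncates, so the exact ratio lands on 666;
# the CR rounds this up to 670.
old_conns=500
old_vcpus=96
new_vcpus=128
scaled=$(( old_conns * new_vcpus / old_vcpus ))
echo "scaled max_connections: $scaled (rounded up to 670 in the CR)"
```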
Why should we implement this change as fast as possible?
Since February 2023, GitLab.com's datastore layer has been suffering from pg_primary_cpu saturation spikes in our patroni-main database; see https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892. As @tkuah mentioned: "Even though the tamland now shows no forecasted violation, today we had a series of peaks close to 80% CPU. Opened gitlab-org/gitlab#407823 (closed) for this" (https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892#note_1357859469). With the hardware upgrade we'll be increasing the count and speed of the CPUs of our patroni-main Primary node, which should considerably reduce the risk of CPU saturation. Therefore, a long wait to implement this change goes against customers' interests.
CSMs/TAMs message to customers
This weekend’s database hardware upgrade had to be rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Unfortunately, a long database migration was running to fix an incident, and we couldn’t risk interrupting it. We apologise for any inconvenience. Next weekend, users may experience temporary 50X errors for a brief period during the database maintenance window. As previously communicated, the hardware upgrade is part of a Database Scalability Strategy we are implementing to improve the overall database availability and performance of GitLab.com.
More details can be found at #10694 (closed)
FAQ
Does this maintenance affect GitLab Dedicated customers?
No. It will not impact GitLab Dedicated single-tenancy environments. This maintenance targets only GitLab.com shared infrastructure.
Change Details
- Services Impacted - Service::Patroni, Service::PatroniCI
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @alexander-sosna or @bshah11
- Time tracking - 1 hour and 30 minutes
- Downtime Component - yes
Detailed steps for the change
Prep Tasks
T minus 2 weeks (2023-05-05 02:00 UTC)
- CMOC: Ensure that the maintenance window is scheduled on status.io.
- CMOC: Post an update from the Status.io maintenance site and publish it on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message: In 2 weeks, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- Ask our CSMs in the #customer-success Slack channel about their preferences on how to communicate this change to our main customers:
  - Ping CSM managers using the @cs-tam-mgrs alias to request that they notify the CSMs for our top SaaS customers.
- Share information and a link to the issue in the #whats-happening-at-gitlab Slack channel
- Create communication issue (@kwanyangu)
T minus 1 week (2023-05-12 02:00 UTC)
- CMOC: Communicate 1 week to maintenance
  - Message: Next week, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 3 days (2023-05-17 02:00 UTC)
- CMOC: Communicate 3 days to maintenance
  - Message: We will be conducting a database maintenance activity this Saturday, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- DBRE: Create a merge request in the GPRD TF repo to sync to_be_destroyed into the nodes' run_list in TF
T minus 1 day (2023-05-19 02:00 UTC)
- CMOC: Communicate 1 day to maintenance
  - Message: We will be conducting a database maintenance activity tomorrow, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 2 hours (2023-05-20 01:00 UTC)
- CMOC: Communicate 2 hours to maintenance
  - Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 1 hour (2023-05-20 02:00 UTC)
- CMOC: Communicate 1 hour to maintenance
  - Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
Abort Maintenance due to incident recovery (2023-05-20 03:00 UTC)
- CMOC: Communicate that the maintenance was aborted due to incident recovery, as mentioned at #10694 (comment 1398343384)
  - Message: The maintenance is rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. There is currently a long database migration running to fix an incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/14468) that we can't risk interrupting. We apologise for any inconvenience.
New communication plan
T minus 3 days (2023-05-24 02:00 UTC)
- CMOC: Communicate 3 days to maintenance
  - Message: We will be conducting a database maintenance activity this Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- DBRE: Create a merge request in the GPRD TF repo to sync to_be_destroyed into the nodes' run_list in TF
T minus 1 day (2023-05-26 02:00 UTC)
- CMOC: Communicate 1 day to maintenance
  - Message: We will be conducting a database maintenance activity tomorrow, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 2 hours (2023-05-27 01:00 UTC)
- CMOC: Communicate 2 hours to maintenance
  - Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 1 hour (2023-05-27 02:00 UTC)
- CMOC: Communicate 1 hour to maintenance
  - Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 15 minutes (2023-05-27 02:45 UTC)
- Confirm the current Primary nodes of both clusters:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Open Patroni logs on the old and new Primary DBs:
  ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
- Increase PostgreSQL max_connections to ~670 (1/3 above the current value of 500) in patroni-main:
  - Merge MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3506
  - Change postgresql.parameters.max_connections in the patroni-main cluster DCS:
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    sudo gitlab-patronictl edit-config -s "postgresql.parameters.max_connections = 670"
    sudo gitlab-patronictl show-config
Change Steps - steps to take to execute the change
[2023-05-27 03:00 UTC] Patroni Primary Switchover
Estimated Time to Complete (mins) - 60 minutes
- Set label change::in-progress: /label ~change::in-progress
- Silence alerts: execute /chatops run pager pause in #production
- CMOC: Communicate the start of the maintenance
  - Message: The database maintenance is starting now. The database Primary node switchover should happen at any moment in the next hour, and during the switchover users might see 50X errors for a very brief span of time. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- Restart Patroni Leader candidates to apply the new settings and make them run on the latest deployed PG minor version (this might cause errors in Rails, but it should be transparent for customers due to retries on healthy replicas):
  - Main:
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-main-pg12-2004 patroni-main-2004-101-db-gprd.c.gitlab-production.internal"
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  - CI:
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-ci-pg12-2004 patroni-ci-2004-101-db-gprd.c.gitlab-production.internal"
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Restart the Registry cluster to make it run on the latest deployed PG minor version (quick downtime for the registry database):
  ## Restart Registry cluster
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-pg12-patroni-registry"
  ## Wait until the Registry cluster is running again
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that the server_version of the running postmasters on the future Primary nodes is now 12.14 or later:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
- Shutdown Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start):
  ## Disable chef-client (to avoid Chef auto-starting pgbouncer)
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
  ## Shutdown pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
  ## Check that pgbouncer processes were killed and are not running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Wait for patroni-main-2004-101-db-gprd and patroni-ci-2004-101-db-gprd to get back in sync:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Switchover both CI and Main primary nodes:
  - Connect to patroni-ci-2004-101-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-ci-2004-05-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-101-db-gprd.c.gitlab-production.internal
  - Connect to patroni-main-2004-101-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-main-2004-04-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-101-db-gprd.c.gitlab-production.internal
- Check the new cluster status:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that Replicas are in SYNC with the new Primary (TL should be updated)
- Check in postgresql-replication-overview that the replication slots were created on the new Primary node, or issue: sudo gitlab-psql -c "select * from pg_replication_slots"
- Validate master endpoints update in DNS:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
- Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish):
  ## Enable chef-client
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-enable"
  ## Start pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
  ## Check that pgbouncer processes were started and are running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Validate the switch of Write/DML operations to the new Primary instances
- Resume alerts: execute /chatops run pager resume in #production
- CMOC: Communicate the end of the maintenance
  - Message: The database switchover is now complete, and we expect all SQL statements to be routed to the new nodes. The site is back up and we're continuing to verify that all systems are functioning correctly. Thank you for your patience.
- Mark old Primary N1 nodes to_be_destroyed:
  knife node run_list add patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife node run_list add patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
- Merge MR to sync to_be_destroyed into the nodes' run_list - MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816
- EOC: Confirm with the EOC that no more 50x errors are being logged
- CMOC: Communicate the end of the maintenance - click "Finish Maintenance" and send the following:
  - Message: GitLab.com's database layer maintenance is now complete, and we're fully back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- Set label change::scheduled: /label ~change::scheduled
- Create MR to destroy nodes patroni-main-2004-04 and patroni-ci-2004-05
- Schedule node destruction after a grace period of 1 week
[2023-06-05 02:00 UTC] Destroy old N1 VMs
- Merge MR to destroy the old N1 primary nodes: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5887
- Set label change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 minutes
- CMOC: Communicate Switchback
  - Message: Due to an issue during the maintenance, we have initiated a rollback of the Primary node hardware upgrade. We will send another update within the next 30 minutes.
- Remove to_be_destroyed from the old Primary N1 nodes' run_list:
  knife node run_list remove patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife node run_list remove patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
- Shutdown Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start):
  ## Disable chef-client (to avoid Chef auto-starting pgbouncer)
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
  ## Shutdown pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
  ## Check that pgbouncer processes were killed and are not running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Switchback Primary nodes:
  - Connect to patroni-ci-2004-05-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-ci-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-05-db-gprd.c.gitlab-production.internal
  - Connect to patroni-main-2004-04-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-main-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-04-db-gprd.c.gitlab-production.internal
- Check the new cluster status:
  ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that Replicas are in SYNC with the new Primary (TL should be updated)
- Check in postgresql-replication-overview that the replication slots were created on the new Primary node, or issue: sudo gitlab-psql -c "select * from pg_replication_slots"
- Validate master endpoints update in DNS:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
- Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish):
  ## Enable chef-client
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-enable"
  ## Start pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
  ## Check that pgbouncer processes were started and are running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Resume alerts: execute /chatops run pager resume in #production
- EOC: Confirm with the EOC that no more 50x errors are being logged
- CMOC: Communicate Switchback Completed
  - Message: GitLab.com's database rollback is now complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- Revert the MR that synced to_be_destroyed into the nodes' run_list: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816
- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe - during the CR execution
- Metric: Patroni log on candidate and primary nodes (not a metric)
  - Location: /var/log/gitlab/patroni/patroni.log
  - What changes to this metric should prompt a rollback: if the candidate node takes longer than 4 minutes to get promoted to Leader, switch back to the former Leader
- Metric: PostgreSQL Replication Overview
  - Location: https://dashboards.gitlab.net/d/000000244/postgresql-replication-overview?orgId=1
  - What changes to this metric should prompt a rollback: if the Primary doesn't change to node 101 on the Main and CI clusters, switch back to the old master. Confirm with gitlab-patronictl list before proceeding, as the dashboard might be outdated.
- Metric: rails_primary_sql SLI Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1, https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1 and https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1
  - What changes to this metric should prompt a rollback: sustained violation of the rails_primary_sql SLI Apdex for more than 5 minutes can prompt a rollback (however, we need to consider whether the cause is known and whether it can be mitigated during the maintenance period)
- Metric: patroni-ci Service Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1, https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1 and https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1
  - What changes to this metric should prompt a rollback: if the replicas are in SYNC with the new master and we still observe sustained violation of the patroni-ci Service Apdex metrics for more than 5 minutes, then we need to consider a rollback
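The 4-minute promotion deadline from the Patroni log check above can be watched mechanically. A minimal sketch, assuming a simple polling approach (the helper name and the demo log line are illustrative, not actual Patroni output):

```shell
# Poll a log file for a pattern until a deadline (in seconds) expires.
# Returns 0 as soon as the pattern appears, 1 if the deadline is hit.
wait_for_line() {
  file=$1; pattern=$2; deadline=$3; waited=0
  while [ "$waited" -lt "$deadline" ]; do
    grep -q "$pattern" "$file" 2>/dev/null && return 0
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 1
}

# During the CR this would target /var/log/gitlab/patroni/patroni.log with a
# 240 s (4 minute) deadline; here we demo against a temporary file.
demo_log=$(mktemp)
echo "INFO: promoted self to leader" >> "$demo_log"
wait_for_line "$demo_log" "promoted" 5 && echo "promotion observed"
rm -f "$demo_log"
```

If the helper returns non-zero after 240 s, that is the cue to switch back to the former Leader per the rollback criterion above.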
Key metrics to observe after the CR execution, during the grace period
- Metric: Patroni Dashboards
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd and https://dashboards.gitlab.net/d/patroni-ci/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: sustained saturation of Primary-related workload, or sustained violation of any SLI Apdex quality metric if the peak matches saturation in at least one of the resource usage metrics below
Key resource usage metrics
- Metric: Replica nodes CPU Load (processes per core)
  - Location: node_load1
  - What changes to this metric should prompt a rollback: CPU load avg > 0.7 (per core) for 15 minutes or more
- Metric: Replica nodes CPU Usage (% of all CPUs)
  - Location: node_cpu_utilization
  - What changes to this metric should prompt a rollback: avg CPU utilization > 70% for 15 minutes or more
- Metric: Replica nodes Memory Thrashing (Swap in/out)
  - Location: node_vmstat_pswpin, node_vmstat_pswpout
  - What changes to this metric should prompt a rollback: spikes of swapping activity > 0 for 5 minutes or more
- Metric: Replica nodes I/O wait
  - Location: node_disk_read_time_seconds_total, node_disk_write_time_seconds_total
  - What changes to this metric should prompt a rollback: avg I/O wait > 10 ms (or 0.01 s) for 2 minutes or more, but only if caused by intense I/O activity
- Metric: Replica nodes I/O Throughput in MB/s
  - Location: /dev/sdb node_disk_read_bytes_total, /dev/sdb node_disk_written_bytes_total
  - What changes to this metric should prompt a rollback: I/O throughput > 560 MB/s (70% of the 800 MB/s limit*) for 15 minutes or more
- Metric: Replica nodes IOPS
  - Location: /dev/sdb node_disk_reads_completed_total, /dev/sdb node_disk_writes_completed_total
  - What changes to this metric should prompt a rollback: IOPS > 10500 (70% of the 15000 IOPS limit*) for 15 minutes or more
- Metric: Primary nodes Network throughput
  - Location: node_network_receive_bytes_total, node_network_transmit_bytes_total
  - What changes to this metric should prompt a rollback: sustained network throughput > 11.2 Gbps (1.4 GB/s), 70% of the VM limit of 16 Gbps (2 GB/s)*, for 15 minutes or more
* Network and Storage I/O performance limits in GPRD are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine-type limit for pd-performance, not the block-device limits
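As a sanity check, the 70% rollback thresholds quoted in the resource metrics above follow directly from the stated VM limits:

```shell
# 70% of each documented limit, matching the thresholds in the list above.
awk 'BEGIN {
  printf "I/O throughput threshold: %g MB/s\n", 0.70 * 800    # 560 MB/s
  printf "IOPS threshold:           %g\n",      0.70 * 15000  # 10500
  printf "Network threshold:        %g Gbps\n", 0.70 * 16     # 11.2 Gbps
}'
```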
Change Reviewer checklist
- Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
- Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.