[GPRD] [2023-05-27 03:00 UTC] CR - Hardware upgrade of Patroni Primary nodes on CI and Main databases (Switchover)
Production Change
Proposed time: 2023-05-27 (Saturday) 03:00 AM UTC == 2023-05-26 (Friday) 08:00 PM PDT
Change Summary
As part of the rollout plan (see: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/18934#steps-to-perform-in-gprd) we have already replaced all Replicas, moving from the old n1-highmem-96 VMs to the new generation n2-highmem-128 VMs in the patroni-main cluster and n2-highmem-96 VMs in the patroni-ci cluster.
The last nodes still running on the old N1 hardware are the current Primary/Writer Patroni nodes, so we need to perform a Switchover operation: a quick change of roles from the current N1 Primary VM to a new N2 VM. This operation will block SQL statements and therefore cause 50x errors on GitLab.com while the N2 nodes are promoted and the master endpoints are reconfigured. We expect this operation to take between 30 seconds and 5 minutes.
This operation carries a very low risk of failure and no risk of data loss: data replication between the nodes will be synchronised before the new node is promoted to Leader, and we can switch the Leader role back to the previous VM at any point.
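As an illustrative pre-check (not a step from this runbook; the host name is simply reused from the steps below), the "replication synchronised before promotion" claim can be verified on the current primary with a standard pg_stat_replication query:

```shell
# Hypothetical helper: the SQL is standard PostgreSQL. Run it on the current
# primary before the switchover and expect the candidate replica to show
# state = 'streaming', sync_state = 'sync', and near-zero replay_lag.
SYNC_CHECK_SQL="SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;"

# On the current Main primary this would be executed as:
#   ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal \
#     "sudo gitlab-psql -c \"$SYNC_CHECK_SQL\""
echo "$SYNC_CHECK_SQL"
```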
We have already performed this process in GSTG during CR #8757 (closed).
In this same CR we will also increase PostgreSQL max_connections to 670 in patroni-main. This parameter is set at the DCS cluster level and requires a restart of each instance to take effect, so we are taking the opportunity to avoid a further maintenance window. We need to increase max_connections because we plan to reduce the number of nodes for better cost efficiency (see &851 (comment 1286123341)); therefore more workload will be routed to each node. The value of 670 is approximately 1/3 higher than the current value of 500, matching the 1/3 increase in CPU count of patroni-main nodes (from 96 to 128 vCPUs per node).
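The sizing arithmetic can be sketched as follows (the proportional-scaling rule is our reading of the rationale above, made explicit here):

```shell
# Scale max_connections in proportion to the vCPU increase (96 -> 128 per node).
# Shell integer arithmetic truncates, so the exact ratio lands on 666;
# the CR rounds this up to 670.
old_conns=500
old_vcpus=96
new_vcpus=128
scaled=$(( old_conns * new_vcpus / old_vcpus ))
echo "scaled max_connections: $scaled (rounded up to 670 in the CR)"
```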
Why should we implement this change as fast as possible?
Since February 2023, GitLab.com's datastore layer has been suffering from pg_primary_cpu saturation spikes in our patroni-main database; see https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892. As @tkuah mentioned: "Even though the tamland now shows no forecasted violation, today we had a series of peaks close to 80% CPU. Opened gitlab-org/gitlab#407823 (closed) for this" (https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892#note_1357859469). With the hardware upgrade we'll be increasing the count and speed of the CPUs of our patroni-main Primary node, which should considerably reduce the risk of CPU saturation. Therefore, a long wait to implement this change goes against customers' interests.
CSMs/TAMs message to customers
This weekend’s database hardware upgrade had to be rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Unfortunately, a long database migration was running to fix an incident, and we couldn’t risk interrupting it. We apologise for any inconvenience. Next weekend, users may experience temporary 50X errors for a brief period during the database maintenance window. As previously communicated, the hardware upgrade is part of a Database Scalability Strategy we are implementing to improve the overall database availability and performance of GitLab.com.
More details can be found at #10694 (closed)
FAQ
Does this maintenance affect GitLab Dedicated customers?
No. It will not impact GitLab Dedicated single-tenancy environments. This maintenance targets only GitLab.com shared infrastructure.
Change Details
- Services Impacted - Service::Patroni, Service::PatroniCI
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @alexander-sosna or @bshah11
- Time tracking - 1 hour and 30 minutes
- Downtime Component - yes
Detailed steps for the change
Prep Tasks
T minus 2 weeks (2023-05-05 02:00 UTC)
- CMOC: Ensure that the maintenance window is scheduled on status.io.
- CMOC: Post an update from the Status.io maintenance site and publish it on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message: In 2 weeks, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- Ask our CSMs in the #customer-success Slack channel about their preferences on how to communicate this change to our main customers:
  - Ping CSM managers using the @cs-tam-mgrs alias to request that they notify the CSMs for our top SaaS customers.
- Share information and a link to the issue in the #whats-happening-at-gitlab Slack channel
- Create communication issue (@kwanyangu)
T minus 1 week (2023-05-12 02:00 UTC)
- CMOC: Communicate 1 week to maintenance
  - Message: Next week, as part of a planned maintenance window on 2023-05-20 from 03:00 to 04:00 UTC, we will perform a hardware upgrade for the GitLab.com datastore. Users may experience temporary 50X errors for a brief period during this window. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 3 days (2023-05-17 02:00 UTC)
- CMOC: Communicate 3 days to maintenance
  - Message: We will be conducting a database maintenance activity this Saturday, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- DBRE: Create a merge request in the GPRD TF repo to sync to_be_destroyed into the nodes' run_list in TF
T minus 1 day (2023-05-19 02:00 UTC)
- CMOC: Communicate 1 day to maintenance
  - Message: We will be conducting a database maintenance activity tomorrow, 2023-05-20, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 2 hours (2023-05-20 01:00 UTC)
- CMOC: Communicate 2 hours to maintenance
  - Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 1 hour (2023-05-20 02:00 UTC)
- CMOC: Communicate 1 hour to maintenance
  - Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
Abort Maintenance due to incident recovery (2023-05-20 03:00 UTC)
- CMOC: Communicate that the maintenance was aborted due to incident recovery, as mentioned at #10694 (comment 1398343384)
  - Message: The maintenance is rescheduled to next Saturday, 2023-05-27, from 03:00 to 04:00 UTC. There is currently a long database migration running to fix an incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/14468) that we can't risk interrupting. We apologise for any inconvenience.
New communication plan
T minus 3 days (2023-05-24 02:00 UTC)
- CMOC: Communicate 3 days to maintenance
  - Message: We will be conducting a database maintenance activity this Saturday, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- DBRE: Create a merge request in the GPRD TF repo to sync to_be_destroyed into the nodes' run_list in TF
T minus 1 day (2023-05-26 02:00 UTC)
- CMOC: Communicate 1 day to maintenance
  - Message: We will be conducting a database maintenance activity tomorrow, 2023-05-27, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 2 hours (2023-05-27 01:00 UTC)
- CMOC: Communicate 2 hours to maintenance
  - Message: We will be conducting a database maintenance activity in 2 hours, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 1 hour (2023-05-27 02:00 UTC)
- CMOC: Communicate 1 hour to maintenance
  - Message: We will be conducting a database maintenance activity in an hour, from 03:00 to 04:00 UTC. Users might see 50X errors for a very brief span of time during this window. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
T minus 15 minutes (2023-05-27 02:45 UTC)
- Confirm the current Primary nodes of both clusters:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Open Patroni logs on the old and new Primary DBs:
  ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo tail -f /var/log/gitlab/patroni/patroni.log"
- Increase PostgreSQL max_connections to ~670 (1/3 above the current value of 500) in patroni-main:
  - Merge MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3506
  - Change postgresql.parameters.max_connections in the patroni-main cluster DCS:
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    sudo gitlab-patronictl edit-config -s "postgresql.parameters.max_connections = 670"
    sudo gitlab-patronictl show-config
Change Steps - steps to take to execute the change
[2023-05-27 03:00 UTC] Patroni Primary Switchover
Estimated Time to Complete (mins) - 60 minutes
- Set label change::in-progress: /label ~change::in-progress
- Silence alerts: execute /chatops run pager pause in #production
- CMOC: Communicate the start of the maintenance
  - Message: The database maintenance is starting now. The database Primary node switchover should happen at any moment in the next hour, and during the switchover users might see 50X errors for a very brief span of time. For details see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694
- Restart Patroni Leader candidates to apply the new settings and make them run on the latest deployed PG minor version (this might cause errors in Rails, but it should be transparent for customers due to retries on healthy replicas):
  - Main:
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-main-pg12-2004 patroni-main-2004-101-db-gprd.c.gitlab-production.internal"
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  - CI:
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-patroni-ci-pg12-2004 patroni-ci-2004-101-db-gprd.c.gitlab-production.internal"
    ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Restart the Registry cluster to make it run on the latest deployed PG minor version (quick downtime for the registry database):
  ## Restart Registry cluster
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl restart --force gprd-pg12-patroni-registry"
  ## Wait until the Registry cluster is running again
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that the server_version of the running postmasters on the future Primary nodes is now 12.14 or later:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
  ssh patroni-v12-registry-01-db-gprd.c.gitlab-production.internal "sudo gitlab-psql -c \"show server_version;\""
- Shutdown Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start):
  ## Disable chef-client (to avoid Chef auto-starting pgbouncer)
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
  ## Shutdown pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
  ## Check that pgbouncer processes were killed and are not running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Wait for patroni-main-2004-101-db-gprd and patroni-ci-2004-101-db-gprd to get back in sync:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Switchover both CI and Main primary nodes:
  - Connect to patroni-ci-2004-101-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-ci-2004-05-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-101-db-gprd.c.gitlab-production.internal
  - Connect to patroni-main-2004-101-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-main-2004-04-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-101-db-gprd.c.gitlab-production.internal
- Check the new cluster status:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-101-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that Replicas are in SYNC with the new Primary (TL should be updated)
- Check in postgresql-replication-overview that the replication slots were created on the new Primary node, or issue: sudo gitlab-psql -c "select * from pg_replication_slots"
- Validate master endpoints update in DNS:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
- Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish):
  ## Enable chef-client
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-enable"
  ## Start pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
  ## Check that pgbouncer processes were started and are running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Validate the switch of Write/DML operations to the new Primary instances
- Resume alerts: execute /chatops run pager resume in #production
- CMOC: Communicate the end of the maintenance
  - Message: The database switchover is now complete, and we expect all SQL statements to be routed to the new nodes. The site is back up and we're continuing to verify that all systems are functioning correctly. Thank you for your patience.
- Mark old Primary N1 nodes to_be_destroyed:
  knife node run_list add patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife node run_list add patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
- Merge MR to sync to_be_destroyed into the nodes' run_list - MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816
- EOC: Confirm with the EOC that no more 50x errors are being logged
- CMOC: Communicate the end of the maintenance - click "Finish Maintenance" and send the following:
  - Message: GitLab.com's database layer maintenance is now complete, and we're fully back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- Set label change::scheduled: /label ~change::scheduled
- Create MR to destroy nodes patroni-main-2004-04 and patroni-ci-2004-05
- Schedule node destruction after a grace period of 1 week
[2023-06-05 02:00 UTC] Destroy old N1 VMs
- Merge MR to destroy the old N1 primary nodes: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5887
- Set label change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 minutes
- CMOC: Communicate Switchback
  - Message: Due to an issue during the maintenance, we have initiated a rollback of the Primary node hardware upgrade. We will send another update within the next 30 minutes.
- Remove to_be_destroyed from the old Primary N1 nodes' run_list:
  knife node run_list remove patroni-main-2004-04-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife node run_list remove patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "role[gprd-base-db-patroni-to_be_destroyed]"
  knife ssh "roles:gprd-base-db-patroni-2004" "sudo chef-client"
- Shutdown Primary/Master pgbouncers (including sidekiq) for CI and MAIN (downtime start):
  ## Disable chef-client (to avoid Chef auto-starting pgbouncer)
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-disable \"CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/10694\""
  ## Shutdown pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo /usr/local/bin/pgb-console -c \"SHUTDOWN;\""
  ## Check that pgbouncer processes were killed and are not running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Switchback Primary nodes:
  - Connect to patroni-ci-2004-05-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-ci-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-ci-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-ci-2004-05-db-gprd.c.gitlab-production.internal
  - Connect to patroni-main-2004-04-db-gprd.c.gitlab-production.internal and run:
    knife ssh "roles:gprd-base-db-patroni-main-2004" "sudo gitlab-psql -c \"CHECKPOINT;\""
    sudo gitlab-patronictl switchover --master patroni-main-2004-101-db-gprd.c.gitlab-production.internal --candidate patroni-main-2004-04-db-gprd.c.gitlab-production.internal
- Check the new cluster status:
  ssh patroni-main-2004-04-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  ssh patroni-ci-2004-05-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
- Check that Replicas are in SYNC with the new Primary (TL should be updated)
- Check in postgresql-replication-overview that the replication slots were created on the new Primary node, or issue: sudo gitlab-psql -c "select * from pg_replication_slots"
- Validate master endpoints update in DNS:
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer" "dig @localhost -p 8600 +short master.patroni.service.consul. SRV"
  ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal "dig @localhost -p 8600 +short master.patroni-ci.service.consul."
  knife ssh "role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "dig @localhost -p 8600 +short master.patroni-ci.service.consul. SRV"
- Start Primary/Master pgbouncer services (including sidekiq) for CI and MAIN (downtime finish):
  ## Enable chef-client
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo chef-client-enable"
  ## Start pgbouncer processes
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "sudo systemctl start pgbouncer"
  ## Check that pgbouncer processes were started and are running
  knife ssh "role:gprd-base-db-pgbouncer-sidekiq OR role:gprd-base-db-pgbouncer OR role:gprd-base-db-pgbouncer-ci OR role:gprd-base-db-pgbouncer-sidekiq-ci" "ps -ef | grep pgbouncer.ini | grep gitlab"
  ssh_cluster_regex.sh "pgbouncer(-sidekiq)?(-ci)?-\d\d-db-gprd" "sudo tail -10 /var/log/gitlab/pgbouncer/pgbouncer.log"
- Resume alerts: execute /chatops run pager resume in #production
- EOC: Confirm with the EOC that no more 50x errors are being logged
- CMOC: Communicate Switchback Completed
  - Message: GitLab.com's database rollback is now complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- Revert the MR that synced to_be_destroyed into the nodes' run_list: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/5816
- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe - during the CR execution
- Metric: Patroni log on candidate and primary nodes (not a metric)
  - Location: /var/log/gitlab/patroni/patroni.log
  - What changes to this metric should prompt a rollback: if the candidate node takes longer than 4 minutes to get promoted to Leader, switch back to the former Leader
- Metric: PostgreSQL Replication Overview
  - Location: https://dashboards.gitlab.net/d/000000244/postgresql-replication-overview?orgId=1
  - What changes to this metric should prompt a rollback: if the Primary doesn't change to node 101 on the Main and CI clusters, switch back to the old master. Confirm with gitlab-patronictl list before proceeding, as the dashboard might be outdated.
- Metric: rails_primary_sql SLI Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1, https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1 and https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1
  - What changes to this metric should prompt a rollback: sustained violation of the rails_primary_sql SLI Apdex for more than 5 minutes can prompt a rollback (however, we need to consider whether the cause is known and whether it can be mitigated during the maintenance period)
- Metric: patroni-ci Service Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1, https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1 and https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1
  - What changes to this metric should prompt a rollback: if the replicas are in SYNC with the new master and we still observe sustained violation of the patroni-ci Service Apdex metrics for more than 5 minutes, then we need to consider a rollback
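The 4-minute promotion deadline from the Patroni log check above can be watched mechanically. A minimal sketch, assuming a simple polling approach (the helper name and the demo log line are illustrative, not actual Patroni output):

```shell
# Poll a log file for a pattern until a deadline (in seconds) expires.
# Returns 0 as soon as the pattern appears, 1 if the deadline is hit.
wait_for_line() {
  file=$1; pattern=$2; deadline=$3; waited=0
  while [ "$waited" -lt "$deadline" ]; do
    grep -q "$pattern" "$file" 2>/dev/null && return 0
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 1
}

# During the CR this would target /var/log/gitlab/patroni/patroni.log with a
# 240 s (4 minute) deadline; here we demo against a temporary file.
demo_log=$(mktemp)
echo "INFO: promoted self to leader" >> "$demo_log"
wait_for_line "$demo_log" "promoted" 5 && echo "promotion observed"
rm -f "$demo_log"
```

If the helper returns non-zero after 240 s, that is the cue to switch back to the former Leader per the rollback criterion above.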
Key metrics to observe after the CR execution, during the grace period
- Metric: Patroni Dashboards
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd and https://dashboards.gitlab.net/d/patroni-ci/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: sustained saturation of Primary-related workload, or sustained violation of any SLI Apdex quality metric if the peak matches saturation in at least one of the resource usage metrics below
Key resource usage metrics
- Metric: Replica nodes CPU Load (processes per core)
  - Location: node_load1
  - What changes to this metric should prompt a rollback: CPU load avg > 0.7 (per core) for 15 minutes or more
- Metric: Replica nodes CPU Usage (% of all CPUs)
  - Location: node_cpu_utilization
  - What changes to this metric should prompt a rollback: avg CPU utilization > 70% for 15 minutes or more
- Metric: Replica nodes Memory Thrashing (Swap in/out)
  - Location: node_vmstat_pswpin, node_vmstat_pswpout
  - What changes to this metric should prompt a rollback: spikes of swapping activity > 0 for 5 minutes or more
- Metric: Replica nodes I/O wait
  - Location: node_disk_read_time_seconds_total, node_disk_write_time_seconds_total
  - What changes to this metric should prompt a rollback: avg I/O wait > 10 ms (or 0.01 s) for 2 minutes or more, but only if caused by intense I/O activity
- Metric: Replica nodes I/O Throughput in MB/s
  - Location: /dev/sdb node_disk_read_bytes_total, /dev/sdb node_disk_written_bytes_total
  - What changes to this metric should prompt a rollback: I/O throughput > 560 MB/s (70% of the 800 MB/s limit*) for 15 minutes or more
- Metric: Replica nodes IOPS
  - Location: /dev/sdb node_disk_reads_completed_total, /dev/sdb node_disk_writes_completed_total
  - What changes to this metric should prompt a rollback: IOPS > 10500 (70% of the 15000 IOPS limit*) for 15 minutes or more
- Metric: Primary nodes Network throughput
  - Location: node_network_receive_bytes_total, node_network_transmit_bytes_total
  - What changes to this metric should prompt a rollback: sustained network throughput > 11.2 Gbps (1.4 GB/s), 70% of the VM limit of 16 Gbps (2 GB/s)*, for 15 minutes or more
* Network and Storage I/O performance limits in GPRD are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine-type limit for pd-performance, not the block-device limits
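As a sanity check, the 70% rollback thresholds quoted in the resource metrics above follow directly from the stated VM limits:

```shell
# 70% of each documented limit, matching the thresholds in the list above.
awk 'BEGIN {
  printf "I/O throughput threshold: %g MB/s\n", 0.70 * 800    # 560 MB/s
  printf "IOPS threshold:           %g\n",      0.70 * 15000  # 10500
  printf "Network threshold:        %g Gbps\n", 0.70 * 16     # 11.2 Gbps
}'
```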
Change Reviewer checklist
- Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
- Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.