[GSTG] [Sec Decomp] - Phase 7 rollout/rollback/rollout test
Staging Change
Change Details
- Services Impacted - ServicePatroni, ServicePatroniSec
- Change Technician - @jjsisson
- Change Reviewer - @rhenchen.gitlab @bprescott_
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-17 23:00
- Time tracking - 6h
- Downtime Component - none
Database Decomposition SEC database in GSTG
Database Decomposition Rollout Team
Role | Assigned To |
---|---|
| | @theoretick |
| | @jjsisson |
| | - |
| | - |
| | Can be from PD schedule |
| | Can be from PD schedule |
| | Check CMOC escalation table below |
| | @ghavenga |
| | - |
| | @release-managers |
| | - |
📣 CMOC Escalation Table
Important: this table covers only the start of each window; at any other time, ping @cmoc on Slack
Date and Step | Assigned To |
---|---|
2025-04-17 23:00 UTC - PCL start | TBD |
2025-04-18 01:00 UTC - Decomp start | TBD |
2025-04-18 02:00 UTC - Switchover | TBD |
2025-04-18 05:00 UTC - PCL finish | TBD |
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
App | Direct Link |
---|---|
Slack | #g_database_operations |
Video Call | Zoom link in Production Calendar event |
Immediately
Perform these steps when the issue is created.
- [ ] 🐺 Coordinator : Fill out the names of the rollout team in the table above.
Support Options
Provider | Plan | Details | Create Ticket |
---|---|---|---|
Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Entry points
Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes |
---|---|---|---|---|---|
Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
- rails: `ssh $USER-rails@console-01-sv-gstg.c.gitlab-staging-1.internal`
- main db replica: `ssh $USER-db@console-01-sv-gstg.c.gitlab-staging-1.internal`
- main db primary: `ssh $USER-db-primary@console-01-sv-gstg.c.gitlab-staging-1.internal`
- main db psql: `ssh -t patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql`
- sec db replica: `ssh $USER-db-sec@console-01-sv-gstg.c.gitlab-staging-1.internal`
- sec db primary: `ssh $USER-db-sec-primary@console-01-sv-gstg.c.gitlab-staging-1.internal`
- sec db psql: `ssh -t patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql`
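If you use these entry points repeatedly during the window, an SSH client config block can shorten them. A minimal sketch, assuming you replace `YOUR_USER` with your actual username (alias names are hypothetical):

```
# ~/.ssh/config sketch - alias names are hypothetical
Host gstg-console-rails
    HostName console-01-sv-gstg.c.gitlab-staging-1.internal
    User YOUR_USER-rails

Host gstg-db-sec-primary
    HostName console-01-sv-gstg.c.gitlab-staging-1.internal
    User YOUR_USER-db-sec-primary
```

With that in place, `ssh gstg-console-rails` is equivalent to the full `ssh YOUR_USER-rails@console-01-sv-gstg.c.gitlab-staging-1.internal` invocation above.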
Dashboards and debugging
These dashboards might be useful during the rollout: sec decomp dashboard
Staging
- PostgreSQL replication overview
- Decomposition using logical overview
- Triage overview
- Sidekiq overview
- Sentry - includes application errors
- Logs (Kibana)
Destination db: sec
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Source db: main
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Physical-to-Logical Replication, Decomposition, and Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
High level overview
This section gives a high-level overview of the procedure.
Decomposition Flowchart
```mermaid
flowchart TB
    subgraph PRE["Prepare new environment"]
        A["Create new cluster sec as a carbon copy of main"] --> B
        B["Attach sec as a standby-only cluster to main via physical replication"] --> C
    end
    C["Make sure both clusters are in sync"] --> D1
    subgraph BRK["Break physical replication: ansible-playbook physical_to_logical.yml"]
        D1["Disable Chef"] --> D2
        D2["Perform clean shutdown of sec"] --> D3
        D3["On main, create a replication slot and a publication FOR ALL TABLES; remember its LSN"] --> D4
        D4["Configure recovery_target_lsn on sec"] --> D5
        D5["Start sec"] --> D6
        D6["Let sec reach the slot's LSN, still using physical replication"] --> D7
        D7["Once the slot's LSN is reached, promote the sec leader"] --> D9
        D9["Create logical subscription with copy_data=false"] --> D10
        D10["Let sec catch up using logical replication"] --> H
    end
    subgraph RO["Redirect RO to sec"]
        H["Redirect RO traffic only to sec"] --> R
        R["Check if cluster is operational and metrics are normal"] --"Normal"--> S
        R --"Abnormal"--> GR
        S["DBRE verify E2E tests run as expected with Quality help"] --"Normal"--> T
        S --"Abnormal"--> GR
    end
    T["Switchover: redirect RW traffic to sec"] --> U1
    subgraph POST["Post Switchover Verification"]
        U1["Check if cluster is operational and metrics are normal"] --"Normal"--> U2
        U1 --"Abnormal"--> LR
        U2["Enable Chef, run chef-client"] --"Normal"--> U3
        U2 --"Abnormal"--> LR
        U3["Check if cluster is operational and metrics are normal"] --"Normal"--> Success
        U3 --"Abnormal"--> LR
        Success["Success!"]
    end
    subgraph GR["Graceful Rollback - no data loss"]
        GR1["Start graceful rollback"]
    end
    subgraph LR["Fix forward"]
        LR1["Fix all issues"] --> LR2
        LR2["Return to last failed step"]
    end
```
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
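The "Make sure both clusters are in sync" step in the flowchart can be sanity-checked with standard PostgreSQL functions; a sketch (acceptable lag thresholds are a judgment call, not specified by the playbook):

```sql
-- On the main primary: current WAL write position
SELECT pg_current_wal_lsn();

-- On the sec (standby) leader: last WAL position replayed
SELECT pg_last_wal_replay_lsn();

-- On the main primary: per-standby replication lag in bytes
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```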
Prep Tasks
PCL Start time (2025-04-17 22:00 UTC) - DECOMPOSITION minus 4 hours

- [ ] ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery:

  > Hi @release-managers :waves:, We would like to communicate that deployments should be stopped/locked in the STAGING environment in the next hour, as we will start the database decomposition of the MAIN and SEC PostgreSQL clusters at 2025-04-17 23:00 UTC - see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684. :bow: Please also lock automated canary deployments.

- [ ] 🏆 Quality On-Call : Confirm that QA tests are passing as a pre-decomp sanity check.
  - [ ] Confirm that smoke QA tests are passing on the current cluster by checking the latest status for Smoke type tests in the Staging and Staging Canary Allure reports listed in QA pipelines.
  - [ ] Trigger the Smoke E2E suite against the environment being decomposed (Staging: Four hourly smoke tests). Estimated duration: 15 minutes.
  - [ ] If the smoke tests fail, re-run the failed job to see if the failure is reproducible.
  - [ ] In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If no on-call DRI is available, reach out to #test-platform and escalate with the management team.
Prepare the environment

- [ ] 🔪 Playbook-Runner : Check that all needed MRs are rebased and contain the proper changes.
  - [ ] Post-Decomp MR, to change pgbouncer configurations in sec:
  - [ ] GSTG-CNY MR, to add sec configuration to gstg-cny: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4336 (merged)
  - [ ] GSTG-SIDEKIQ MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4337 (merged)
  - [ ] GSTG WEB MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4338 (merged)
  - [ ] GSTG-BASE MR, to move sec read-only over to sec-db-replica: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5864
  - [ ] GSTG-PATRONI-SEC MR, to remove standby configuration: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5865
- [ ] 🔪 Playbook-Runner : Get the console VM ready for action
  - [ ] SSH to the console VM in gstg:
    ```shell
    ssh console-01-sv-gstg.c.gitlab-staging-1.internal
    ```
  - [ ] Configure the dbupgrade user:
    - [ ] Disable screen sharing to reduce the risk of exposing the private key
    - [ ] Change to user dbupgrade: `sudo su - dbupgrade`
    - [ ] Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - [ ] `chmod 600 ~/.ssh/id_dbupgrade`
    - [ ] Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - [ ] Repeat the same steps on the target leader (it also has to have the private key)
    - [ ] Re-enable screen sharing if beneficial
  - [ ] Create an access_token with at least read_repository scope for the next step
  - [ ] Clone the repos:
    ```shell
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout master
    ```
  - [ ] Ensure you have Ansible installed:
    ```shell
    python3 -m venv ansible
    source ansible/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install ansible
    python3 -m pip install jmespath
    ansible --version
    ```
  - [ ] Ensure that Ansible can talk to all the hosts in gstg-main and gstg-sec:
    ```shell
    cd ~/src/db-migration/pg-physical-to-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml all -m ping
    ```
    You shouldn't see any failed hosts!
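For orientation, the inventory referenced above (`inventory/gstg-sec-decomp.yml`) would group the source and target hosts for the playbooks. A hypothetical sketch of its shape, not the real file (host lists abbreviated):

```yaml
# Hypothetical shape of the decomposition inventory - not the real file
all:
  children:
    gstg-main:
      hosts:
        patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal:
    gstg-sec:
      hosts:
        patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal:
  vars:
    ansible_user: dbupgrade
```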
- [ ] 🔪 Playbook-Runner : Add the following silences at https://alerts.gitlab.net to silence alerts on main and sec nodes until 4 hours after the switchover time:
  - Start time: 2025-04-17 22:00
  - Duration: 4h
  - Matchers:
    - main: `env="gstg"`, `fqdn=~"patroni-main-v16.*"`
    - sec: `env="gstg"`, `fqdn=~"patroni-sec-v16.*"`
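To double-check when a silence with the start time and duration above will expire, GNU date can do the arithmetic:

```shell
# Compute the silence expiry from start time + duration (GNU date syntax)
start="2025-04-17 22:00 UTC"
end=$(date -u -d "$start + 4 hours" +"%Y-%m-%d %H:%M")
echo "silence ends at: $end UTC"
```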
- [ ] 🐺 Coordinator : Get a green light from the 🚑 EOC
SEC Decomposition Prep Work
Prepare Environment

- [ ] ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery:

  > Hi @release-managers :waves:, We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` database in the `STAGING` environment, until 2025-04-18 05:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684. :bow:

- [ ] ☎️ Comms-Handler : Inform the database team at #g_database_frameworks and #g_database_operations:

  > Hi @dbo and @db_team, Please note that we started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-18 05:00 UTC. Thanks!
- [ ] 🔪 Playbook-Runner : Disable the DDL-related feature flags by typing the following into #production:
  ```shell
  /chatops run feature set disallow_database_ddl_feature_flags true --staging
  ```
Prechecks

- [ ] 🐺 Coordinator : Check if disallow_database_ddl_feature_flags is ENABLED. On Slack:
  ```shell
  /chatops run feature get disallow_database_ddl_feature_flags --staging
  ```
- [ ] 🔪 Playbook-Runner : Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns
- [ ] 🔪 Playbook-Runner : Disable chef on the main db cluster, sec db cluster and sec pgbouncers:
  ```shell
  knife ssh "role:gstg*patroni*main*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
  knife ssh "role:gstg*patroni*sec*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
  knife ssh "role:gstg*pgbouncer*sec*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
  ```
- [ ] 🔪 Playbook-Runner : Check if anyone except the application is connected to the source primary and interrupt them:
  - [ ] Confirm the source primary:
    ```shell
    knife ssh "role:gstg*patroni*main*" "sudo gitlab-patronictl list"
    ```
  - [ ] Log in to the source primary:
    ```shell
    ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
    ```
  - [ ] Check all connections that are not gitlab:
    ```shell
    gitlab-psql -c "
      select pid, client_addr, usename, application_name, backend_type,
             clock_timestamp() - backend_start as connected_ago, state,
             left(query, 200) as query
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
  - [ ] If there are sessions that could potentially perform writes, spend up to 10 minutes trying to find the actors and ask them to stop.
  - [ ] Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```shell
    gitlab-psql -c "
      select pg_terminate_backend(pid)
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
- [ ] 🔪 Playbook-Runner : Run the physical_prechecks playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
- [ ] 🔪 Playbook-Runner : Check that .pgpass is the same on both the source and target cluster primaries.
- [ ] 🔪 Playbook-Runner : Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul prior to switchover!
  ```
Break physical replication and configure logical replication
Convert Physical Replication to Logical

- [ ] 🔪 Playbook-Runner : Run the physical_to_logical playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_to_logical.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
Read-Only Traffic Configs
Read Only Traffic Switchover

Web Node Canary Rollout

- [ ] 🔪 Playbook-Runner : Switch over the gstg web configuration to the new pgbouncer-sec:
  - [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!4336 (merged)
  - [ ] Verify connectivity, monitor pgbouncer connections
  - [ ] Observe logs and prometheus for errors (see below)

Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

- [ ] 🔪 Coordinator : Ensure `json.db_sec_count : *` logs are present
- Primary connection usage by state:
  - pg_stat_activity_count
  - pgbouncer_stats_queries_pooled_total
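To keep an eye on the pooled-query counter named above during the rollout, a PromQL query along these lines can be pinned in a Grafana panel (the label names here are assumptions; verify them against the actual series):

```promql
sum by (database) (
  rate(pgbouncer_stats_queries_pooled_total{env="gstg"}[5m])
)
```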
Sidekiq Node Rollout

- [ ] 🔪 Playbook-Runner : Switch over the gstg sidekiq configuration to the new pgbouncer-sec:
  - [ ] Verify connectivity, monitor pgbouncer connections
  - [ ] Observe logs and prometheus for errors (see below)

Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

- [ ] 🔪 Coordinator : Ensure `json.db_sec_count : *` logs are present
- Primary connection usage by state:
  - pg_stat_activity_count
  - pgbouncer_stats_queries_pooled_total
Web Node Rollout

- [ ] 🔪 Playbook-Runner : Switch over the gstg web configuration to the new pgbouncer-sec
- [ ] 🔪 Playbook-Runner : Verify connectivity, monitor pgbouncer connections
- [ ] 🔪 Coordinator : Observe logs and prometheus for errors (see below)
- [ ] 🔪 Playbook-Runner : Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gstg-base, setting sec to the new patroni-sec-v16 DB. Writes will continue to go through the PGBouncer host to main and reads to sec replicas.
  - [ ] https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5864
  - [ ] Run chef-client on the console nodes

Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

- [ ] 🔪 Coordinator : Ensure `json.db_sec_count : *` logs are present
- Primary connection usage by state:
  - pg_stat_activity_count
  - pgbouncer_stats_queries_pooled_total
Verify Read Traffic to patroni-sec

- [ ] 🔪 Playbook-Runner : Ensure traffic is now being seen on monitoring_pgbouncer_gitlab_user_conns
Switchover - Take 1
Phase 7 – execute!
Phase 7 - switchover

- [ ] 🔪 Playbook-Runner : Run the Ansible playbook for Database Decomposition for the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- [ ] 🔪 Playbook-Runner : Edit /var/opt/gitlab/gitlab-rails/etc/database.yml on the console node to set database_tasks: true for the sec cluster
- [ ] 🔪 Playbook-Runner : Block writes to the main cluster in sec and to the sec cluster in main by running this on the console node:
  ```shell
  gitlab-rake gitlab:db:lock_writes
  ```
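Conceptually, the lock_writes task guards against cross-database writes with table-level triggers that raise an error. A simplified illustration of the mechanism only; the function and trigger names here are illustrative, not GitLab's actual generated definitions:

```sql
-- Simplified illustration; the real rake task generates equivalents per table
CREATE OR REPLACE FUNCTION prevent_cross_db_write() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'Table "%" is write-locked during decomposition', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lock_writes_example
  BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE ON some_sec_table
  FOR EACH STATEMENT EXECUTE FUNCTION prevent_cross_db_write();
```

The companion unlock_writes task (used in the rollback section below) removes these guards again.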
- [ ] 🔪 Playbook-Runner : Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- [ ] 🔪 Playbook-Runner : Verify reverse logical replication lag is low on the patroni-sec leader:
  ```shell
  ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
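Note that pg_wal_lsn_diff returns a raw byte count. For a quick human-readable reading of the number the query prints, a small shell sketch (the sample value is made up):

```shell
# Convert a replication-lag byte count (as returned by pg_wal_lsn_diff)
# into MiB for a quick sanity check. The sample value is made up.
lag_bytes=52428800
lag_mib=$((lag_bytes / 1024 / 1024))
echo "lag: ${lag_mib} MiB"
```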
Post Switchover QA Tests

Smoke Tests

- [ ] Start Post Switchover QA
  - [ ] 🏆 Quality On-Call : Run the full E2E suite against the environment that was decomposed (Staging): Four hourly smoke tests, and Daily Full QA suite
- [ ] 🏆 Quality On-Call : (after an hour) Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If no on-call DRI is available, reach out to #test-platform and escalate with the management team.
- [ ] 🏆 Quality On-Call : If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
Rollback
Estimated Time to Complete (mins) - 120
Rollback (required for testing!)

- [ ] 🔪 Playbook-Runner : Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns
ROLLBACK – execute!
Goal: Set gstg-main cluster as Primary cluster
- [ ] 🔪 Playbook-Runner : Verify reverse logical replication lag is low on the patroni-sec leader. This must be done using commands run on the database, not the graph. This must be done by a human. This must be done even if you have previously checked replication lag:
  ```shell
  ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
- [ ] 🔪 Playbook-Runner : Execute the switchover_rollback.yml playbook to roll back to the MAIN cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml \
    switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gstg_sec_$(date +%Y%m%d).log
  ```
- [ ] 🔪 Playbook-Runner : Unlock writes to the main cluster in sec and to the sec cluster in main by running this on the console node:
  ```shell
  gitlab-rake gitlab:db:unlock_writes
  ```
- [ ] 🔪 Playbook-Runner : Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after rollback:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul after rollback!
  ```
- [ ] 🔪 Playbook-Runner : Check WRITES are going to the SOURCE cluster, patroni-main-v16: monitoring_user_tables_writes
- [ ] 🔪 Playbook-Runner : Verify forward logical replication lag is low on the patroni-main leader:
  ```shell
  ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
Smoke Tests

- [ ] 🏆 Quality On-Call : Confirm that our smoke tests are still passing
Switchover - Take 2
Phase 7 - switchover take 2

- [ ] 🔪 Playbook-Runner : Run the Ansible playbook for Database Decomposition for the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- [ ] 🔪 Playbook-Runner : Block writes to the main cluster in sec and to the sec cluster in main by running this on the console node:
  ```shell
  gitlab-rake gitlab:db:lock_writes
  ```
- [ ] 🔪 Playbook-Runner : Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
Persist Correct configurations

- [ ] 🔪 Playbook-Runner : Revert the MR for the GSTG-CNY configuration so it uses the global config
- [ ] 🔪 Playbook-Runner : Merge the MRs that reconfigure patroni/pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipelines. If the MRs were merged and the pipeline failed, revert the MR with the failed pipeline and get it merged properly.
  - [ ] MR for pgbouncer-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5858
  - [ ] MR for patroni-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5865
- [ ] 🔪 Playbook-Runner : Validate chef pipelines finished correctly on ops
- [ ] 🔪 Playbook-Runner : Remove the standby_cluster config from patroni.yml (as needed):
  ```shell
  knife ssh "role:gstg*patroni*sec*" "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
  # should return nothing
  ```
  - If required, ssh to each patroni-sec host and remove the standby_cluster configuration from patroni.yml
- [ ] 🔪 Playbook-Runner : Run chef-client on one pgbouncer host and verify the configuration was not changed (changes require a reload to migrate traffic, so check nothing changed; if needed, revert the MR and update to resolve):
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- [ ] 🔪 Playbook-Runner : Run chef-client on the backup patroni-sec host (patroni-sec-v16-02-db-gstg) and verify the configuration was not changed:
  ```shell
  knife ssh patroni-sec-v16-02-db-gstg.c.gitlab-staging-1.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
  # should not return a value! Stop and investigate if this isn't correct!
  ```
- [ ] 🔪 Playbook-Runner : Run chef-client on the leader patroni-sec host (patroni-sec-v16-01-db-gstg) and verify the configuration was not changed:
  ```shell
  knife ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
  # should not return a value
  ```
- [ ] 🔪 Playbook-Runner : Check WRITES are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_writes
- [ ] 🔪 Playbook-Runner : Check READS are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_reads
- [ ] 🔪 Playbook-Runner : Confirm chef-client is ENABLED on all nodes: monitoring_chef_client_enabled
- [ ] 🔪 Playbook-Runner : Start cron.service on all gstg-sec nodes:
  ```shell
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl start cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  ```
Enable databaseTasks for k8s workloads

- [ ] 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for k8s nodes

Enable databaseTasks for deploy nodes

- [ ] 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for deploy nodes
  - [ ] MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5842
-
Post Switchover QA Tests
Wrapping Up

- [ ] Start Post Switchover QA
  - [ ] 🏆 Quality On-Call : Run the full E2E suite against the environment that was decomposed (Staging): Daily Full QA suite
- [ ] 🔪 Playbook-Runner : Create the wal-g daily restore schedule for the [gstg] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - Change the following variables:
    - PSQL_VERSION = 16
    - BACKUP_PATH = ? (? = use the "directory" from the new v16 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gstg-postgres-backup/pitr-walg-sec-v16)
- [ ] 🏆 Quality On-Call : (after an hour) Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If no on-call DRI is available, reach out to #test-platform and escalate with the management team.
- [ ] 🏆 Quality On-Call : If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
- [ ] 🔪 Playbook-Runner : Re-enable DDL by setting the disallow flag back to false. Type the following into #production:
  ```shell
  /chatops run feature set disallow_database_ddl_feature_flags false --staging
  ```
- [ ] ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery:

  > Hi @release-managers :waves:, Sec Decomp switchover/rollback/switchover has been completed and deployments may resume! See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:

- [ ] ☎️ Comms-Handler : Inform the database team at #g_database_frameworks and #g_database_operations:

  > Hi, Please note that we have completed the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition; therefore we are re-enabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. Thanks!
Extra details
In case the Playbook-Runner is disconnected
As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off half way through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in several ways. We tested the following approach for recovering the tmux session, updating the ssh agent, and taking over as a new ansible user.
- `ssh host`
- Add your public SSH key to `/home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys`
- Run `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639` so that chef doesn't override the above
- `ssh -A PREVIOUS_PLAYBOOK_USERNAME@host`
- `echo $SSH_AUTH_SOCK`
- `tmux attach -t 0`
- `export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>`
- `<ctrl-b> :` then `set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>`
- `export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME`
- `<ctrl-b> :` then `set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>`