[GSTG][Sec Decomp] - Phase 7 switchover/rollback/switchover test
Staging Change
Change Details
- Services Impacted - ServicePatroni, ServicePatroniSec
- Change Technician - @jjsisson
- Change Reviewer - @rhenchen.gitlab @alexander-sosna @bshah11
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-11 20:00
- Time tracking - 6h
- Downtime Component - none
Database Decomposition SEC database in GSTG
Note: This CR will be copied to ops.gitlab.net, where it will be used in the event of an unexpected downtime for gitlab.com. Link to gitlab.com CR: #19639 (closed)
Database Decomposition Rollout Team
| Role | Assigned To | 
|---|---|
|  | @theoretick | 
|  | @jjsisson | 
|  | - | 
|  | @hmuralidhar | 
|  | Can be from PD schedule | 
|  | Can be from PD schedule | 
|  | Check CMOC escalation table below | 
|  | @ghavenga | 
|  | |
|  | @release-managers | 
|  | | 
📣  CMOC Escalation Table
Important: Only needed when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To | 
|---|---|
| 2025-04-10 23:00 UTC - PCL start | TBD | 
| 2025-04-11 03:00 UTC - Decomp start | TBD | 
| 2025-04-11 07:00 UTC - Switchover | TBD | 
| 2025-04-11 11:00 UTC - PCL finish | TBD | 
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
| App | Direct Link | 
|---|---|
| Slack | #g_database_operations | 
| Video Call | TBD | 
Immediately
Perform these steps when the issue is created.
- 🐺 Coordinator: Fill out the names of the rollout team in the table above.
Support Options
| Provider | Plan | Details | Create Ticket | 
|---|---|---|---|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket | 
Entry points
| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes | 
|---|---|---|---|---|---|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in `gitlab_cache_expiry` minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
- rails: ssh $USER-rails@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db replica: ssh $USER-db@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db primary: ssh $USER-db-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db psql: ssh -t patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
- sec db replica: ssh $USER-db-sec@console-01-sv-gstg.c.gitlab-staging-1.internal
- sec db primary: ssh $USER-db-sec-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
- sec db psql: ssh -t patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
Dashboards and debugging
These dashboards might be useful during the rollout:
Staging
- PostgreSQL replication overview
- Triage overview
- Sidekiq overview
- Sentry - includes application errors
- Logs (Kibana)
Destination db: sec
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Source db: main
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Physical-to-Logical Replication, Decomposition, and Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
High level overview
This gives a high-level overview of the procedure.
Decomposition Flowchart
```mermaid
flowchart TB
    subgraph Prepare new environment
    A[Create new cluster sec as a carbon copy of main] --> B
    B[Attach sec as a standby-only cluster to main via physical replication] --> C
    end
    C[Make sure both clusters are in sync] --> D1
    subgraph BPR[Break physical replication: ansible-playbook physical_to_logical.yml]
    D1[Disable Chef] --> D2
    D2[Perform clean shutdown of sec] --> D3
    D3[On main, create a replication slot and publication FOR ALL TABLES; remember its LSN] --> D4
    D4[Configure recovery_target_lsn on sec] --> D5
    D5[Start sec] --> D6
    D6[Let sec reach the slot's LSN, still using physical replication] --> D7
    D7[Once the slot's LSN is reached, promote the sec leader] --> D9
    D9[Create logical subscription with copy_data=false] --> D10
    D10[Let sec catch up using logical replication] --> H
    end
    subgraph Redirect RO to sec
    H[Redirect RO traffic only to sec] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[DBRE verify E2E tests run as expected with Quality help] --"Normal"--> T
    S --"Abnormal"--> GR
    end
    T[Switchover: redirect RW traffic to sec] --> U1
    subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run chef-client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
    end
    subgraph GR[Graceful rollback - no data loss]
    GR1[Start graceful rollback]
    end
    subgraph LR[Fix forward]
    LR1[Fix all issues] --> LR2
    LR2[Return to last failed step]
    end
```
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
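The physical_to_logical.yml playbook automates this transition. As a rough illustration of the core mechanics it performs (a sketch only; the publication, slot, and subscription names and the connection string are illustrative assumptions, not the playbook's actual object names):

```shell
# Illustrative sketch of the physical-to-logical transition; all object names are assumptions.

# On the main (source) primary: publish all tables and create a logical slot,
# noting the returned LSN so sec can be stopped exactly at that point.
sudo gitlab-psql -c "CREATE PUBLICATION decomposition_pub FOR ALL TABLES;"
sudo gitlab-psql -c "SELECT lsn FROM pg_create_logical_replication_slot('logical_replication_slot_sec', 'pgoutput');"

# On the sec (target) leader, after it has replayed up to that LSN via physical
# replication and been promoted: subscribe without copying data (it already has it).
sudo gitlab-psql -c "CREATE SUBSCRIPTION decomposition_sub
  CONNECTION 'host=<main-primary> dbname=gitlabhq_production'
  PUBLICATION decomposition_pub
  WITH (copy_data = false, create_slot = false, slot_name = 'logical_replication_slot_sec');"
```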
Prep Tasks
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, We would like to communicate that deployments should be stopped/locked in the STAGING environment, in the next hour, as we should start the database decomposition of the MAIN and SEC PostgreSQL clusters at 2025-04-11 03:00 UTC - see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- 🏆 Quality On-Call: Confirm that QA tests are passing as a pre-decomp sanity check.
  - 🏆 Quality On-Call: Confirm that smoke QA tests are passing on the current cluster by checking the latest status for Smoke-type tests in the Staging and Staging Canary Allure reports listed in QA pipelines.
  - 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that will be decomposed: Staging: Four hourly smoke tests. This has an estimated duration of 15 minutes.
- 🏆 Quality On-Call: If the smoke tests fail, re-run the failed job to see if the failure is reproducible.
- 🏆 Quality On-Call: In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
Prepare the environment
- 🔪 Playbook-Runner: Check that all needed MRs are rebased and contain the proper changes.
  - Post-Decomp MR, to change pgbouncer configurations in sec:
  - GSTG-CNY MR, to add sec configuration to gstg-cny: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4299 (merged)
  - GSTG-SIDEKIQ MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4300 (merged)
  - GSTG WEB MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4315 (merged)
  - GSTG-BASE MR, to move sec read-only over to sec-db-replica: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5841
- 🔪 Playbook-Runner: Get the console VM ready for action
  - SSH to the console VM in gstg:
    ```shell
    ssh console-01-sv-gstg.c.gitlab-staging-1.internal
    ```
  - Configure the dbupgrade user:
    - Disable screen sharing to reduce the risk of exposing the private key.
    - Change to user dbupgrade: `sudo su - dbupgrade`
    - Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - `chmod 600 ~/.ssh/id_dbupgrade`
    - Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - Repeat the same steps on the target leader (it also has to have the private key).
    - Re-enable screen sharing if beneficial.
  - Create an access_token with at least read_repository for the next step.
  - Clone repos:
    ```shell
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout master
    ```
  - Ensure you have Ansible installed:
    ```shell
    python3 -m venv ansible
    source ansible/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install ansible
    python3 -m pip install jmespath
    ansible --version
    ```
  - Ensure that Ansible can talk to all the hosts in gstg-main and gstg-sec:
    ```shell
    cd ~/src/db-migration/pg-physical-to-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml all -m ping
    ```
  - In advance, run pre-checks (you shouldn't see any failed hosts!):
    ```shell
    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
      | ts | tee -a ansible_upgrade_pre_checks_gstg_sec_$(date +%Y%m%d).log
    ```
- 🔪 Playbook-Runner: Add the following silences at https://alerts.gitlab.net to silence alerts on main and sec nodes until 4 hours after the switchover time (an optional CLI sketch follows below):
  - Start time: 2025-04-19T13:00:00.000Z
  - Duration: 4h
  - Matchers:
    - main: env="gstg", fqdn=~"patroni-main-v16.*"
    - sec: env="gstg", fqdn=~"patroni-sec-v16.*"
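If you prefer the CLI over the UI, a roughly equivalent sketch using amtool (an assumption that amtool is available and configured; the `--alertmanager.url` value below is a placeholder, and amtool silences start immediately rather than at a scheduled start time):

```shell
# Hypothetical amtool equivalent of the two UI silences; adjust the URL and author as needed.
for cluster in main sec; do
  amtool silence add \
    --alertmanager.url="https://alerts.gitlab.net" \
    --author="dbupgrade" \
    --duration="4h" \
    --comment="GSTG sec decomposition switchover - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639" \
    env="gstg" "fqdn=~\"patroni-${cluster}-v16.*\""
done
```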
- 🐺 Coordinator: Get a green light from the 🚑 EOC
SEC Decomposition Prep Work
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` database in the `STAGING` environment, until 2025-04-11 11:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- ☎️ Comms-Handler: Inform the database team at #g_database:
  > Hi @gl-database, Please note that we started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-11 11:00 UTC. Thanks!
- 🔪 Playbook-Runner: Disable the DDL-related feature flags by typing the following into #production:
  - `/chatops run feature set disallow_database_ddl_feature_flags true --staging`
- 🐺 Coordinator: Check that disallow_database_ddl_feature_flags is ENABLED:
  - On Slack: `/chatops run feature get disallow_database_ddl_feature_flags --staging`
- 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns (a hedged server-side cross-check sketch appears at the end of this section).
- 🔪 Playbook-Runner: Check if anyone except the application is connected to the source primary and interrupt them:
  - Log in to the source primary: `ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal`
  - Check all connections that are not gitlab:
    ```shell
    gitlab-psql -c "
      select pid, client_addr, usename, application_name, backend_type,
             clock_timestamp() - backend_start as connected_ago, state,
             left(query, 200) as query
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
  - If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.
  - Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```shell
    gitlab-psql -c "
      select pg_terminate_backend(pid)
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
- 🔪 Playbook-Runner: Run the physical_prechecks playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul prior to switchover!
  ```
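As referenced in the pgbouncer monitoring step above, if the dashboard is unavailable, a rough server-side cross-check (an illustrative sketch, not a step from the original runbook) is to count connections on the source primary grouped by user and application:

```shell
# Illustrative: count server-side connections (including those pooled via pgbouncer)
# on the main primary, grouped by user, application, and state.
ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select usename, application_name, state, count(*)
     from pg_stat_activity group by 1, 2, 3 order by 4 desc;\""
```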
Break physical replication and configure logical replication
- 🔪 Playbook-Runner: Run the physical_to_logical playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_to_logical.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
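Once the playbook completes, a quick manual cross-check (an illustrative sketch, not part of the playbook; the slot name pattern matches the lag queries used later in this issue) is to confirm the logical slot is active on main and the subscription is streaming on sec:

```shell
# Illustrative: on the main (source) primary, the logical slot should exist and be active.
ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select slot_name, slot_type, active from pg_replication_slots
     where slot_name like 'logical_replication_slot%';\""

# Illustrative: on the sec (target) leader, the subscription should be receiving WAL.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select subname, received_lsn, latest_end_lsn, last_msg_receipt_time
     from pg_stat_subscription;\""
```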
Read-Only Traffic Configs
- 🔪 Playbook-Runner: Simple checks to see if the application can still talk to the sec_replica database. Expected: `db_config_name:sec_replica`
  ```ruby
  [10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
  [11] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM vulnerability_user_mentions") }
    (20.3ms)  SELECT COUNT(*) FROM vulnerability_user_mentions /*application:console,db_config_name:main_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
  => #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>
  ```
- [x] 🔪 Playbook-Runner: Switchover [gstg-cny web configuration](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/e1825d58ae2ae5a3892767a90edf18c1fe466b08/releases/gitlab/values/gstg-cny.yaml.gotmpl#L102) to new `pgbouncer-sec`
  - [x] merge [k8s-workload MR](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/4299)
  - Verify connectivity, monitor pgbouncer connections
  - Observe logs and prometheus for errors
  - All logs will split db_*_count metrics into separate buckets describing each used connection:
    - 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
  - Observable prometheus metrics:
    - Primary connection usage by state - pg_stat_activity_count
    - pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Switchover gstg sidekiq configuration to new pgbouncer-sec
  - merge k8s-workload MR
  - Verify connectivity, monitor pgbouncer connections
  - Observe logs and prometheus for errors
  - All logs will split db_*_count metrics into separate buckets describing each used connection:
    - 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
  - Observable prometheus metrics:
    - Primary connection usage by state - pg_stat_activity_count
    - pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Switchover gstg web configuration to new pgbouncer-sec
  - merge k8s-workload MR
- 🔪 Playbook-Runner: Verify connectivity, monitor pgbouncer connections
- 🔪 Coordinator: Observe logs and prometheus for errors
- 🔪 Playbook-Runner: Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gstg-base, setting sec to the new patroni-sec-v16 DB. Writes will continue to go through the PgBouncer host to main and reads to sec replicas.
4.4.1 Observable logs
All logs will split db_*_count metrics into separate buckets describing each used connection:
- 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
4.4.2 Observable prometheus metrics
- Primary connection usage by state - pg_stat_activity_count
- pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Ensure traffic is now being seen for monitoring_pgbouncer_gitlab_user_conns
Switchover - Take 1
Phase 7 – execute!
- 🔪 Playbook-Runner: Run the Ansible switchover playbook for Database Decomposition of the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Verify reverse logical replication lag is low on the patroni-sec leader:
  ```shell
  ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
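In addition to the dashboards, a quick server-side spot check (an illustrative sketch, not a step from the original playbook) that writes are now landing on the sec cluster is to look at the tuple-write counters on the new primary:

```shell
# Illustrative: on the new sec primary, user tables should show growing write activity;
# run it twice a minute apart and compare the totals.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select sum(n_tup_ins + n_tup_upd + n_tup_del) as writes
     from pg_stat_user_tables;\""
```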
Post Switchover QA Tests
- 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests
- Start Post Switchover QA
  - 🏆 Quality On-Call: Full E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests, and Daily Full QA suite
- 🏆 Quality On-Call (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
  - 🏆 Quality On-Call: If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
Rollback
- 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns
ROLLBACK – execute!
Goal: Set gstg-main cluster as Primary cluster
- 🔪 Playbook-Runner: Execute the switchover_rollback.yml playbook to roll back to the MAIN cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml \
    switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after rollback:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul after rollback!
  ```
- 🔪 Playbook-Runner: Check WRITES are going to the SOURCE cluster, patroni-main-v16: monitoring_user_tables_writes
- 🔪 Playbook-Runner: Verify forward logical replication lag is low on the patroni-main leader:
  ```shell
  ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
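As a complementary spot check after the rollback (an illustrative sketch, not a step from the playbook), confirm on the sec leader that its subscription back to main is streaming again:

```shell
# Illustrative: on the sec leader, the forward subscription from main should be
# receiving WAL again, and received_lsn should keep advancing between runs.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select subname, received_lsn, latest_end_lsn, last_msg_receipt_time
     from pg_stat_subscription;\""
```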
Smoke Tests
- 🏆 Quality On-Call: Confirm that our smoke tests are still passing
Switchover - Take 2
- 🔪 Playbook-Runner: Run the Ansible switchover playbook for Database Decomposition of the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Merge the MR that reconfigures pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipeline. If the MR was merged with errors, revert it and get it merged properly.
  - MR for patroni-sec-v16: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5838
- 🔪 Playbook-Runner: Before re-enabling chef, ensure that the changes merged in the previous step have been deployed to the Chef server by confirming the linked master pipeline for ops.gitlab.net completed successfully.
- 🔪 Playbook-Runner: Run chef-client on one pgbouncer host and verify the configuration was not changed (changes require a reload to migrate traffic, so check that nothing changed; if needed, revert the MR and update to resolve):
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Check WRITES are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_writes
- 🔪 Playbook-Runner: Check READS are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_reads
- 🔪 Playbook-Runner: Confirm chef-client is ENABLED on all nodes: monitoring_chef_client_enabled
- 🔪 Playbook-Runner: Start cron.service on all gstg-sec nodes:
  ```shell
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl start cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  ```
- 🔪 Playbook-Runner: Merge the MR that enables db_database_tasks for deploy nodes
  - MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5842
Post Switchover QA Tests
- 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests
- Start Post Switchover QA
  - 🏆 Quality On-Call: Full E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests, and Daily Full QA suite
- 🔪 Playbook-Runner: Create the wal-g daily restore schedule for the [gstg] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - Change the following variables:
    - PSQL_VERSION = 16
    - BACKUP_PATH = ? (? = use the "directory" from the new v16 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gstg-postgres-backup/pitr-walg-sec-v16)
- 🏆 Quality On-Call (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
  - 🏆 Quality On-Call: If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
- 🔪 Playbook-Runner: Re-enable the DDL-related features by disabling the kill-switch feature flag. Type the following into #production:
  - `/chatops run feature set disallow_database_ddl_feature_flags false --staging`
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, Sec Decomp switchover/rollback/switchover has been completed and deployments may resume! See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- ☎️ Comms-Handler: Inform the database team at #g_database:
  > Hi @gl-database, Please note that we have completed the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are re-enabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. Thanks!
Extra details
In case the Playbook-Runner is disconnected
As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in several ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user.
- ssh host
- Add your public SSH key to /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys
- `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639` so that we don't override the above
- ssh -A PREVIOUS_PLAYBOOK_USERNAME@host
- echo $SSH_AUTH_SOCK
- tmux attach -t 0
- export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>
- <ctrl-b> :
- set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>
- export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
- <ctrl-b> :
- set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>
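A consolidated sketch of the same takeover sequence (illustrative only; PREVIOUS_PLAYBOOK_USERNAME, NEW_PLAYBOOK_USERNAME, the key material, and the host are placeholders, and the `set-environment` lines are typed at the tmux command prompt reached via `<ctrl-b> :`):

```shell
# 1. From your own account on the console host: add your key to the previous runner's
#    account and stop chef from reverting the change.
ssh console-01-sv-gstg.c.gitlab-staging-1.internal
sudo tee -a /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys <<< "ssh-ed25519 AAAA... you@example"
sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639

# 2. Reconnect as the previous runner with agent forwarding, note the new agent socket,
#    then attach to the existing tmux session.
ssh -A PREVIOUS_PLAYBOOK_USERNAME@console-01-sv-gstg.c.gitlab-staging-1.internal
echo $SSH_AUTH_SOCK   # note this value for the steps below
tmux attach -t 0

# 3. Inside the tmux panes: point the shell and Ansible at your agent and user.
export SSH_AUTH_SOCK=<value noted above>
export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
# At the tmux command prompt (<ctrl-b> :), so new panes inherit the same values:
#   set-environment -g SSH_AUTH_SOCK <value noted above>
#   set-environment -g ANSIBLE_REMOTE_USER NEW_PLAYBOOK_USERNAME
```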