[GSTG][Sec Decomp] - Phase 7 switchover/rollback/switchover test
Staging Change
Change Details
- Services Impacted - ServicePatroni, ServicePatroniSec
- Change Technician - @jjsisson
- Change Reviewer - @rhenchen.gitlab @alexander-sosna @bshah11
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-11 20:00
- Time tracking - 6h
- Downtime Component - none
Database Decomposition SEC database in GSTG
Note: This CR will be copied to ops.gitlab.net, where it will be used in the event of an unexpected downtime for gitlab.com. Link to gitlab.com CR: #19639 (closed)
Database Decomposition Rollout Team
| Role | Assigned To | 
|---|---|
|  | @theoretick | 
|  | @jjsisson | 
|  | - | 
|  | @hmuralidhar | 
|  | Can be from PD schedule | 
|  | Can be from PD schedule | 
|  | Check CMOC escalation table below | 
|  | @ghavenga | 
|  | |
|  | @release-managers | 
|  | | 
📣  CMOC Escalation Table
Important: Only needed when each window begins; otherwise ping @cmoc on Slack.
| Date and Step | Assigned To | 
|---|---|
| 2025-04-10 23:00 UTC - PCL start | TBD | 
| 2025-04-11 03:00 UTC - Decomp start | TBD | 
| 2025-04-11 07:00 UTC - Switchover | TBD | 
| 2025-04-11 11:00 UTC - PCL finish | TBD | 
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
| App | Direct Link | 
|---|---|
| Slack | #g_database_operations | 
| Video Call | TBD | 
Immediately
Perform these steps when the issue is created.
- 🐺 Coordinator: Fill out the names of the rollout team in the table above.
Support Options
| Provider | Plan | Details | Create Ticket | 
|---|---|---|---|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket | 
Entry points
| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes | 
|---|---|---|---|---|---|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in `gitlab_cache_expiry` minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
- rails: ssh $USER-rails@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db replica: ssh $USER-db@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db primary: ssh $USER-db-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
- main db psql: ssh -t patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
- sec db replica: ssh $USER-db-sec@console-01-sv-gstg.c.gitlab-staging-1.internal
- sec db primary: ssh $USER-db-sec-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
- sec db psql: ssh -t patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
Dashboards and debugging
These dashboards might be useful during the rollout:
Staging
- PostgreSQL replication overview
- Triage overview
- Sidekiq overview
- Sentry - includes application errors
- Logs (Kibana)
Destination db: sec
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Source db: main
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Physical-to-Logical Replication, Decomposition, and Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
High level overview
This gives a high-level overview of the procedure.
Decomposition Flowchart
```mermaid
flowchart TB
    subgraph Prepare new environment
    A[Create new cluster sec as a carbon copy of main] --> B
    B[Attach sec as a standby-only cluster to main via physical replication] --> C
    end
    C[Make sure both clusters are in sync] --> D1
    subgraph BPR[Break physical replication: ansible-playbook physical_to_logical.yml]
    D1[Disable Chef] --> D2
    D2[Perform clean shutdown of sec] --> D3
    D3[On main, create a replication slot and publication FOR ALL TABLES; remember its LSN] --> D4
    D4[Configure recovery_target_lsn on sec] --> D5
    D5[Start sec] --> D6
    D6[Let sec reach the slot's LSN, still using physical replication] --> D7
    D7[Once the slot's LSN is reached, promote the sec leader] --> D9
    D9[Create logical subscription with copy_data=false] --> D10
    D10[Let sec catch up using logical replication] --> H
    end
    subgraph Redirect RO to sec
    H[Redirect RO traffic only to sec] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[DBRE verify E2E tests run as expected with Quality help] --"Normal"--> T
    S --"Abnormal"--> GR
    end
    T[Switchover: redirect RW traffic to sec] --> U1
    subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run chef-client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
    end
    subgraph GR[Graceful rollback - no data loss]
    GR1[Start graceful rollback]
    end
    subgraph LR[Fix forward]
    LR1[Fix all issues] --> LR2
    LR2[Return to last failed step]
    end
```
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
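The physical_to_logical.yml playbook automates this transition. As a rough illustration of the core mechanics it performs (a sketch only; the publication, slot, and subscription names and the connection string are illustrative assumptions, not the playbook's actual object names):

```shell
# Illustrative sketch of the physical-to-logical transition; all object names are assumptions.

# On the main (source) primary: publish all tables and create a logical slot,
# noting the returned LSN so sec can be stopped exactly at that point.
sudo gitlab-psql -c "CREATE PUBLICATION decomposition_pub FOR ALL TABLES;"
sudo gitlab-psql -c "SELECT lsn FROM pg_create_logical_replication_slot('logical_replication_slot_sec', 'pgoutput');"

# On the sec (target) leader, after it has replayed up to that LSN via physical
# replication and been promoted: subscribe without copying data (it already has it).
sudo gitlab-psql -c "CREATE SUBSCRIPTION decomposition_sub
  CONNECTION 'host=<main-primary> dbname=gitlabhq_production'
  PUBLICATION decomposition_pub
  WITH (copy_data = false, create_slot = false, slot_name = 'logical_replication_slot_sec');"
```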
Prep Tasks
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, We would like to communicate that deployments should be stopped/locked in the STAGING environment, in the next hour, as we should start the database decomposition of the MAIN and SEC PostgreSQL clusters at 2025-04-11 03:00 UTC - see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- 🏆 Quality On-Call: Confirm that QA tests are passing as a pre-decomp sanity check.
  - 🏆 Quality On-Call: Confirm that smoke QA tests are passing on the current cluster by checking the latest status for Smoke-type tests in the Staging and Staging Canary Allure reports listed in QA pipelines.
  - 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that will be decomposed: Staging: Four hourly smoke tests. This has an estimated duration of 15 minutes.
- 🏆 Quality On-Call: If the smoke tests fail, re-run the failed job to see if the failure is reproducible.
- 🏆 Quality On-Call: In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
Prepare the environment
- 🔪 Playbook-Runner: Check that all needed MRs are rebased and contain the proper changes.
  - Post-Decomp MR, to change pgbouncer configurations in sec:
  - GSTG-CNY MR, to add sec configuration to gstg-cny: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4299 (merged)
  - GSTG-SIDEKIQ MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4300 (merged)
  - GSTG WEB MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4315 (merged)
  - GSTG-BASE MR, to move sec read-only over to sec-db-replica: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5841
- 🔪 Playbook-Runner: Get the console VM ready for action
  - SSH to the console VM in gstg:
    ```shell
    ssh console-01-sv-gstg.c.gitlab-staging-1.internal
    ```
  - Configure the dbupgrade user:
    - Disable screen sharing to reduce the risk of exposing the private key.
    - Change to user dbupgrade: `sudo su - dbupgrade`
    - Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - `chmod 600 ~/.ssh/id_dbupgrade`
    - Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - Repeat the same steps on the target leader (it also has to have the private key).
    - Re-enable screen sharing if beneficial.
  - Create an access_token with at least read_repository for the next step.
  - Clone repos:
    ```shell
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout master
    ```
  - Ensure you have Ansible installed:
    ```shell
    python3 -m venv ansible
    source ansible/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install ansible
    python3 -m pip install jmespath
    ansible --version
    ```
  - Ensure that Ansible can talk to all the hosts in gstg-main and gstg-sec:
    ```shell
    cd ~/src/db-migration/pg-physical-to-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml all -m ping
    ```
  - In advance, run pre-checks (you shouldn't see any failed hosts!):
    ```shell
    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
      | ts | tee -a ansible_upgrade_pre_checks_gstg_sec_$(date +%Y%m%d).log
    ```
- 🔪 Playbook-Runner: Add the following silences at https://alerts.gitlab.net to silence alerts on main and sec nodes until 4 hours after the switchover time (an optional CLI sketch follows below):
  - Start time: 2025-04-19T13:00:00.000Z
  - Duration: 4h
  - Matchers:
    - main: env="gstg", fqdn=~"patroni-main-v16.*"
    - sec: env="gstg", fqdn=~"patroni-sec-v16.*"
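If you prefer the CLI over the UI, a roughly equivalent sketch using amtool (an assumption that amtool is available and configured; the `--alertmanager.url` value below is a placeholder, and amtool silences start immediately rather than at a scheduled start time):

```shell
# Hypothetical amtool equivalent of the two UI silences; adjust the URL and author as needed.
for cluster in main sec; do
  amtool silence add \
    --alertmanager.url="https://alerts.gitlab.net" \
    --author="dbupgrade" \
    --duration="4h" \
    --comment="GSTG sec decomposition switchover - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639" \
    env="gstg" "fqdn=~\"patroni-${cluster}-v16.*\""
done
```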
- 🐺 Coordinator: Get a green light from the 🚑 EOC
SEC Decomposition Prep Work
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` database in the `STAGING` environment, until 2025-04-11 11:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- ☎️ Comms-Handler: Inform the database team at #g_database:
  > Hi @gl-database, Please note that we started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-11 11:00 UTC. Thanks!
- 🔪 Playbook-Runner: Disable the DDL-related feature flags by typing the following into #production:
  - `/chatops run feature set disallow_database_ddl_feature_flags true --staging`
- 🐺 Coordinator: Check that disallow_database_ddl_feature_flags is ENABLED:
  - On Slack: `/chatops run feature get disallow_database_ddl_feature_flags --staging`
- 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns (a hedged server-side cross-check sketch appears at the end of this section).
- 🔪 Playbook-Runner: Check if anyone except the application is connected to the source primary and interrupt them:
  - Log in to the source primary: `ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal`
  - Check all connections that are not gitlab:
    ```shell
    gitlab-psql -c "
      select pid, client_addr, usename, application_name, backend_type,
             clock_timestamp() - backend_start as connected_ago, state,
             left(query, 200) as query
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
  - If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.
  - Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```shell
    gitlab-psql -c "
      select pg_terminate_backend(pid)
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
- 🔪 Playbook-Runner: Run the physical_prechecks playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul prior to switchover!
  ```
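As referenced in the pgbouncer monitoring step above, if the dashboard is unavailable, a rough server-side cross-check (an illustrative sketch, not a step from the original runbook) is to count connections on the source primary grouped by user and application:

```shell
# Illustrative: count server-side connections (including those pooled via pgbouncer)
# on the main primary, grouped by user, application, and state.
ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select usename, application_name, state, count(*)
     from pg_stat_activity group by 1, 2, 3 order by 4 desc;\""
```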
Break physical replication and configure logical replication
- 🔪 Playbook-Runner: Run the physical_to_logical playbook:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml physical_to_logical.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  ```
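Once the playbook completes, a quick manual cross-check (an illustrative sketch, not part of the playbook; the slot name pattern matches the lag queries used later in this issue) is to confirm the logical slot is active on main and the subscription is streaming on sec:

```shell
# Illustrative: on the main (source) primary, the logical slot should exist and be active.
ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select slot_name, slot_type, active from pg_replication_slots
     where slot_name like 'logical_replication_slot%';\""

# Illustrative: on the sec (target) leader, the subscription should be receiving WAL.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select subname, received_lsn, latest_end_lsn, last_msg_receipt_time
     from pg_stat_subscription;\""
```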
Read-Only Traffic Configs
- 🔪 Playbook-Runner: Simple checks to see if the application can still talk to the sec_replica database. Expected: `db_config_name:sec_replica`
  ```ruby
  [10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
  [11] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM vulnerability_user_mentions") }
    (20.3ms)  SELECT COUNT(*) FROM vulnerability_user_mentions /*application:console,db_config_name:main_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
  => #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>
  ```
- [x] 🔪 Playbook-Runner: Switchover [gstg-cny web configuration](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/e1825d58ae2ae5a3892767a90edf18c1fe466b08/releases/gitlab/values/gstg-cny.yaml.gotmpl#L102) to new `pgbouncer-sec`
  - [x] merge [k8s-workload MR](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/4299)
  - Verify connectivity, monitor pgbouncer connections
  - Observe logs and prometheus for errors
  - All logs will split db_*_count metrics into separate buckets describing each used connection:
    - 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
  - Observable prometheus metrics:
    - Primary connection usage by state - pg_stat_activity_count
    - pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Switchover gstg sidekiq configuration to new pgbouncer-sec
  - merge k8s-workload MR
  - Verify connectivity, monitor pgbouncer connections
  - Observe logs and prometheus for errors
  - All logs will split db_*_count metrics into separate buckets describing each used connection:
    - 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
  - Observable prometheus metrics:
    - Primary connection usage by state - pg_stat_activity_count
    - pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Switchover gstg web configuration to new pgbouncer-sec
  - merge k8s-workload MR
- 🔪 Playbook-Runner: Verify connectivity, monitor pgbouncer connections
- 🔪 Coordinator: Observe logs and prometheus for errors
- 🔪 Playbook-Runner: Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gstg-base, setting sec to the new patroni-sec-v16 DB. Writes will continue to go through the PgBouncer host to main and reads to sec replicas.
4.4.1 Observable logs
All logs will split db_*_count metrics into separate buckets describing each used connection:
- 🔪 Coordinator: Ensure `json.db_sec_count : *` logs are present
4.4.2 Observable prometheus metrics
- Primary connection usage by state - pg_stat_activity_count
- pgbouncer_stats_queries_pooled_total
- 🔪 Playbook-Runner: Ensure traffic is now being seen for monitoring_pgbouncer_gitlab_user_conns
Switchover - Take 1
Phase 7 – execute!
- 🔪 Playbook-Runner: Run the Ansible switchover playbook for Database Decomposition of the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Verify reverse logical replication lag is low on the patroni-sec leader:
  ```shell
  ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
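In addition to the dashboards, a quick server-side spot check (an illustrative sketch, not a step from the original playbook) that writes are now landing on the sec cluster is to look at the tuple-write counters on the new primary:

```shell
# Illustrative: on the new sec primary, user tables should show growing write activity;
# run it twice a minute apart and compare the totals.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select sum(n_tup_ins + n_tup_upd + n_tup_del) as writes
     from pg_stat_user_tables;\""
```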
Post Switchover QA Tests
- 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests
- Start Post Switchover QA
  - 🏆 Quality On-Call: Full E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests, and Daily Full QA suite
- 🏆 Quality On-Call (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
  - 🏆 Quality On-Call: If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
Rollback
- 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: monitoring_pgbouncer_gitlab_user_conns
ROLLBACK – execute!
Goal: Set gstg-main cluster as Primary cluster
- 🔪 Playbook-Runner: Execute the switchover_rollback.yml playbook to roll back to the MAIN cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml \
    switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after rollback:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul after rollback!
  ```
- 🔪 Playbook-Runner: Check WRITES are going to the SOURCE cluster, patroni-main-v16: monitoring_user_tables_writes
- 🔪 Playbook-Runner: Verify forward logical replication lag is low on the patroni-main leader:
  ```shell
  ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
  sudo gitlab-psql
  ```
  ```sql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
  from pg_replication_slots
  where slot_name like 'logical_replication_slot%'
  order by 1 desc limit 1;
  ```
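As a complementary spot check after the rollback (an illustrative sketch, not a step from the playbook), confirm on the sec leader that its subscription back to main is streaming again:

```shell
# Illustrative: on the sec leader, the forward subscription from main should be
# receiving WAL again, and received_lsn should keep advancing between runs.
ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal \
  "sudo gitlab-psql -c \"select subname, received_lsn, latest_end_lsn, last_msg_receipt_time
     from pg_stat_subscription;\""
```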
Smoke Tests
- 🏆 Quality On-Call: Confirm that our smoke tests are still passing
Switchover - Take 2
- 🔪 Playbook-Runner: Run the Ansible switchover playbook for Database Decomposition of the gstg-sec cluster:
  ```shell
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  ```
- 🔪 Playbook-Runner: Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Merge the MR that reconfigures pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipeline. If the MR was merged with errors, revert it and get it merged properly.
  - MR for patroni-sec-v16: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5838
- 🔪 Playbook-Runner: Before re-enabling chef, ensure that the changes merged in the previous step have been deployed to the Chef server by confirming the linked master pipeline for ops.gitlab.net completed successfully.
- 🔪 Playbook-Runner: Run chef-client on one pgbouncer host and verify the configuration was not changed (changes require a reload to migrate traffic, so check that nothing changed; if needed, revert the MR and update to resolve):
  ```shell
  knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- 🔪 Playbook-Runner: Check WRITES are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_writes
- 🔪 Playbook-Runner: Check READS are going to the TARGET cluster, patroni-sec-v16: monitoring_user_tables_reads
- 🔪 Playbook-Runner: Confirm chef-client is ENABLED on all nodes: monitoring_chef_client_enabled
- 🔪 Playbook-Runner: Start cron.service on all gstg-sec nodes:
  ```shell
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl start cron.service"
  knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  ```
- 🔪 Playbook-Runner: Merge the MR that enables db_database_tasks for deploy nodes
  - MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5842
Post Switchover QA Tests
- 🏆 Quality On-Call: Trigger the Smoke E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests
- Start Post Switchover QA
  - 🏆 Quality On-Call: Full E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests, and Daily Full QA suite
- 🔪 Playbook-Runner: Create the wal-g daily restore schedule for the [gstg] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - Change the following variables:
    - PSQL_VERSION = 16
    - BACKUP_PATH = ? (? = use the "directory" from the new v16 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gstg-postgres-backup/pitr-walg-sec-v16)
- 🏆 Quality On-Call (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
  - 🏆 Quality On-Call: If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
- 🔪 Playbook-Runner: Re-enable the DDL-related features by disabling the kill-switch feature flag. Type the following into #production:
  - `/chatops run feature set disallow_database_ddl_feature_flags false --staging`
- ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
  > Hi @release-managers :waves:, Sec Decomp switchover/rollback/switchover has been completed and deployments may resume! See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
- ☎️ Comms-Handler: Inform the database team at #g_database:
  > Hi @gl-database, Please note that we have completed the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are re-enabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `STAGING` environment. Thanks!
Extra details
In case the Playbook-Runner is disconnected
As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in several ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user.
- ssh host
- Add your public SSH key to /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys
- `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639` so that we don't override the above
- ssh -A PREVIOUS_PLAYBOOK_USERNAME@host
- echo $SSH_AUTH_SOCK
- tmux attach -t 0
- export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>
- <ctrl-b> :
- set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>
- export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
- <ctrl-b> :
- set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>
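A consolidated sketch of the same takeover sequence (illustrative only; PREVIOUS_PLAYBOOK_USERNAME, NEW_PLAYBOOK_USERNAME, the key material, and the host are placeholders, and the `set-environment` lines are typed at the tmux command prompt reached via `<ctrl-b> :`):

```shell
# 1. From your own account on the console host: add your key to the previous runner's
#    account and stop chef from reverting the change.
ssh console-01-sv-gstg.c.gitlab-staging-1.internal
sudo tee -a /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys <<< "ssh-ed25519 AAAA... you@example"
sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639

# 2. Reconnect as the previous runner with agent forwarding, note the new agent socket,
#    then attach to the existing tmux session.
ssh -A PREVIOUS_PLAYBOOK_USERNAME@console-01-sv-gstg.c.gitlab-staging-1.internal
echo $SSH_AUTH_SOCK   # note this value for the steps below
tmux attach -t 0

# 3. Inside the tmux panes: point the shell and Ansible at your agent and user.
export SSH_AUTH_SOCK=<value noted above>
export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
# At the tmux command prompt (<ctrl-b> :), so new panes inherit the same values:
#   set-environment -g SSH_AUTH_SOCK <value noted above>
#   set-environment -g ANSIBLE_REMOTE_USER NEW_PLAYBOOK_USERNAME
```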