
[GSTG] [Sec Decomp] - Phase 7 rollout/rollback/rollout test

Staging Change

Change Details

  1. Services Impacted - ServicePatroni ServicePatroniSec
  2. Change Technician - @jjsisson
  3. Change Reviewer - @rhenchen.gitlab @bprescott_
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-17 23:00
  5. Time tracking - 6h
  6. Downtime Component - none

Database Decomposition SEC database in GSTG

Database Decomposition Rollout Team

| Role | Assigned To |
|------|-------------|
| 🐺 Coordinator | @theoretick |
| 🔪 DB Playbook-Runner | @jjsisson |
| ☎️ Comms-Handler | - |
| 🏆 Quality | - |
| 🚑 EOC | Can be from PD schedule |
| 🚒 IMOC | Can be from PD schedule |
| 📣 CMOC | Check CMOC escalation table below |
| 🔦 Database Maintainers | @ghavenga |
| 💾 Database Escalation | - |
| 🚚 Delivery Escalation | @release-managers |
| 🎩 Head Honcho | - |
📣 CMOC Escalation Table

Important: CMOC coverage is only needed when each window begins; otherwise ping @cmoc on Slack.

| Date and Step | Assigned To |
|---------------|-------------|
| 2025-04-17 23:00 UTC - PCL start | TBD |
| 2025-04-18 01:00 UTC - Decomp start | TBD |
| 2025-04-18 02:00 UTC - Switchover | TBD |
| 2025-04-18 05:00 UTC - PCL finish | TBD |

Collaboration

During the change window, the rollout team will collaborate using the following communications channels:

| App | Direct Link |
|-----|-------------|
| Slack | #g_database_operations |
| Video Call | Zoom link in Production Calendar event |

Immediately

Perform these steps when the issue is created.

  • 🐺 Coordinator : Fill out the names of the rollout team in the table above.

Support Options

| Provider | Plan | Details | Create Ticket |
|----------|------|---------|---------------|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |

Entry points

| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes |
|-------------|--------|--------------------|-----------|----------|-------|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes | N/A | N/A | |

Database hosts

Accessing the rails and database consoles

  • rails: ssh $USER-rails@console-01-sv-gstg.c.gitlab-staging-1.internal
  • main db replica: ssh $USER-db@console-01-sv-gstg.c.gitlab-staging-1.internal
  • main db primary: ssh $USER-db-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
  • main db psql: ssh -t patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
  • sec db replica: ssh $USER-db-sec@console-01-sv-gstg.c.gitlab-staging-1.internal
  • sec db primary: ssh $USER-db-sec-primary@console-01-sv-gstg.c.gitlab-staging-1.internal
  • sec db psql: ssh -t patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql
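To confirm you landed on the intended role, a quick check with standard PostgreSQL can help (a minimal sketch, using one of the hosts listed above):

  # returns 'f' on a primary, 't' on a replica
  ssh -t patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal sudo gitlab-psql -c "select pg_is_in_recovery()"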

Dashboards and debugging

These dashboards might be useful during the rollout: sec decomp dashboard

Staging

Destination db: sec

Source db: main

Repos used during the rollout

The following Ansible playbooks are referenced throughout this issue:


High level overview

This gives a high-level overview of the procedure.

Decomposition Flowchart
flowchart TB
    subgraph Prepare new environment
    A[Create new cluster sec as a carbon copy of main] --> B
    B[Attach sec as a standby-only-cluster to main via physical replication] --> C
    end
    C[Make sure both clusters are in sync] --> D1
    subgraph Break Physical Replication: ansible-playbook physical-to-logical.yml
    D1[Disable Chef] --> D2
    D2[Perform clean shutdown of sec] --> D3
    D3[On main, create a replication slot and publication FOR ALL main TABLES; remember its LSN] --> D4
    D4[Configure recovery_target_lsn on sec] --> D5
    D5[Start sec] --> D6
    D6[Let sec reach the slot's LSN, still using physical replication] --> D7
    D7[Once slot's LSN is reached, promote sec leader] --> D9
    D9[Create logical subscription with copy_data=false] --> D10
    D10[Let sec catch up using logical replication] --> H
    end
    subgraph Redirect RO to sec
    H[Redirect RO only to sec] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[DBRE verify E2E tests run as expected with Quality help] --"Normal"--> T
    S --"Abnormal"-->GR
    end
    T[Switchover: Redirect RW traffic to sec] --> U1
    subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal]--"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run Chef-Client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
    end
    subgraph GR[Graceful Rollback - no data loss]
    GR1[Start graceful rollback]
    end
    subgraph LR[Fix forward]
    LR1[Fix all issues] -->LR2
    LR2[Return to last failed step]
    end

Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
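For orientation, the core of the conversion that the playbook automates can be sketched in a few statements. This is a hedged sketch only: the object names (decomp_pub, decomp_slot, decomp_sub) and the database name are illustrative, and the playbook additionally handles recovery_target_lsn, promotion, ordering, and error handling.

    # On the main (source) primary: create the publication and a logical slot, noting the returned LSN
    sudo gitlab-psql -c "CREATE PUBLICATION decomp_pub FOR ALL TABLES"
    sudo gitlab-psql -c "SELECT lsn FROM pg_create_logical_replication_slot('decomp_slot', 'pgoutput')"

    # On the sec primary, after it has replayed up to that LSN and been promoted:
    # attach to the existing slot without re-copying data
    sudo gitlab-psql -c "CREATE SUBSCRIPTION decomp_sub
      CONNECTION 'host=<main-primary> dbname=gitlabhq_production'
      PUBLICATION decomp_pub
      WITH (copy_data = false, create_slot = false, slot_name = 'decomp_slot')"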

Prep Tasks

  • PCL Start time (2025-04-17 22:00 UTC) - DECOMPOSITION minus 4 hours
  1. ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery

    Hi @release-managers :waves:, 
    
    We would like to communicate that deployments should be stopped/locked in the STAGING environment, in the next hour, as we should start the database decomposition of the MAIN and SEC PostgreSQL clusters at 2025-04-17 23:00 UTC - see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684. :bow:
    
    Please also lock automated canary deployments.
  2. [ ] 🏆 Quality On-Call : Confirm that QA tests are passing as a pre-decomp sanity check

    1. 🏆 Quality On-Call : Confirm that smoke QA tests are passing on the current cluster by checking latest status for Smoke Type tests in Staging and Staging Canary Allure reports listed in QA pipelines.
      • 🏆 Quality On-Call : Trigger Smoke E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests. This has an estimated duration of 15 minutes.
      • 🏆 Quality On-Call : If the smoke tests fail, we should re-run the failed job to see if it is reproducible.
      • 🏆 Quality On-Call : In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.

Prepare the environment

  1. [ ] 🔪 Playbook-Runner : Check that all needed MRs are rebased and contain the proper changes.
    1. [ ] Post-Decomp MR, to change pgbouncer configurations in sec:
    2. [ ] GSTG-CNY MR, to add sec configuration to gstg-cny:
    3. [ ] GSTG-SIDEKIQ MR, to move sec read-only over to sec-db-replica
    4. [ ] GSTG WEB MR, to move sec read-only over to sec-db-replica
    5. [ ] GSTG-BASE MR, to move sec read-only over to sec-db-replica
    6. [ ] GSTG-PATRONI-SEC MR, to remove standby configuration
  2. 🔪 Playbook-Runner : Get the console VM ready for action
    • SSH to the console VM in gstg

      • ssh console-01-sv-gstg.c.gitlab-staging-1.internal
    • Configure dbupgrade user

      • Disable screen sharing to reduce risk of exposing private key
      • Change to user dbupgrade sudo su - dbupgrade
      • Copy dbupgrade user's private key from 1Password to ~/.ssh/id_dbupgrade
      • chmod 600 ~/.ssh/id_dbupgrade
      • Use key as default ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa
      • Repeat the same steps on the target leader (it also needs the private key)
      • Re-enable screen sharing if desired
    • Create an access_token with at least read_repository for the next step

    • Clone repos:

      rm -rf ~/src \
        && mkdir ~/src \
        && cd ~/src \
        && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
        && cd db-migration \
        && git checkout master
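      If the clone prompts for credentials, the access token created above can be supplied over HTTPS (oauth2 is GitLab's documented token username; a sketch):

        git clone https://oauth2:<ACCESS_TOKEN>@gitlab.com/gitlab-com/gl-infra/db-migration.git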
    • Ensure you have Ansible installed:

      python3 -m venv ansible
      source ansible/bin/activate
      python3 -m pip install --upgrade pip
      python3 -m pip install ansible
      python3 -m pip install jmespath
      ansible --version
    • Ensure that Ansible can talk to all the hosts in gstg-main and gstg-sec

      cd ~/src/db-migration/pg-physical-to-logical
      ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gstg-sec-decomp.yml all -m ping

      You shouldn't see any failed hosts!

  3. [ ] 🔪 Playbook-Runner : Add the following silences at https://alerts.gitlab.net to silence alerts in main and sec nodes until 4 hours after the switchover time:
    • Start time: 2025-04-17 22:00
    • Duration: 4h
    • Matchers
      • main
        • env="gstg"
        • fqdn=~"patroni-main-v16.*"
      • sec
        • env="gstg"
        • fqdn=~"patroni-sec-v16.*"
  4. 🐺 Coordinator : Get a green light from the 🚑 EOC

SEC Decomposition Prep Work

  • Prepare Environment
  1. [ ] ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery

    Hi @release-managers :waves:, 
    We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` databases in the `STAGING` environment until 2025-04-18 05:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please acknowledge by commenting on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684. :bow:
  2. ☎️ Comms-Handler : Inform the database team at #g_database_frameworks and #g_database_operations

    Hi @dbo and @db_team,
    
    Please note that we have started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition. We are therefore blocking database model/structure modifications in the `STAGING` environment by disabling the `execute_batched_migrations_on_schedule` and `execute_background_migrations` tasks, reindexing, the async_foreign_key and async_index features, and partition_manager_sync_partitions.
    
    We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-18 05:00 UTC.
    
    Thanks!
  3. 🔪 Playbook-Runner : Disable the DDL-related feature flags:

    1. Disable feature flags by typing the following into #production:
      1. /chatops run feature set disallow_database_ddl_feature_flags true --staging
  • Prechecks
  1. 🐺 Coordinator : Check if disallow_database_ddl_feature_flags is ENABLED:

    • On slack /chatops run feature get disallow_database_ddl_feature_flags --staging
  2. 🔪 Playbook-Runner : Monitor which pgbouncer pool has connections: [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]

  3. 🔪 Playbook-Runner : Disable chef on the main db cluster, sec db cluster and sec pgbouncers

    knife ssh "role:gstg*patroni*main*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
    knife ssh "role:gstg*patroni*sec*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
    knife ssh "role:gstg*pgbouncer*sec*" "sudo /usr/local/bin/chef-client-disable 'GSTG Sec Decomp'"
  4. [ ] 🔪 Playbook-Runner : Check if anyone except application is connected to source primary and interrupt them:

    1. Confirm the source primary

      knife ssh "role:gstg*patroni*main*" "sudo gitlab-patronictl list"
    2. [ ] Login to source primary

      ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
    3. Check all connections that are not gitlab:

      gitlab-psql -c "
        select
          pid, client_addr, usename, application_name, backend_type,
          clock_timestamp() - backend_start as connected_ago,
          state,
          left(query, 200) as query
        from pg_stat_activity
        where
          pid <> pg_backend_pid()
          and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
          and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
          and application_name <> 'Patroni'
        "
    4. If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.

    5. Finally, terminate all the remaining sessions that are not coming from application/infra components and potentially can cause writes:

      gitlab-psql -c "
        select pg_terminate_backend(pid)
        from pg_stat_activity
        where
          pid <> pg_backend_pid()
          and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
          and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
          and application_name <> 'Patroni'
        "
  5. 🔪 Playbook-Runner : Run physical_prechecks playbook:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log
  6. 🔪 Playbook-Runner : Check that the pgpass and .pgpass files are the same on both the source and target cluster primaries.

  7. 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec

    knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni.service.consul prior to switchover!

Break physical replication and configure logical replication

  • Convert Physical Replication to Logical
  1. 🔪 Playbook-Runner : Run physical-to-logical playbook:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml physical_to_logical.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gstg_sec_$(date +%Y%m%d).log

Read-Only Traffic Configs

  • Read Only Traffic Switchover
  • Web Node Canary Rollout
  1. 🔪 Playbook-Runner : Switchover gstg web configuration to new pgbouncer-sec
  • Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present
  • Sidekiq Node Rollout
  1. 🔪 Playbook-Runner : Switchover gstg sidekiq configuration to new pgbouncer-sec
  • Verify connectivity, monitor pgbouncer connections
  • Observe logs and prometheus for errors (see below)
  • Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present
  • Web Node Rollout
  1. 🔪 Playbook-Runner : Switchover gstg web configuration to new pgbouncer-sec
  2. 🔪 Playbook-Runner : Verify connectivity, monitor pgbouncer connections
  3. 🐺 Coordinator : Observe logs and Prometheus for errors (see below)
  4. 🔪 Playbook-Runner : Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gstg-base, setting sec to the new patroni-sec-v16 DB. Writes will continue to go through the PgBouncer host to main and reads to sec replicas.
    1. https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5864
    2. run chef-client on the console nodes
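    A minimal sketch of that chef-client run (the host pattern is an assumption based on the console host used elsewhere in this issue):

      knife ssh "name:console-01-sv-gstg*" "sudo chef-client"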
  • Observable Logs and Prometheus Metrics

4.4.1 Observable logs

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present

4.4.2. Observable Prometheus metrics

  • Verify Read Traffic to patroni-sec
  1. 🔪 Playbook-Runner : monitoring_pgbouncer_gitlab_user_conns

    • Ensure traffic is now being seen for monitoring_pgbouncer_gitlab_user_conns

Switchover - Take 1

Phase 7 – execute!

  • Phase 7 - switchover
  1. 🔪 Playbook-Runner : Run Ansible playbook for Database Decomposition for the gstg-sec cluster:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  2. 🔪Playbook-Runner : Edit the /var/opt/gitlab/gitlab-rails/etc/database.yml file on the console node to set database_tasks: true for the sec cluster
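    The relevant stanza would look roughly like this (a sketch; surrounding connection settings are omitted and the exact layout in gstg may differ):

      production:
        sec:
          # ...existing connection settings...
          database_tasks: true  # let rake db:* tasks manage the sec database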

  3. 🔪Playbook-Runner : Block writes to main-owned tables in the sec cluster and to sec-owned tables in the main cluster by running this on the console node

    gitlab-rake gitlab:db:lock_writes
  4. [ ] 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec

    knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni-sec.service.consul after switchover!
  5. 🔪 Playbook-Runner : Verify reverse logical replication lag is low on patroni-sec leader:

    • ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
      • sudo gitlab-psql
        • select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1;

Post Switchover QA Tests

  • Smoke Tests
  1. Start Post Switchover QA
    1. 🏆 Quality On-Call : Full E2E suite against the environment that was decomposed: Staging: Four hourly smoke tests, and Daily Full QA suite
  2. 🏆 Quality On-Call : (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
    1. 🏆 Quality On-Call : If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine that the failure is unrelated, the team decides on declaring an incident and following the incident process.

Rollback

Estimated Time to Complete (mins) - 120

  • Rollback (required for testing!)
  1. 🔪 Playbook-Runner : Monitor which pgbouncer pool has connections [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]
ROLLBACK – execute!

Goal: Set gstg-main cluster as Primary cluster

  1. 🔪Playbook-Runner : Verify reverse logical replication lag is low on the patroni-sec leader. This must be done using commands run on the database, not the graph. This must be done by a human. This must be done even if you have previously checked replication lag:

    ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
    sudo gitlab-psql
    select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
      from pg_replication_slots
      where slot_name like 'logical_replication_slot%'
      order by 1 desc limit 1;
  2. 🔪Playbook-Runner : Execute switchover_rollback.yml playbook to rollback to MAIN cluster:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml \
      switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gstg_sec_$(date +%Y%m%d).log
  3. 🔪Playbook-Runner : Unlock writes to main-owned tables in the sec cluster and to sec-owned tables in the main cluster by running this on the console node

    gitlab-rake gitlab:db:unlock_writes
  4. 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after rollback

    knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni.service.consul after rollback!
  5. 🔪Playbook-Runner : Check WRITES going to the SOURCE cluster, patroni-main-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]

  6. 🔪Playbook-Runner : Verify forward logical replication lag is low on patroni-main leader:

    • ssh patroni-main-v16-04-db-gstg.c.gitlab-staging-1.internal
      • sudo gitlab-psql
        • select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1;

Smoke Tests

  1. 🏆 Quality On-Call : Confirm that our smoke tests are still passing

Switchover - Take 2

  • Phase 7 - switchover take 2
  1. 🔪 Playbook-Runner : Run Ansible playbook for Database Decomposition for the gstg-sec cluster:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gstg-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_sec_$(date +%Y%m%d).log
  2. 🔪Playbook-Runner : Block writes to main-owned tables in the sec cluster and to sec-owned tables in the main cluster by running this on the console node

    gitlab-rake gitlab:db:lock_writes
  3. 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec

    knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni-sec.service.consul after switchover!
  • Persist Correct configurations
  1. [ ] 🔪 Playbook-Runner : Revert MR for the GSTG-CNY configuration so it uses global config

    1. MR for k8s-workload: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4336 (merged) gitlab-com/gl-infra/k8s-workloads/gitlab-com!4341 (merged)
  2. [ ] 🔪 Playbook-Runner : Merge the MRs that reconfigure patroni/pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipelines. If the MRs were merged and the pipeline failed, revert the MR with the failed pipeline and get it merged properly.

    1. MR for pgbouncer-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5858
    2. MR for patroni-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5865
  3. 🔪 Playbook-Runner : Validate chef pipelines finished correctly on ops:

  4. 🔪 Playbook-Runner : Remove standby_cluster config from patroni.yml (as needed)

    knife ssh "role:gstg*patroni*sec*" "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
    # should return nothing
    • If required, ssh to each patroni-sec host and remove the standby_cluster configuration from patroni.yml (a sketch follows below)
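    A minimal sketch of the manual removal (the cluster name passed to gitlab-patronictl is an assumption; confirm it with gitlab-patronictl list first):

      ssh patroni-sec-v16-03-db-gstg.c.gitlab-staging-1.internal
      sudo $EDITOR /var/opt/gitlab/patroni/patroni.yml   # delete the standby_cluster: block
      sudo gitlab-patronictl reload patroni-sec-v16      # apply the change; cluster name is illustrative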
  5. 🔪 Playbook-Runner : Run chef-client on one pgbouncer host and verify the configuration was not changed (changes require a reload to migrate traffic, so check nothing changed. If needed, revert the MR and update to resolve)

    knife ssh "role:gstg*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni-sec.service.consul after switchover!
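    For the chef-client run itself, a minimal sketch (the host selection is illustrative, and the enable wrapper mirrors the disable command used during prep, so treat its path as an assumption):

      knife ssh "name:pgbouncer-sec-01-db-gstg*" "sudo /usr/local/bin/chef-client-enable && sudo chef-client"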
  6. 🔪 Playbook-Runner : Run chef-client on the backup patroni-sec host (patroni-sec-v16-02-db-gstg) and verify the configuration was not changed.

    knife ssh patroni-sec-v16-02-db-gstg.c.gitlab-staging-1.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
    # should not return a value!  Stop and investigate if this isn't correct!
  7. 🔪 Playbook-Runner : Run chef-client on the leader patroni-sec host (patroni-sec-v16-01-db-gstg) and verify the configuration was not changed.

    knife ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
    # should not return a value
  8. 🔪 Playbook-Runner : Check WRITES going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]

  9. 🔪 Playbook-Runner : Check READS going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_reads][monitoring_user_tables_reads].

  10. 🔪 Playbook-Runner : Confirm chef-client is ENABLED in all nodes [monitoring_chef_client_enabled][monitoring_chef_client_enabled]

  11. 🔪 Playbook-Runner : Start cron.service on all gstg-sec nodes:

    knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
    knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl start cron.service"
    knife ssh "role:gstg-base-db-patroni-sec-v16" "sudo systemctl is-active cron.service"
  • Enable databaseTasks for k8s workloads
  1. 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for k8s nodes
    1. MR for k8s-workloads: SecDecomp GSTG Phase 7 - enable databaseTasks f... (gitlab-com/gl-infra/k8s-workloads/gitlab-com!4333 - merged)
  • Enable databaseTasks for deploy nodes
  1. 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for deploy nodes
    1. MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5842

Post Switchover QA Tests

  • Wrapping Up
  1. Start Post Switchover QA

    1. 🏆 Quality On-Call : Full E2E suite against the environment that was decomposed: Staging: Daily Full QA suite
  2. 🔪 Playbook-Runner : Create the wal-g daily restore schedule for the [gstg] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules

    1. Change the following variables:
  3. 🏆 Quality On-Call : (after an hour): Check that the Smoke and Full E2E suites have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.

    1. 🏆 Quality On-Call : If the Smoke or Full E2E tests fail, Quality performs an initial triage of the failure. If Quality cannot determine that the failure is unrelated, the team decides on declaring an incident and following the incident process.
  4. 🔪 Playbook-Runner : Re-enable DDL by turning off the disallow_database_ddl_feature_flags flag:

    1. Turn the flag off by typing the following into #production:
      1. /chatops run feature set disallow_database_ddl_feature_flags false --staging
  5. ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery

    Hi @release-managers :waves:, 
    
    Sec Decomp switchover/rollback/switchover has been completed and deployments may resume!
    
    See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639. :bow:
  6. ☎️ Comms-Handler : Inform the database team at #g_database_frameworks and #g_database_operations

    Hi,
    
    Please note that we have completed the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition. We are therefore re-enabling the `execute_batched_migrations_on_schedule` and `execute_background_migrations` tasks, reindexing, the async_foreign_key and async_index features, and partition_manager_sync_partitions in the `STAGING` environment.
    
    Thanks!

Extra details

In case the Playbook-Runner is disconnected

As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in different ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user.

  • ssh host
  • Add your public SSH key to /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys
  • sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19639 so that we don't override the above
  • ssh -A PREVIOUS_PLAYBOOK_USERNAME@host
  • echo $SSH_AUTH_SOCK
  • tmux attach -t 0
  • export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>
  • <ctrl-b> :
  • set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>
  • export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
  • <ctrl-b> :
  • set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>