
[GPRD][Sec Decomp] 2025-04-19 05:00 UTC: Decompose GitLab.com's PostgreSQL Database into Main and Sec

Production Change

NOTE: This issue has been copied to: https://ops.gitlab.net/gitlab-com/gl-infra/production/-/issues/12 to ensure availability in the event of an unplanned outage of gitlab.com. Execution of the CR will take place via the ops CR.

Change Summary

The Sec DB Decomposition Working Group aims to move Sec tables to a separate database, similar to what was done for CI tables in Decompose GitLab.com's database to improve scal... (gitlab-org&6168 - closed) (Related CR).

Approximately 25% of all writes are caused by Sec-related features. In order to scale GitLab's database capacity, we are decomposing the PostgreSQL main cluster into two clusters: A Sec cluster (sec) for high-write Sec-related features and a Main cluster for other features (main). By functionally decomposing the database, we increase GitLab's database capacity by roughly 2x.

Further details available in Rollout Epic and most recent status update.

Phases

Click to expand overview diagrams

Before

Phase7.0

After

Phase7.9

For IMOC

Timing

🕘 Planned Start: Saturday, 2025-04-19, 05:00am UTC

🕔 Planned End: Saturday, 2025-04-19, 08:00am UTC

Should this maintenance appear on our Status Page?

⚠️ If Yes, add the CMOC Required label to this issue ⚠️

  • Yes
  • No

Will the CMOC need to be actively engaged during the maintenance window?

  • Yes
  • No

Will it require downtime?

  • Yes
  • No

Change Details

  1. Services Impacted - Service::Patroni Service::PatroniSec Service::Web Service::Sidekiq Service::API
  2. Change Technician - @jjsisson
  3. Change Reviewer - @rhenchen.gitlab @bprescott_ @zbraddock
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-19 05:00 UTC
  5. Time tracking - 120 minutes
  6. Downtime Component - NONE

Staffing

Role Assigned To
🐺 Coordinator @theoretick
🔪 Playbook-Runner @jjsisson
🐘 Database-Wrangler TBD
🏆 Developer Experience @jay_mccure
🎩 IMOC Can be from PD schedule
📣 CMOC Can be from PD schedule
🚑 EOC Can be from PD schedule
📐 Dev Functional Lead @ghavenga
💾 Database Maintainer @ghavenga

Communications Plan

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
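
For this CR, the maintenance window is signalled to monitoring through the gitlab_maintenance_mode metric exposed by the node exporter on the console VM (the exact scheduled commands are in the "Phase 7 – execute!" and "Wrapping Up" sections below). A minimal sketch of how that flag is toggled, using the same textfile-collector path used elsewhere in this issue:

  # 1 = maintenance window active, 0 = normal operation
  sudo su -
  echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
  # confirm what the node exporter will scrape
  cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom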

Detailed steps for the change

Note: This CR will be copied to ops.gitlab.net, where it will be utilized in the event of unexpected downtime for gitlab.com. Link to gitlab.com CR: TBD

Collaboration

During the change window, the rollout team will collaborate using the following communications channels:

App Direct Link
Slack #g_database_operations
Video Call TBD

Immediately

Perform these steps when the issue is created.

  • 🐺 Coordinator : Fill out the names of the rollout team in the table above.

Support Options

Provider Plan Details Create Ticket
Google Cloud Platform Gold Support 24x7, email & phone, 1hr response on critical issues Create GCP Support Ticket

Entry points

Entry point Before Blocking mechanism Allowlist QA needs Notes
Pages Available via *.gitlab.io, and various custom domains Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes N/A N/A

Database hosts

Accessing the rails and database consoles

  • rails: ssh $USER-rails@console-01-sv-gprd.c.gitlab-production.internal
  • main db replica: ssh $USER-db@console-01-sv-gprd.c.gitlab-production.internal
  • main db primary: ssh $USER-db-primary@console-01-sv-gprd.c.gitlab-production.internal
  • main db psql: ssh -t patroni-main-v16-103-db-gprd.c.gitlab-production.internal sudo gitlab-psql
  • sec db replica: ssh $USER-db-sec@console-01-sv-gprd.c.gitlab-production.internal
  • sec db primary: ssh $USER-db-sec-primary@console-01-sv-gprd.c.gitlab-production.internal
  • sec db psql: ssh -t patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-psql
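
  After connecting, a quick sanity check that you landed on the cluster and role you expect (plain PostgreSQL, nothing specific to this runbook):

    gitlab-psql -c "select current_database(), pg_is_in_recovery();"
    # pg_is_in_recovery() returns 'f' on a primary and 't' on a replica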

Dashboards and debugging

These dashboards might be useful during the rollout: postgresql: Database Decomposition using logical

Production

Destination db: sec

Source db: main

Repos used during the rollout

The following Ansible playbooks are referenced throughout this issue:


High level overview

This gives a high-level overview of the procedure.

Decomposition Flowchart
flowchart TB
    subgraph Prepare new environment
    A[Create new cluster sec as a carbon copy of main] --> B
    B[Attach sec as a standby-only-cluster to main via physical replication] --> C
    end
    C[Make sure both clusters are in sync] --> D1
    subgraph Break Physical Replication: ansible-playbook physical-to-logical.yml
    D1[Disable Chef] --> D2
    D2[Perform clean shutdown of sec] --> D3
    D3[On main, create a replication slot and publication FOR ALL main TABLES; remember its LSN] --> D4
    D4[Configure recovery_target_lsn on sec] --> D5
    D5[Start sec] --> D6
    D6[Let sec reach the slot's LSN, still using physical replication] --> D7
    D7[Once slot's LSN is reached, promote sec leader] --> D9
    D9[Create logical subscription with copy_data=false] --> D10
    D10[Let sec catch up using logical replication] --> H
    end
    subgraph Redirect RO to sec
    H[Redirect RO only to sec] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[DBRE verify E2E tests run as expected with DevEx help] --"Normal"--> T
    S --"Abnormal"-->GR
    end
    T[Switchover: Redirect RW traffic to sec] --> U1
    subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal]--"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run Chef-Client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
    end
    subgraph GR[Graceful Rollback - no data loss]
    GR1[Start graceful rollback]
    end
    subgraph LR[Fix forward]
    LR1[Fix all issues] -->LR2
    LR2[Return to last failed step]
    end

Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
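
Conceptually, the conversion the playbook performs reduces to a handful of PostgreSQL commands, matching the flowchart nodes above. The sketch below is illustrative only — slot, publication and subscription names, hosts and connection strings are placeholders, not the identifiers the playbook actually uses:

  # On the main (source) primary: create the logical slot and publication,
  # and note the slot's LSN (the "remember its LSN" step).
  gitlab-psql -c "select pg_create_logical_replication_slot('sec_decomp_slot', 'pgoutput');"
  gitlab-psql -c "create publication sec_decomp_pub for all tables;"
  gitlab-psql -c "select confirmed_flush_lsn from pg_replication_slots where slot_name = 'sec_decomp_slot';"

  # On the sec (target) leader: replay physically up to that LSN
  # (recovery_target_lsn via Patroni/PostgreSQL config), then promote.

  # On the promoted sec leader: attach to the existing slot without copying
  # data, so logical replication resumes exactly where physical replication stopped.
  gitlab-psql -c "create subscription sec_decomp_sub connection 'host=<main-primary> dbname=<dbname>' publication sec_decomp_pub with (copy_data = false, create_slot = false, slot_name = 'sec_decomp_slot');"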

Prep Tasks

  • [-] SWITCHOVER minus 1 week (2025-04-12 13:00 UTC)
  1. ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery .
    • Message:
      Hi @release-managers :waves:,
      
      We will be undergoing scheduled maintenance on our MAIN and SEC database layers in `PRODUCTION`. The operational lock and PCL will start at 2025-04-19 05:00 UTC and should finish at 2025-04-19 17:00 UTC (including the performance regression observability period). We would like to confirm that deployments affecting the MAIN and SEC database clusters need to be stopped during the window. All details can be found in the CR. Please be so kind as to comment your acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
  • SWITCHOVER minus 3 days (2025-04-16 13:00 UTC)
  1. PRODUCTION ONLY ☎️ Comms-Handler : Share message from #whats-happening-at-gitlab to the following channels:

    • #infrastructure-lounge (cc @sre-oncall)
    • #g_delivery (cc @release-managers)
    • #support_gitlab-com (Inform Support SaaS team)
      • Share with team a link to the change request regarding the maintenance
  2. 🏆 DevEx On-Call : Check that you have Maintainer or Owner permission in https://ops.gitlab.net/gitlab-org/quality to be able to trigger Smoke QA pipeline in schedules (Staging, Production). Reach out to Test Platform to get access if you don't have permission to trigger scheduled pipelines in the linked projects.

  • PCL Start time (2025-04-19 05:00 UTC) - DECOMPOSITION minus 4 hours
  1. 🔪 Playbook-Runner : Ensure the CR is reviewed by the 🚑 EOC

  2. ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery the operational lock the MAIN and SEC database

    Hi @release-managers :waves:,
    As scheduled, we have started the Deployment Hard PCL and enabled the DDL block feature flag for the Decomposition in the MAIN and SEC databases in the GPRD environment, until 2025-04-19 17:00 UTC.
    If there’s any incident with a potential need to revert/apply db-migrations, please reach out to @dbo members during the weekend; they are on call and will evaluate whether there is any impact on this change.
    All details can be found in the CR - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
  3. ☎️ Comms-Handler : Inform the database teams at #g_database_operations and #g_database_frameworks

    Hi @dbo and @db-team,
    
    Please note that we started the operational block for the `MAIN` and `SEC` database decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the GPRD environment.
    We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-19 17:00 UTC.
    
    Thanks!
  4. 🔪 Playbook-Runner : Disable the DDL-related feature flags:

    1. Disable feature flags by typing the following into #production:
      1. /chatops run feature set disallow_database_ddl_feature_flags true
  5. 🏆 DevEx On-Call : Confirm that QA tests are passing as a pre-decomp sanity check

    1. 🏆 DevEx On-Call : Confirm that smoke QA tests are passing on the current cluster by checking latest status for Smoke Type tests in Production and Canary Allure reports listed in QA pipelines.
      • 🏆 DevEx On-Call : Trigger the Smoke E2E suite against the environment that will be decomposed: Production: Four hourly smoke tests. This has an estimated duration of 15 minutes.
      • 🏆 DevEx On-Call : If the smoke tests fail, we should re-run the failed job to see if it is reproducible.
      • 🏆 DevEx On-Call : In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.

Prepare the environment

  1. [ ] 🔪 Playbook-Runner : Check that all needed MRs are rebased and contain the proper changes.

    1. Separate gitlab-sec DB connection for teleport-ro nodes https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5870
    2. GPRD-CNY MR, to add sec configuration to gprd-cny: MR for gprd-cny: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4342 (merged)
    3. GPRD-SIDEKIQ MR, to move sec read-only over to sec-db-replica
    4. GPRD WEB MR, to move sec read-only over to sec-db-replica
    5. GPRD-BASE MR, to move sec read-only over to sec-db-replica
    6. Make configuration changes for pgbouncer{,-sidekiq}-sec permanent https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5872
    7. GPRD-PATRONI-SEC MR, to remove standby configuration https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5873
    8. K8s MR to set databasetasks: true gitlab-com/gl-infra/k8s-workloads/gitlab-com!4345 (merged)
    9. Chef MR to set databasetasks: true https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5874
  2. 🔪 Playbook-Runner : Get the console VM ready for action

    • SSH to the console VM in gprd ssh console-01-sv-gprd.c.gitlab-production.internal

    • Configure dbupgrade user

      • Disable screen sharing to reduce risk of exposing private key
      • Change to user dbupgrade sudo su - dbupgrade
      • Copy dbupgrade user's private key from 1Password to ~/.ssh/id_dbupgrade
      • chmod 600 ~/.ssh/id_dbupgrade
      • Use key as default ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa
      • Repeat the same steps on the target leader (it also has to have the private key)
      • Re-enable screen sharing
    • Create an access_token with at least read_repository for the next step

    • Clone repos:

      rm -rf ~/src \
        && mkdir ~/src \
        && cd ~/src \
        && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
        && cd db-migration \
        && git checkout master
    • Ensure you have Ansible installed:

      python3 -m venv ansible
      source ansible/bin/activate
      python3 -m pip install --upgrade pip
      python3 -m pip install ansible
      python3 -m pip install jmespath
      ansible --version
    • Ensure that Ansible can talk to all the hosts in gprd-main and gprd-sec

      cd ~/src/db-migration/pg-physical-to-logical
      ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-sec-decomp.yml all -m ping

      You shouldn't see any failed hosts!

    • Ensure that Ansible is run via the modern Ansible from the virtualenv on the VM, not the old system Ansible. This needs to hold true for the entire CR. You can tell you are running the right Ansible because the prompt shows (ansible) at the beginning of the line in your terminal, or by checking with the following command:

    which ansible
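    # expected: a path inside the 'ansible' virtualenv created above, not the system /usr/bin/ansible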
  3. 🔪 Playbook-Runner : Add the following silences at https://alerts.gitlab.net to silence alerts in main and sec nodes until 4 hours after the switchover time:

    • Start time: 2025-04-19T13:00:00.000Z
    • Duration: 4h
    • Matchers
      • main
        • env="gprd"
        • fqdn=~"patroni-main-v16.*"
      • sec
        • env="gprd"
        • fqdn=~"patroni-sec-v16.*"
  1. 🐺 Coordinator : Get a green light from the 🚑 EOC

SEC Decomposition Prep Work

  • Prepare Environment
  1. [ ] ☎️ Comms-Handler : Coordinate with @release-managers at #g_delivery

    Hi @release-managers :waves:, 
    We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` database in the `PRODUCTION` environment, until 2025-04-19 17:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
  2. [ ] ☎️ Comms-Handler : Inform the database team at #g_database_frameworks and #g_database_operations

    Hi @dbo and @db-team,
    
    Please note that we started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `PRODUCTION` environment.
    
    We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-19 17:00 UTC
    
       Thanks!
  3. 🔪 Playbook-Runner : Disable the DDL-related feature flags:

    1. Disable feature flags by typing the following into #production:
      1. /chatops run feature set disallow_database_ddl_feature_flags true
  • Prechecks
  1. 🐺 Coordinator : Check if disallow_database_ddl_feature_flags is ENABLED:

    • On slack /chatops run feature get disallow_database_ddl_feature_flags
  2. 🔪 Playbook-Runner : ADD the following silences at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-v16 until the end of the maintenance:

    • Start time: 2025-04-19T09:00:00.000Z
    • Duration: 56h
      • env="gprd"
      • type="gprd-patroni-main-v16"
      • alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"
  3. 🔪 Playbook-Runner : ADD the following silences at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-sec-v16 until the end of the maintenance:

    • Start time: 2025-04-19T09:00:00.000Z
    • Duration: 56h
      • env="gprd"
      • type="gprd-patroni-sec-v16"
      • alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"
  4. 🔪 Playbook-Runner : Monitor what pgbouncer pool has connections: [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]

  5. 🔪 Playbook-Runner : Disable chef on the main db cluster, sec db cluster and sec pgbouncers

   knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
   knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
   knife ssh "role:gprd*pgbouncer*sec*" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
  1. 🔪 Playbook-Runner : Check if anyone except application is connected to source primary and interrupt them:

    1. Confirm the source primary (note this will only run on 101, currently)
    knife ssh "role:gprd-base-db-patroni-main-v16" "sudo gitlab-patronictl list"
    1. Login to source primary
      ssh patroni-main-v16-103-db-gprd.c.gitlab-production.internal
    2. Check all connections that are not gitlab:
      gitlab-psql -c "
        select
          pid, client_addr, usename, application_name, backend_type,
          clock_timestamp() - backend_start as connected_ago,
          state,
          left(query, 200) as query
        from pg_stat_activity
        where
          pid <> pg_backend_pid()
          and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
          and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
          and application_name <> 'Patroni'
        "
    3. If there are sessions that potentially can perform any writes, spend up to 10 minutes to make an attempt to find the actors and ask them to stop.
    4. Finally, terminate all the remaining sessions that are not coming from application/infra components and potentially can cause writes:
      gitlab-psql -c "
        select pg_terminate_backend(pid)
        from pg_stat_activity
        where
          pid <> pg_backend_pid()
          and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
          and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
          and application_name <> 'Patroni'
        "
  2. 🔪 Playbook-Runner : Run physical_prechecks playbook:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gprd_sec_$(date +%Y%m%d).log
  3. 🔪 Playbook-Runner : Check pgpass, .pgpass are the same on both the source and target cluster primaries.

ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass"

ssh patroni-main-v16-03-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass"
  1. [ ] 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec

    knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni.service.consul prior to switchover!

Break physical replication and configure logical replication

  • Convert Physical Replication to Logical
  1. 🔪 Playbook-Runner : Run physical-to-logical playbook:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml physical_to_logical.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gprd_sec_$(date +%Y%m%d).log
  2. 🔪 Playbook-Runner : Verify sec cluster is no longer a Standby Leader:

    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  3. 🔪 Playbook-Runner : Remove the standby_cluster configuration for sec in chef:

    • Merge chef MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5873
    • Verify chef MR pipeline completes on ops: https://ops.gitlab.net/gitlab-com/gl-infra/chef-repo/-/pipelines
    • enable and run chef-client on patroni-sec leader node:
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo chef-client-enable"
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo chef-client"
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
      # should return no values for last command!  Stop if `standby_cluster` is in output!
    • enable and run chef-client on patroni-sec remaining nodes:
      knife ssh "role:gprd*patroni*sec*" "sudo chef-client-enable"
      knife ssh "role:gprd*patroni*sec*" "sudo chef-client"
      knife ssh "role:gprd*patroni*sec*" "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
      # should return no values for last command!  Stop if `standby_cluster` is in output!
  4. 🔪 Playbook-Runner : Verify sec cluster is still healthy:

    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
  5. 🔪 Playbook-Runner : Re-disable chef on patroni-sec:

    knife ssh "role:gprd*patroni*sec*" "sudo chef-client-disable 'SEC Decomp #19581 '"

Read-Only Traffic Configs

  • Read Only Traffic Switchover
  • Console Node Rollout
  1. [ ] 🔪 Playbook-Runner : Switchover gprd rails console (teleport) chef connection configuration to new patroni-sec-v16 DB. Writes will go through PGBouncer host to main and reads to sec replicas.

  2. 🔪 Playbook-Runner : Simple checks that the application sees the proper configuration. Expected: sec load balancer and sec_replica for the read connection

    [1] pry(main)> ApplicationRecord.load_balancer.name
    => :main
    [2] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.name
    => :sec
    [3] pry(main)> ApplicationRecord.connection.pool.db_config.name
    => "main"
    [4] pry(main)> Gitlab::Database::SecApplicationRecord.connection.pool.db_config.name
    => "sec"
    [5] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.pool.db_config.name }
    => "sec_replica"
    [6]  Gitlab::Database::SecApplicationRecord.load_balancer.read_write { |connection| connection.pool.db_config.name }
    => "sec"
  3. 🔪 Playbook-Runner : Simple checks that the application can still talk to the sec_replica database. Expected: db_config_name:sec_replica

    [10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
    [11] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM vulnerability_user_mentions") }
      (20.3ms)  SELECT COUNT(*) FROM vulnerability_user_mentions /*application:console,db_config_name:sec_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
    => #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>
  • Web Node Canary Rollout
  1. 🔪 Playbook-Runner : Switchover gprd web configuration to the new pgbouncer-sec
  • Verify connectivity, monitor pgbouncer connections
  • Observe logs and prometheus for errors
  • Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present (web and sidekiq)
  • Sidekiq Node Rollout
  1. 🔪 Playbook-Runner : Switchover gprd sidekiq configuration to the new pgbouncer-sec
  • Verify connectivity, monitor pgbouncer connections
  • Observe logs and prometheus for errors
  • Observable Logs and Prometheus Metrics

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present (web and sidekiq)
  • Web Node Rollout
  1. 🔪 Playbook-Runner : Switchover gprd web configuration to the new pgbouncer-sec
  2. 🔪 Playbook-Runner : Verify connectivity, monitor pgbouncer connections
  3. 🔪 Playbook-Runner : Observe logs and prometheus for errors
  4. 🔪 Playbook-Runner : Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gprd-base.
    1. https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5871
    2. run chef-client on the console node
Revert MR for the GPRD-CNY configuration
  1. 🔪 Playbook-Runner : Revert MR for the GPRD-CNY configuration so it uses global config
  2. MR for k8s-workload: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4342 (merged)
  • Observable Logs and Prometheus Metrics

Observable logs

All logs will split db_*_count metrics into separate buckets describing each used connection:

  1. 🐺 Coordinator : Ensure json.db_sec_count : * logs are present (web and sidekiq)

Observable Prometheus metrics

  • Verify Read Traffic to patroni-sec
  1. 🔪 Playbook-Runner : monitoring_pgbouncer_gitlab_user_conns

    • Ensure traffic is now being seen for monitoring_pgbouncer_gitlab_user_conns

Phase 7 – execute!

  • Phase 7 - switchover
  1. 🔪 Playbook-Runner : Schedule a job to enable gitlab_maintenance_mode into a node exporter, during the upgrade window:

    • SSH to a console VM in gprd (eg. ssh console-01-sv-gprd.c.gitlab-production.internal )
      • Schedule jobs:
        sudo su -
        echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom | at -t 202504191300
        echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom | at -t 202504191700
        cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
        atq
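        # to inspect what a queued job will run: at -c <job-number>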
  2. PRODUCTION ONLY ☁️ 🔪 Playbook-Runner : Create a maintenance window in PagerDuty with the following:

  3. 🔪 Playbook-Runner : Run Ansible playbook for Database Decomposition for the gprd-sec cluster:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_sec_$(date +%Y%m%d).log

     Midway through the playbook, it will ask "Are you sure you want to continue resuming on pgbouncer?". This is the time to verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec as below, before answering 'Yes' to resume the pgbouncers.

  4. 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec

    knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni-sec.service.consul after switchover!
  5. 🔪Playbook-Runner : Edit the /var/opt/gitlab/gitlab-rails/etc/database.yml file on the console node to set database_tasks: true for the sec cluster

  6. 🔪 Playbook-Runner : Block writes to main-cluster tables on the sec cluster and sec-cluster tables on the main cluster by running this on the console node

    • single threaded
    gitlab-rake gitlab:db:lock_writes
    • multi-threaded
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
  7. 🔪 Playbook-Runner : Verify reverse logical replication lag is low on patroni-sec leader:

    • ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal
      • sudo gitlab-psql
        • select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1;
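    • The result is the reverse-replication lag in bytes (current WAL position minus the subscriber's confirmed flush LSN); it should be at or near zero before proceeding.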
  • Persist Correct configurations
  1. [ ] 🔪 Playbook-Runner : Merge the MR that reconfigures patroni/pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipeline. If the MR was already merged prematurely, revert it and then get it merged properly.

    1. MR for pgbouncer-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5872
  2. 🔪 Playbook-Runner : Run chef-client on one pgbouncer host and verify the configuration was not changed (a change would require a reload to migrate traffic, so confirm nothing changed; if needed, revert the MR and update it to resolve)

    knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni-sec.service.consul after switchover!
  3. 🔪 Playbook-Runner : Check WRITES going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]

  4. 🔪 Playbook-Runner : Check READS going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_reads][monitoring_user_tables_reads].

  5. 🔪 Playbook-Runner : Re-enable Chef in all nodes:

    knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client-enable"
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client-enable"
    knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client-enable"
  6. 🔪 Playbook-Runner : Confirm chef-client is ENABLED in all nodes [monitoring_chef_client_enabled][monitoring_chef_client_enabled]

  • Enable databaseTasks for k8s workloads
  1. 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for k8s nodes
    1. MR for k8s-workloads: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4345 (merged)
  • Enable databaseTasks for deploy nodes
  1. 🔪 Playbook-Runner : Merge the MR that enables db_database_tasks for deploy nodes
    1. MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5874
    2. run chef-client on a deploy node and check it worked (database_tasks should no longer be set to false in /var/opt/gitlab/gitlab-rails/etc/database.yml)
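
    A quick way to check this on the deploy node (a sketch; the path is the database.yml referenced earlier in this CR):

      sudo grep -n 'database_tasks' /var/opt/gitlab/gitlab-rails/etc/database.yml
      # the sec entry should now show database_tasks: true (or no database_tasks: false override)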

Post Switchover QA Tests

  • Post Switchover QA Testing
  1. Start Post Switchover QA

    1. 🏆 DevEx On-Call : Full E2E suite against the environment that was decomposed: Production: Full run - manual
      • These tests take 1+ hour to run, so you can continue with the Wrapping Up steps and check the test results later.

Communicate

  • Communication
  1. PRODUCTION ONLY 📣 CMOC : Post update from Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events

    • Message:
      GitLab.com SEC database decomposition has been performed. We'll continue to monitor for any performance issues until the end of the maintenance window. Thank you for your patience. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581>
  • Check @gitlab retweeted from @gitlabstatus
  1. PRODUCTION ONLY ☎️ Comms-Handler : In the same thread from the earlier post, post the following message and click on the checkbox "Also send to X channel" so the threaded message would be published to the channel:
    • Message:
      :done: *GitLab.com database layer maintenance decomposition is complete now.* :celebrate:
      We’ll continue to monitor the platform to ensure all systems are functioning correctly.
      • #whats-happening-at-gitlab
      • #infrastructure-lounge (cc @sre-oncall)
  • Wrapping Up
  1. PRODUCTION ONLY 🔪 Playbook-Runner : If the scheduled maintenance is still active in PagerDuty, click on Update then End Now.

  2. [ ] 🔪 Playbook-Runner : Remove silences of fqdn=~"patroni-main-v16.*" and fqdn=~"patroni-sec-v16.*" we created during this process from https://alerts.gitlab.net

  3. 🔪 Playbook-Runner : Create the wal-g daily restore schedule for the [gprd] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules

    1. Change the following variables:
  4. 🐺 Coordinator : Check if gitlab_maintenance_mode is DISABLED for gprd [monitoring_gitlab_maintenance_mode][monitoring_gitlab_maintenance_mode]

    • If it is not disabled, ask the 🔪 Playbook-Runner to manually disable it by:
      • SSH to a console VM in gprd (eg. ssh console-01-sv-gprd.c.gitlab-production.internal )
        • Set gitlab_maintenance_mode=0 on node exporter :
          sudo su -
          echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
          cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
          atq
  5. 🏆 DevEx On-Call (after an hour): Check that the Smoke run (as executed via the MR enabling db_database_tasks for k8s nodes) and the Full run - manual have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.

    1. 🏆 DevEx On-Call : If the Smoke or Full E2E tests fail, DevEx performs an initial triage of the failure. If DevEx cannot determine that the failure is 'unrelated', the team decides on declaring an incident and following the incident process.

Close Rollback Window

  • SWITCHOVER plus 4 hours - Close PCL (2025-04-19 17:00 UTC)
  1. 🔪 Playbook-Runner : Run Ansible playbook to Stop the Reverse Logical Replication:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
        -i inventory/gprd-sec-decomp.yml \
      stop_reverse_replication.yml 2>&1 \
    | ts | tee -a stop_reverse_replication_gprd_sec_$(date +%Y%m%d).log
  2. 🔪 Playbook-Runner : On the SOURCE cluster patroni-main-v16 Leader/Writer, drop subscription (if still existing) for logical replication:

    • Check if the subscription still exists:
      gitlab-psql \
          -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription" 
  3. 🔪 Playbook-Runner : On the TARGET cluster patroni-sec-v16 Leader/Writer, drop publication and logical_replication_slot for reverse replication:

    • Check if the publication and replication slots still exist:
      gitlab-psql \
        -Xc "select pubname from pg_publication" \
        -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
  4. [ ] 🔪 Playbook-Runner : Enable feature flags by typing the following into #production:

    • PRODUCTION:
      1. /chatops run feature set disallow_database_ddl_feature_flags false
    1. 🐺 Coordinator : Check if the underlying DDL lock FF is DISABLED:

      • On slack /chatops run feature get disallow_database_ddl_feature_flags should return DISABLED
    2. ☎️ Comms-Handler : Inform the database team that the CR is completed at #g_database_operations and #g_database_frameworks:

      Hi @dbo and @db-team,
      
      We are reaching out to inform you that we have completed the work for the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, async_index and partition_manager_sync_partitions features and tasks in the `gprd` environment.
      
      Thanks!
  5. PRODUCTION ONLY 📣 CMOC : End of maintenance from Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events

    • Click "Finish Maintenance" and send the following:
      • Message:
        GitLab.com scheduled maintenance for the MAIN and SEC database layers is complete. We'll continue to monitor the platform to ensure all systems are functioning correctly. Thank you for your patience.
    • Check @gitlab retweeted from @gitlabstatus
  6. ☎️ Comms-Handler : Inform @release-managers at #g_delivery about the end of the operational lock

    Hi @release-managers :waves:,
    
    We are reaching out to inform you that we have completed the work for the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR in our `gprd` SaaS environment. We are closing the operational block for deployments in the `MAIN` and `SEC` databases, so regular deployment operations can be fully resumed.
  7. 🔪 Playbook-Runner : Open a separate issue to create/rebuild the SEC DR Archive and Delayed replicas. It will be completed in the next couple of working days.

  8. 🐺 Coordinator : Mark the change request as /label ~"change::complete"

Rollback

Estimated Time to Complete (mins) - 120

  • Rollback (if required)
  1. PRODUCTION ONLY 📣 CMOC : Post an update from Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events

    • Message:

      Due to an issue during the planned maintenance for the database layer, we have initiated a rollback of the MAIN and SEC database layers; some performance impact might still be expected. We will provide an update once the rollback process is completed.
  2. PRODUCTION ONLY ☎️ Comms-Handler : Ask the IMOC or the Head Honcho if this message should be sent to any slack rooms:

    • #whats-happening-at-gitlab
    • #infrastructure-lounge (cc @sre-oncall)
    • #g_delivery (cc @release-managers)
  • There will be no rollback after closing the rollback window!
  1. 🔪 Playbook-Runner : Monitor what pgbouncer pool has connections [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]
ROLLBACK – execute!

Goal: Set gprd-main cluster as Primary cluster

  1. [ ] 🔪 Playbook-Runner : Verify reverse logical replication lag is low on the patroni-sec leader. This must be done using commands run on the database, not the graph. This must be done by a human. This must be done even if you have previously checked replication lag:

    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal
    sudo gitlab-psql
    select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1;
  2. 🔪 Playbook-Runner : Execute switchover_rollback.yml playbook to rollback to MAIN cluster:

    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml \
      switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gprd_sec_$(date +%Y%m%d).log

     Midway through the playbook, it will ask "Are you sure you want to continue resuming on pgbouncer?". This is the time to verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec as below, before answering 'Yes' to resume the pgbouncers.

  3. 🔪 Playbook-Runner : Verify configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after the rollback

    knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
    # should return master.patroni.service.consul after rollback!
  4. 🔪 Playbook-Runner : Unlock writes to main-cluster tables on the sec cluster and sec-cluster tables on the main cluster by running this on the console node

    single thread:

    gitlab-rake gitlab:db:unlock_writes

    multi-threaded:

    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
  5. 🔪 Playbook-Runner : Check WRITES going to the SOURCE cluster, patroni-main-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]

  6. 🔪 Playbook-Runner : Check READS going to the SOURCE cluster, patroni-main-v16: [monitoring_user_tables_reads][monitoring_user_tables_reads].

  7. 🔪 Playbook-Runner : On the TARGET cluster patroni-main-v16 Leader/Writer, drop subscription (if still existing) for logical replication:

    • Check if the subscription still exists:
      gitlab-psql \
          -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription" 
  8. 🔪 Playbook-Runner : On the SOURCE cluster patroni-sec-v16 Leader/Writer, drop publication and logical_replication_slot for reverse replication:

    • Check if the publication and replication slots still exist:
      gitlab-psql \
        -Xc "select pubname from pg_publication" \
        -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"

Complete the rollback

  1. 🏆 DevEx On-Call : Confirm that our smoke tests are still passing (continue the rollback as this might take an hour...)

  2. 🔪 Playbook-Runner : Revert all the applied MRs (the amount of MRs is variable depending on where the CR failed)

  3. [ ]

  4. 🔪 Playbook-Runner : Re-enable Chef in all nodes:

    knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client-enable"
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client-enable"
    knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client-enable"
  5. 🔪 Playbook-Runner : Confirm chef-client is ENABLED in all nodes [monitoring_chef_client_enabled][monitoring_chef_client_enabled]

  6. 🔪 Playbook-Runner : Run chef-client on Patroni Nodes:

    knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client"
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client"
    knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client"
  7. 🔪 Playbook-Runner : Confirm no errors while running chef-client [monitoring_chef_client_error][monitoring_chef_client_error]

  8. 🔪 Playbook-Runner : Shutdown the TARGET gprd-base-db-patroni-sec-v16 cluster to avoid any risk of splitbrain:

    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo systemctl stop patroni"
  9. PRODUCTION ONLY 📣 CMOC : Post update from Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events

    • Click "Finish Maintenance" and send the following:
      • Message:

        GitLab.com rollback for the database layer is complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
  10. PRODUCTION ONLY ☎️ Comms-Handler : Send the following message to slack rooms:

    GitLab.com rollback for the database layer is complete and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
    • #whats-happening-at-gitlab
    • #infrastructure-lounge (cc @sre-oncall)
    • #g_delivery (cc @release-managers)
  11. 🔪 Playbook-Runner : Enable feature flags by typing the following into #production:

    • PRODUCTION:
      1. /chatops run feature set disallow_database_ddl_feature_flags false
    1. ☎️ Comms-Handler : Inform the database team that the CR has been aborted and rolled back, at #g_database_operations and #g_database_frameworks:
      Hi @dbo and @db-team,
      
      We are reaching out to inform you that we have aborted and rolled back the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, async_index and partition_manager_sync_partitions features and tasks in the `gprd` environment.
      
      Thanks!
  12. ☎️ Comms-Handler : Inform @release-managers at #g_delivery about the end of the operational lock

    Hi @release-managers :waves:,
    
    We are reaching out to inform that we have aborted and rolled back the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR in our `gprd` SaaS environment. We are closing the operational block for deployments in the `MAIN` and `SEC` databases, so regular deployment operations can be fully resumed.
  13. 🔪 Playbook-Runner : Check if the underlying DDL lock FF is DISABLED:

    • On slack /chatops run feature get disallow_database_ddl_feature_flags should return DISABLED
  14. 🔪 Playbook-Runner : On two nodes, console and target leader, remove the private keys temporarily placed in ~dbupgrade/.ssh:

    rm ~dbupgrade/.ssh/id_rsa
    rm ~dbupgrade/.ssh/id_dbupgrade
  15. 🔪 Playbook-Runner : ADD the following silences at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-sec-v16 for 2 weeks (14 days = 336 hours)

    • Start time: 2025-04-19T13:00:00.000Z
    • Duration: 336h
      • env="gprd"
      • type="gprd-patroni-sec-v16"
      • alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"
  16. 🔪 Playbook-Runner : UPDATE the following silence at https://alerts.gitlab.net to silence alerts in v16 nodes for 2 weeks (14 days = 336 hours):

    • Start time: 2025-04-19T13:00:00.000Z

    • Duration: 336h

    • Matcher:

      • PRODUCTION
        • env="gprd"
        • fqdn=~"patroni-sec-v16.*"
  17. 🔪 Playbook-Runner : DELETE the following silences at https://alerts.gitlab.net

    • Matcher:

      • PRODUCTION
        • env="gprd"
        • fqdn=~"patroni-main-v16.*"
  18. 🐺 Coordinator : Check if gitlab_maintenance_mode is DISABLED for gprd [monitoring_gitlab_maintenance_mode][monitoring_gitlab_maintenance_mode]

    • If it is not disabled, ask the 🔪 Playbook-Runner to manually disable it by:
      • SSH to a console VM in gprd (eg. ssh console-01-sv-gprd.c.gitlab-production.internal )
        • Set gitlab_maintenance_mode=0 on node exporter :
          sudo su -
          echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
          cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
          atq
  19. 🐺 Coordinator : Mark the change request as /label ~"change::aborted"

Extra details

In case the Playbook-Runner is disconnected

As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in different ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user.

  • ssh host
  • Add your public SSH key to /home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys
  • sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 so that we don't override the above
  • ssh -A PREVIOUS_PLAYBOOK_USERNAME@host
  • echo $SSH_AUTH_SOCK
  • tmux attach -t 0
  • export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>
  • <ctrl-b> :
  • set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>
  • export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME
  • <ctrl-b> :
  • set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
    • Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.