Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
|  | @rhenchen |
|  | @bshah |
|  | @kwanyangu |
|  | @alexander-sosna |
|  | @anganga |
|  | @acunskis @ddavison |
|  | @kwanyangu |
|  | @cmarais @alejguer |
|  | @anganga @msmiley |
|  | @kwanyangu |
Link to gitlab.com CR: production#16266 (closed)
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
| App | Direct Link |
|---|---|
| Slack | #g_infra_database_reliability |
| Video Call | https://gitlab.zoom.us/j/97279198952?pwd=eStENnFtK3UxRFFoNU5wT0xFR2JHdz09 |
Immediately
Perform these steps when the issue is created.
- [ ] 🐺 Coordinator: Fill out the names of the rollout team in the table above.
Support Options
| Provider | Plan | Details | Create Ticket |
|---|---|---|---|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Entry points
| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes |
|---|---|---|---|---|---|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
Production
- rails: `ssh $USER-rails@console-01-sv-gprd.c.gitlab-production.internal`
- main db replica: `ssh $USER-db@console-01-sv-gprd.c.gitlab-production.internal`
- main db primary: `ssh $USER-db-primary@console-01-sv-gprd.c.gitlab-production.internal`
- ci db replica: `ssh $USER-db-ci@console-01-sv-gprd.c.gitlab-production.internal`
- ci db primary: `ssh $USER-db-ci-primary@console-01-sv-gprd.c.gitlab-production.internal`
- main db psql: `ssh -t patroni-main-2004-04-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
- ci db psql: `ssh -t patroni-ci-2004-05-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
- registry db psql: `ssh -t patroni-v12-registry-01-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
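For example, to confirm which host and role a psql console session landed on — a minimal sketch; the SQL is standard PostgreSQL, nothing here is specific to this runbook:

```bash
# Open psql on the main db psql entry point from the list above...
ssh -t patroni-main-2004-04-db-gprd.c.gitlab-production.internal sudo gitlab-psql
# ...then, inside psql, confirm the server address and whether it is a replica:
#   select inet_server_addr(), pg_is_in_recovery();
```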
Dashboards and debugging
These dashboards might be useful during the rollout:
Production
- PostgreSQL replication overview
- Triage overview
- Sidekiq overview
- Sentry (includes application errors):
- Workhorse: https://sentry.gitlab.net/gitlab/gitlab-workhorse-gitlabcom/
- Rails (backend): https://sentry.gitlab.net/gitlab/gitlabcom/
- Rails (frontend): https://sentry.gitlab.net/gitlab/gitlabcom-clientside/
- Gitaly (golang): https://sentry.gitlab.net/gitlab/gitaly-production/
- Gitaly (ruby): https://sentry.gitlab.net/gitlab/gitlabcom-gitaly-ruby/
- Logs (Kibana)
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Upgrade, Switchover & Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
High level overview
This section gives a high-level overview of the procedure.
Upgrade Flowchart
```mermaid
flowchart TB
  subgraph Prepare new environment
    A[Create new cluster $TARGET as a carbon copy of the one to upgrade, $SOURCE.] --> B
    B[Attach $TARGET as a standby-only cluster to $SOURCE via physical replication.] --> C
  end
  C[Make sure both clusters are in sync.] --> D1
  subgraph Upgrade["Upgrade: ansible-playbook upgrade.yml"]
    D1[Disable Chef] --> D
    D[Change from physical replication to logical.] --> E
    E[Make sure both clusters are in sync again.] --> G
  end
  G[Upgrade $TARGET to new version via pg_upgrade.] --> H
  subgraph Prepare switchover
    H[Make sure both clusters are in sync again.] --> I
    I[Merge Chef MRs so $TARGET uses roles for new PostgreSQL version] --> K
    K[Enable Chef, run chef-client] --> L
    L[Make sure Chef finished successfully and cluster is still operational] --> M
    M[Disable Chef again] --> N
  end
  N[Check metrics and sanity checks are as expected] --> O
  subgraph Switchover["Switchover: ansible-playbook switchover.yml"]
    O[Redirect RO traffic to $TARGET standbys in addition to $SOURCE] --> P
    P[Check if cluster is operational and metrics are normal] --"Normal"--> Q
    P --"Abnormal"--> GR
    Q[Redirect RO only to $TARGET] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[Quality team verify their tests run as expected] --"Normal"--> T
    S --"Abnormal"--> GR
  end
  T["Switchover: Redirect RW traffic to $TARGET"] --> U1
  subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run chef-client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
  end
  subgraph GR[Graceful Rollback - no data loss]
    GR1[Start graceful rollback]
  end
  subgraph LR[Fix forward]
    LR1[Fix all issues] --> LR2
    LR2[Return to last failed step]
  end
```
Sketches of the upgrade.yml actions can be found here: upgrade.yml.pdf
Prep Tasks
T minus 1 week (2023-09-02 14:00 UTC)

- [ ] ☎ Comms-Handler: Discuss scheduling of this CR and assess impact on deployments and releases with the Release Managers (@release-managers in Slack). Ask them to comment with approval on this issue.
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > Next week, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
T minus 3 days (2023-09-06 14:00 UTC)
- [ ] ☎ Comms-Handler: Coordinate with @release-managers to make sure deployments have been paused until the maintenance window ends. Kindly ask them to comment with approval on this issue.
  - Message:
    > Hi @release-managers :wave:, we would like to make sure that deployments have been stopped for the affected environments until 2023-09-09 19:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind as to comment with approval on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266. Be aware that on the first working day after the upgrade, performance is closely monitored and performance tuning might be required. Therefore, please continue to hold database migrations until the following Tuesday. :bow:
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > In 3 days, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send the tweet link from @gitlabstatus to the #social_media_action channel on Slack.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send on Slack #whats-happening-at-gitlab:
  - Message:
    > :loudspeaker: *Postgres upgrade for our main database clusters is scheduled for 2023-09-09 between 14:00 UTC and 19:00 UTC* :rocket: Taking place in 3 days' time! See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266> :hammer_and_wrench: *What to expect?* GitLab.com will be available but users may experience degraded performance during the maintenance window. If you experience any issues likely related to the upgrade in the next few days after the upgrade, please open an issue and reach the upgrade team in the Slack channel `#pg_upgrade`
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Share the message from #whats-happening-at-gitlab to the following channels:
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
  - [ ] #community-relations (inform the Marketing team)
  - [ ] #support_gitlab-com (inform the Support SaaS team)
  - [ ] Share with each team a link to the change request regarding the maintenance
- [ ] 🐘 DBRE: Create a C1 change request in the production repo and link to this issue. Example: production#8448 (closed)
  - [ ] Ensure the CR is reviewed by the 🚑 EOC
- [ ] 🐘 DBRE: Ensure this issue has been created on https://ops.gitlab.net/gitlab-com/gl-infra/db-migration, since gitlab.com could potentially be unavailable during the rollout!
- [ ] 🐬 SRE: Create a merge request that may be needed in case of rollback and link it in the rollback section below
- [ ] 🏆 Quality: Check that you have Maintainer or Owner permission in https://ops.gitlab.net/gitlab-org/quality to be able to trigger the Smoke QA pipeline in schedules (Staging, Production)
T minus 2 days (2023-09-07 14:00 UTC)
- [ ] 🐘 DBRE: Disable the DDL-related feature flags:
  1. [ ] Disable feature flags by typing the following into #production:
     - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags true`
- [ ] 🐘 DBRE: Inform the database team that the DDL feature flags have been disabled until the CR is complete. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  > Hi @gl-database, Please note that `execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions tasks will be disabled in the `PRODUCTION` environment, as we are carrying out Postgres upgrades to the database layer between `2023-09-09 14:00 UTC` and `2023-09-09 19:00 UTC`. We will re-enable the feature flags once work is complete. Thanks!
T minus 1 day (2023-09-08 14:00 UTC)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > Reminder: Tomorrow, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send a message to #social_media_action to retweet from the @gitlabstatus Twitter account.
  - Message:
    > Hi team, please retweet this from our status page to GitLab Twitter about the scheduled maintenance that is taking place tomorrow: {TWEET_LINK}
  - [ ] @gitlab retweeted from @gitlabstatus
- [ ] 🏆 Quality: Confirm that our smoke tests are passing on the current cluster
- [ ] 🐘 DBRE: Clean up the destination GCS backup location to avoid conflicts in wal-g (IMPORTANT: perform this action pairing with another DBRE/SRE to make sure that you are deleting the right location)
- [ ] 🐘 DBRE: Initiate a full backup (using wal-g) on the new v14 Patroni main cluster (to verify progress, see the log-tail sketch below):
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a wal-g backup:
    ```
    sudo su - gitlab-psql
    tmux new -s PGBasebackup
    nohup /opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1 &
    ```
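To confirm the backup is progressing, one option is to follow its log — the path comes from the command above; the exact completion message depends on the wal-g version:

```bash
# Follow the wal-g backup log; a finished run ends with a backup-push completion entry.
sudo tail -f /var/log/wal-g/wal-g_backup_push.log
```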
T minus 14 hours (2023-09-09 00:00 UTC)
Prepare the environment
- [ ] 🔪 Playbook-Runner: Get the console VM ready for action
  - [ ] SSH to the console VM in gprd
  - [ ] Configure the dbupgrade user
    - [ ] Disable screen sharing to reduce the risk of exposing the private key
    - [ ] Change to user dbupgrade: `sudo su - dbupgrade`
    - [ ] Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - [ ] `chmod 600 ~/.ssh/id_dbupgrade`
    - [ ] Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - [ ] Repeat the same steps on the target leader (it also has to have the private key)
    - [ ] Re-enable screen sharing if beneficial
  - [ ] Start or resume the tmux session: `tmux a -t pg14 || tmux new -s pg14`
  - [ ] Create an access token with at least `read_repository` scope for the next step
  - [ ] Clone repos:
    ```
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout latest_stable
    ```
  - [ ] Ensure you have the prerequisites installed: `sudo apt install ansible`
  - [ ] Ensure that Ansible can talk to all the hosts in gprd-main:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
    ```
  - [ ] In advance, run pre-checks and upgrade-check, and pre-install packages, to ensure that everything is ready for the upgrade:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-main.yml \
      upgrade.yml -e "pg_old_version=12 pg_new_version=14" \
      --tags "pre-checks, packages, upgrade-check" 2>&1 \
      | ts | tee -a ansible_upgrade_pre_checks_gprd_main_$(date +%Y%m%d).log
    ```
  - [ ] Refresh tmux command and shortcut knowledge: https://tmuxcheatsheet.com/ (a quick reference follows this list). To detach from tmux without stopping it, press Ctrl-b, then d.

You shouldn't see any failed hosts!
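For reference, the tmux operations used throughout this runbook (standard default key bindings):

```bash
tmux new -s pg14        # create a named session
tmux a -t pg14          # attach to / resume the session
# Detach (leave tmux running):  Ctrl-b, then d
# Split pane horizontally:      Ctrl-b, then "
# Switch between panes:         Ctrl-b, then an arrow key
```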
Postgres Upgrade rollout
Pre Postgres upgrade checks
- [ ] 🐺 Coordinator: Check if disallow_database_ddl_feature_flags is ENABLED:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags`
- [ ] 🐺 Coordinator: Check that the underlying DDL migration, partitioning and reindexing features were disabled by disallow_database_ddl_feature_flags:
  - [ ] Open a new Rails console
    - PRODUCTION: URL production.teleport.gitlab.net or tsh:
      ```
      tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
      tsh ssh rails-ro@console-ro-01-sv-gprd
      ```
  - [ ] Paste the script in the console, then run `check`:
    ```ruby
    def output(name, value)
      color = value ? '31' : '32'
      result = value ? 'enabled' : 'disabled'
      puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
    end

    def check
      ActiveRecord::Base.logger = nil
      output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
      output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
      output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
      output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
      output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

      is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
      output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
      output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

      is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
      output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

      is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
      output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
      output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

      is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
      output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

      is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
      output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
      output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
    end

    check # run the checks
    ```
  - [ ] Check the output - all workers/tasks should be disabled, for example:
    ```
    Database::BatchedBackgroundMigration::MainExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiDatabaseWorker is disabled.
    Database::BatchedBackgroundMigrationWorker is disabled.
    Gitlab::Database::Reindexing is disabled.
    BackgroundMigration::CiDatabaseWorker is disabled.
    BackgroundMigrationWorker is disabled.
    rake gitlab:db:execute_async_index_operations is disabled.
    rake gitlab:db:validate_async_constraints is disabled.
    Gitlab::Database::AsyncConstraints is disabled.
    Gitlab::Database::AsyncIndexes is disabled.
    Gitlab::Database::Partitioning#sync_partitions is disabled.
    Gitlab::Database::Partitioning#drop_detached_partitions is disabled.
    ```
- [ ] 🐘 DBRE: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-2004 until the end of the maintenance:
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `25h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-2004"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🐘 DBRE: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-v14 until the end of the maintenance:
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `25h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-v14"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
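If you prefer a CLI over the alerts.gitlab.net UI, the same silence can be expressed with amtool — a hedged sketch, assuming amtool is installed and can reach the Alertmanager behind alerts.gitlab.net:

```bash
# Same matchers as the first silence above, as an amtool invocation.
amtool silence add \
  --alertmanager.url=https://alerts.gitlab.net \
  --duration=25h \
  --comment="PG14 upgrade: production#16266" \
  'env="gprd"' 'type="gprd-patroni-main-2004"' \
  'alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"'
```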
- [ ] 🐘 DBRE: Monitor which pgbouncer pool has connections: Thanos
- [ ] 🐘 DBRE: Check if anyone except the application is connected to the source primary, and interrupt them:
  - [ ] Log in to the source primary: `ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal`
  - [ ] Check all connections that are not gitlab:
    ```
    gitlab-psql -c "
      select pid, client_addr, usename, application_name, backend_type,
             clock_timestamp() - backend_start as connected_ago, state,
             left(query, 200) as query
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and usename <> 'gitlab'
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
  - [ ] If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.
  - [ ] Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```
    gitlab-psql -c "
      select pg_terminate_backend(pid)
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and usename <> 'gitlab'
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
- [ ] 🐘 DBRE: Monitor the Primary Leader and Standby Leader PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:
    ```
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
  - [ ] On the v14 leader, start a loop to terminate autovacuum workers, to unblock concurrent vacuumdb workers attempting to ANALYZE tables after pg_upgrade:
    ```
    while sleep 10; do
      gitlab-psql -XAtc "
        select query, pid, pg_terminate_backend(pid)
        from pg_stat_activity
        where query like 'autovacuum: VACUUM % (to prevent wraparound)'" 2>&1 \
      | ts | sudo tee -a /var/opt/gitlab/autovacuum_terminator_$(date +%Y%m%d).log
    done
    ```
Postgres Upgrade
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
For this part, since each cluster takes 30-40 minutes, we will trigger the upgrades in parallel to save time.
UPGRADE – execute!
- [ ] 🔪 Playbook-Runner: Press Ctrl-b then up to go to the first terminal.
- [ ] 🔪 Playbook-Runner: Run the Ansible playbook for upgrading the gprd-main cluster:
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=12 pg_new_version=14" 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_main_$(date +%Y%m%d).log
  ```
Post Postgres upgrades verification
You can execute the following steps as soon as their respective upgrades in the previous step have finished executing.
- [ ] 🐘 DBRE: Check logical replication lag, and wait for the clusters to get in sync: PG14 Upgrade Dashboard
- [ ] 🐘 DBRE: Ensure the gprd-main cluster is in the desired state.
  - [ ] Log in to node 01: `ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal`
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐘 DBRE: On the v14 leader, stop the monitoring-terminate loop for autovacuum workers - in psql, press Ctrl-C
- [ ] 🐘 DBRE: Trigger a GCS snapshot on the new v14 Patroni main cluster:
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a manual GCS snapshot:
    ```
    sudo su - gitlab-psql
    tmux new -s GCSSnapshot
    /usr/local/bin/gcs-snapshot.sh
    ```
- [ ] 🐬 SRE: Merge the MR that updates the PostgreSQL dirs and binaries references in Chef for patroni-main-v14. First confirm there are no errors in the merge pipeline. If the MR was previously merged and reverted, merge it again.
  - MR for patroni-main-v14: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3939
- [ ] 🐬 SRE: Ensure that the changes merged in the previous step have been deployed to the Chef server before re-enabling Chef, by confirming that the linked master pipeline for ops.gitlab.net completed successfully.
- [ ] 🐬 SRE: Re-enable Chef on all gprd-main nodes: `knife ssh "role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"`
- [ ] 🐬 SRE: Confirm chef-client is enabled on all nodes: thanos link
- [ ] 🐬 SRE: Run chef-client on the Patroni nodes: `knife ssh "role:gprd-base-db-patroni-main-v14" "sudo chef-client"`
  - [ ] Confirm that chef-client ran on all nodes: thanos link
- [ ] 🐬 SRE: Confirm:
  - [ ] No errors while running chef-client (thanos link) and we still have the v14 binary:
    `knife ssh "roles:gprd-base-db-patroni-main-v14" "sudo /usr/lib/postgresql/14/bin/postgres --version"`
    - Output should show `postgres (PostgreSQL) 14.x (Ubuntu 14.x-x.pgdg20.04+1)` for all nodes
  - [ ] The Consul service endpoint db-replica-v14.service.consul. points to the v14 nodes, and the Consul service endpoint db-replica.service.consul. points to the v12 replica nodes (example output below):
    ```
    dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 db-replica-v14.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni-v14.service.consul. SRV +short
    ```
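For reference, each `dig ... SRV +short` call prints one line per backing node in Consul's standard SRV answer format; a hedged illustration (hostnames and datacenter suffix are illustrative only):

```bash
$ dig @127.0.0.1 -p 8600 db-replica-v14.service.consul. SRV +short
# priority weight port target
1 1 5432 patroni-main-v14-103-db-gprd.node.gprd.consul.
1 1 5432 patroni-main-v14-104-db-gprd.node.gprd.consul.
```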
- [ ] 🐬 SRE: Stop Chef on both old and new clusters, on all nodes, before we execute the switchover: `knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"`
- [ ] 🐬 SRE: Confirm chef-client is disabled: thanos link
- [ ] 🐘 DBRE: A KNOWN ISSUE (TODO to improve) - at this point, it is very likely that logical replication is broken, because .pgpass on the target leader has only 1 line again (Chef removed the 2nd line, which is needed to connect to the source leader). Restore it manually by copying the existing line and replacing the leading `localhost` with `*`.
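A minimal sketch of that manual fix, assuming the file lives at `~/.pgpass` of the gitlab-psql user and currently contains exactly one `localhost` line — verify the resulting file by eye before continuing:

```bash
# On the TARGET (v14) leader:
sudo su - gitlab-psql
cp ~/.pgpass ~/.pgpass.bak                # keep a backup of the current file
line=$(grep -m1 '^localhost:' ~/.pgpass)  # the single remaining line
echo "${line/#localhost/*}" >> ~/.pgpass  # same line, host field replaced with *
chmod 600 ~/.pgpass                       # .pgpass must not be group/world readable
```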
- [ ] 🐘 DBRE: Restart the target Patroni cluster nodes if the `gitlab-patronictl list` command shows `Pending restart` is required. A successful cluster restart will display `Success: restart on member` for each cluster member, and a subsequent `gitlab-patronictl list` will no longer show `Pending restart` required:
  ```
  sudo gitlab-patronictl list
  sudo gitlab-patronictl restart gprd-patroni-main-v14 --force
  sudo gitlab-patronictl list
  ```
- [ ] 🐘 DBRE: Check logical replication lag, and wait for the clusters to get in sync: PG14 Upgrade Dashboard
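If the dashboard is unavailable, the lag can also be read directly on the SOURCE (v12) leader — a minimal sketch, assuming the playbook's logical slot is the only logical slot on the primary:

```bash
# WAL bytes the logical subscriber has not yet confirmed, per logical slot.
gitlab-psql -Xc "
  select slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
  where slot_type = 'logical'"
```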
Start data corruption check - pg_amcheck
- [ ] 🐘 DBRE: On the v14 replica nodes only, run pg_amcheck (inside tmux and as a nohup command):
  - [ ] On each replica: `sudo su - gitlab-psql`, then start or resume the tmux session: `tmux a -t pg_amcheck || tmux new -s pg_amcheck`
    ```
    export PGOPTIONS="-c statement_timeout=30min"
    cd /tmp
    nohup time /usr/lib/postgresql/14/bin/pg_amcheck -p 5432 -h localhost -U gitlab-superuser -d gitlabhq_production -j 96 --verbose -P --heapallindexed 2>&1 | tee -a /var/tmp/pg_amcheck.$(date "+%F-%H-%M").log &
    tail -f /var/tmp/pg_amcheck.$(date "+%F-%H-%M").log
    ```
  - [ ] Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck and start it again with a smaller value for -j
  - [ ] IMPORTANT: make sure you are not running pg_amcheck on the v14 Writer/Primary node, as this would cause logical replication lag on the target and spikes of rollbacks and errors
T minus 3 hours (2023-09-09 11:00 UTC)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > We will be undergoing scheduled maintenance to our main database layer in 3 hours. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] ☎ Comms-Handler: Send on Slack #whats-happening-at-gitlab:
  - Message:
    > :loudspeaker: *Postgres upgrade for our database clusters is scheduled for 2023-09-09 between 14:00 UTC and 19:00 UTC* :rocket: This is taking place in 3 hours :hourglass_flowing_sand: :hammer_and_wrench: *What to expect?* GitLab.com will be available but users may experience degraded performance during the maintenance window. If you experience any issues likely related to the upgrade in the next few days after the upgrade, please open an issue and reach the upgrade team in the Slack channel #pg_upgrade. You can follow our issue link on ops.gitlab.net for the upgrade.
- [ ] ☎ Comms-Handler: Share the message from #whats-happening-at-gitlab to the following channels:
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
- [ ] 🐘 DBRE: Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck
T minus 1 hour (2023-09-09 13:00 UTC)
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence alerts on v14 nodes for the duration of the change + 1 hour:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `6h`
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-main-v14.*"`
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence alerts on v12 nodes for 2 weeks:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `341h`
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-main-2004.*"`
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence Sidekiq alerts for the duration of the change + 1 hour:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `6h`
  - Matchers:
    - `env="gprd"`
    - `alertname="SidekiqServiceSidekiqExecutionErrorSLOViolationSingleShard"`
    - `component="sidekiq_execution"`
- [ ] 🔪 Playbook-Runner: Schedule a job to set gitlab_maintenance_mode in a node exporter for the duration of the upgrade window:
  - [ ] SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
  - [ ] Schedule the jobs (`at` reads the command to schedule from stdin):
    ```
    sudo su -
    echo 'echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom' | at -t 202309091400
    echo 'echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom' | at -t 202309091900
    cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
    atq
    ```
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > We will be undergoing scheduled maintenance to our main database layer in 1 hour. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Post to #announcements on Slack:
  - Message:
    > Scheduled maintenance to our main database layer starts in an hour, lasting up to 5 hours, from 14:00 UTC to 19:00 UTC.
- [ ] PRODUCTION ONLY ☁ 🔪 Playbook-Runner: Create a maintenance window in PagerDuty with the following:
  - Which services are affected?
  - Why is this maintenance happening? Performing Postgres cluster upgrades, so silencing the pager.
  - Select "Start at a scheduled time":
    - Timezone: (UTC+00:00) UTC
    - Start: 09/09/2023 | 02:00 PM
    - End: 09/09/2023 | 07:00 PM
- [ ] 🐬 SRE: Check that all needed Chef MRs are rebased and contain the proper changes.
  - [ ] Post-upgrade MR, to change the cluster to use PG14 roles:
    - MR for patroni-main-v14: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3939
  - [ ] Post-switchover MR, to configure Consul and Prometheus:
    - MR for gprd-patroni-main: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3904
- [ ] 🐘 DBRE: Ensure that we have a successful full WAL-G backup that has taken place in the last 24 hours for each cluster: Thanos Graph
  - If you see 2 rows (one for each cluster: MAIN + CI + REGISTRY), then a gap, then another 2 rows, you have a recent (< 24 hours) successful backup. The gap is the period of time when the backups were executing. You can use a timestamp converter to turn the timestamps into human-readable date/time if you want to check when the backup finished.
  - If you currently see no lines (an empty result), it's possible that the backups are still running OR that they have failed, so check the following:
    - Check this Thanos graph to see the start time of the backup job - you should be able to see an increase of the value every time the backup starts (around midnight). If the last increase was more than 24 hours ago, the last backup hasn't started as it should have, and you'll need to investigate why the job failed to start.
    - If the backup job should have finished by now, check this Thanos graph to see the job-failed value for the last time the backup job ran. If the value is > 0 for any time in the past 24 hours, you'll need to investigate why the job failed.
    - The backup job is triggered by crond (user: gitlab-psql); any replica is eligible to run the job, but it only runs on the one that acquires the Consul lock.
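To see how the job is defined on a given replica, one hedged check (the exact crontab entry name may differ from what `grep` matches here):

```bash
# On any replica: list the gitlab-psql crontab and look for the wal-g backup entry.
sudo -u gitlab-psql crontab -l | grep -i backup
```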
- [ ] 🐘 DBRE: On the v14 replica nodes, review the pg_amcheck log files created in the previous steps to find any data corruption errors, and check amcheck progress:
  ```
  egrep 'ERROR:|DETAIL:|LOCATION:' /var/tmp/pg_amcheck.*.log
  cat /var/tmp/pg_amcheck.*.log | grep relations | tail -1
  ```
- [ ] 🐘 DBRE: Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck
Postgres Upgrade Call
These steps will be run in a video call. Changes are made one at a time, and verified before moving on to the next step. All the steps will be executed from a console VM, and we should keep the session shared (tmux, screen, ...).
Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:
- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the call will be recorded. We will consider making it public after confirming that no SAFE data was leaked. If you see something happening that shouldn't be public, mention it.
Roll call
- [ ] 🐺 Coordinator: Mark the change request as /label ~"change::in-progress"
- [ ] 🐺 Coordinator: Ensure everyone mentioned above is on the call
- [ ] 🐺 Coordinator: Ensure the video call room host is on the call
Data Corruption Checks
- [ ] 🐘 DBRE: On the v14 replica nodes, review the pg_amcheck log files created in the previous steps to find any data corruption errors and to get the last status of the progress:
  ```
  egrep 'ERROR:|DETAIL:|LOCATION:' /var/tmp/pg_amcheck.*.log
  cat /var/tmp/pg_amcheck.*.log | grep relations | tail -1
  ```
- [ ] 🐺 Coordinator: If there are any errors that indicate possible data corruption, abort the maintenance and proceed with the partial rollback of the steps already performed
- [ ] 🐘 DBRE: On the v14 replica nodes, kill the pg_amcheck processes:
  ```
  sudo killall pg_amcheck
  ps -ef | grep pg_amcheck
  ```
- [ ] 🐘 DBRE: On the v14 replica nodes, terminate any existing backend processes:
  ```
  sudo gitlab-psql -c "
    select pg_terminate_backend(pid)
    from pg_stat_activity
    where pid <> pg_backend_pid()
      and usename <> 'gitlab'
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
  "
  ```
- [ ] 🐺 Coordinator: [optional] Double-check that no pg_amcheck processes nor queries are running on the v14 replica nodes:
  ```
  ps -ef | grep pg_amcheck
  sudo gitlab-psql -c "
    select pid, usename, application_name, client_addr, substr(query,1,120) as query
    from pg_stat_activity
    where usename <> 'gitlab'
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
  "
  ```
Pre-maintenance Health Checks
- [ ] 🐺 Coordinator: Check if gitlab_maintenance_mode is enabled for gprd (Thanos link)
  - If it is not enabled, ask the 🔪 Playbook-Runner to enable it manually:
    - SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set gitlab_maintenance_mode=1 on the node exporter:
      ```
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] 🐺 Coordinator: Check if disallow_database_ddl_feature_flags is ENABLED:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags`
- [ ] 🐺 Coordinator: Check that the underlying DDL migration, partitioning and reindexing features were disabled by disallow_database_ddl_feature_flags:
  - [ ] Open a new Rails console
    - PRODUCTION: URL production.teleport.gitlab.net or tsh:
      ```
      tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
      tsh ssh rails-ro@console-ro-01-sv-gprd
      ```
  - [ ] Paste the same feature-flag check script from the "Pre Postgres upgrade checks" section above into the console, then run `check`
  - [ ] Check the output - all workers/tasks should be disabled, for example:
    ```
    Database::BatchedBackgroundMigration::MainExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiDatabaseWorker is disabled.
    Database::BatchedBackgroundMigrationWorker is disabled.
    Gitlab::Database::Reindexing is disabled.
    BackgroundMigration::CiDatabaseWorker is disabled.
    BackgroundMigrationWorker is disabled.
    rake gitlab:db:execute_async_index_operations is disabled.
    rake gitlab:db:validate_async_constraints is disabled.
    Gitlab::Database::AsyncConstraints is disabled.
    Gitlab::Database::AsyncIndexes is disabled.
    Gitlab::Database::Partitioning#sync_partitions is disabled.
    Gitlab::Database::Partitioning#drop_detached_partitions is disabled.
    ```
- [ ] 🐺 Coordinator: Ensure that there are no active critical alerts (S1) or open incidents:
  - PRODUCTION:
- [ ] 🐺 Coordinator: Check Sentry for errors that might indicate database problems: Production Sentry
- [ ] 🐘 DBRE: Ensure writes are happening on the Postgres/Patroni nodes in gprd-main: Thanos
- [ ] 🐘 DBRE: Check the Prometheus sanity-check metrics, verifying reads are all going to the correct hosts:
  - Index reads
    - Expected result: all queries going to the patroni-main-2004 cluster.
  - Sequential scans
    - Expected result: all queries going to the patroni-main-2004 cluster.
Terminals
You should already be in a tmux session. Only if you are planning to upgrade two clusters at the same time, open a second pane so that we have one terminal for each cluster.

- Press Ctrl-b then " in your existing tmux terminal to open a new pane, split horizontally.
- You can move between the panes by pressing Ctrl-b then the up or down arrow.
- [ ] 🔪 Playbook-Runner: Verify that the Ansible inventory is up to date and reflects the real state of the cluster.
  - [ ] Once again, ensure that Ansible can talk to all the hosts in gprd-main:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
    ```
    You shouldn't see any failed hosts!
T minus zero (2023-09-09 14:00 UTC)
We expect the maintenance window to last for up to 5 hours, starting from now.
Pre Switchover checks (T plus 0 min)
- [ ] 🐘 DBRE: Monitor which pgbouncer pool has connections: Thanos
- [ ] 🐘 DBRE: Monitor the Primary Leader and Standby Leader PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:
    ```
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐬 SRE: Confirm chef-client is disabled: thanos link
Evaluation of QA/Validations results - Commitment

If QA/Validations have succeeded, we can continue to "Complete the Upgrade and Switchover to v14". If any QA/Validation has failed, the team evaluates against the following commitment criteria:

Goals:

- The top priority is to maintain data integrity. Rolling back after the maintenance window has ended is very difficult, and will result in any changes made in the interim being lost.
- Failures with an unknown cause should be investigated further. If we can't determine the root cause within the maintenance window, we should roll back.

Postgres Upgrade - Switchover (T plus ~30 mins)

Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
SWITCHOVER – execute!
- [ ] 🔪 Playbook-Runner: Press Ctrl-b then up to go to the first terminal.
- [ ] 🔪 Playbook-Runner: Run the Ansible playbook to switch over the gprd-main cluster (it is interactive; reply "y" three times):
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover.yml 2>&1 \
    | ts | tee -a ansible_switchover_gprd_main_$(date +%Y%m%d).log
  ```
  - switchover.yml asks to confirm steps 3 times (type "y"). After each step, the DBRE has to verify traffic to the proper nodes, lack of errors, and DB latencies, and confirm the decision to continue.
- [ ] 🔪 Playbook-Runner: First "y": start R/O traffic to the new v14 replicas
- [ ] 🐬 SRE: After the 1st YES, the service db-replica.service.consul. should be pointing to both v12 and v14 replica nodes:
  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🐘 DBRE: Check read-only activity metrics for 15 minutes: PG14 Upgrade Dashboard
  - [ ] Compare the volume of standby TPS (commits) between Target and Source (the workload should split 50/50 between old and new replicas)
  - [ ] Compare the volume of rollback TPS - ERRORS
- [ ] 🐺 Coordinator: Wait for the hourly Write TPS spike to finish (around 18m past the hour)
- [ ] 🔪 Playbook-Runner: Second "y": stop R/O traffic to the old replicas
- [ ] 🐬 SRE: After the 2nd YES, the service db-replica.service.consul. should be pointing only to the new v14 replica nodes:
  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🐘 DBRE: Check the metrics for as long as we observe connections to the SOURCE standbys, for a minimum of 15 minutes. (This does not block the 🏆 Quality tests.)
- [ ] 🏆 Quality: Trigger the Smoke E2E suite against the environment that was upgraded. Production: "Four hourly smoke tests". This has an estimated duration of 15 minutes.
- [ ] 🏆 Quality: If the smoke tests fail, Quality should re-run the failed job to see if it is reproducible. In parallel, a 15-minute window to do an initial triage of the failure will be allotted. If Quality cannot determine that the failure is 'unrelated' within that period - stop and reschedule the whole procedure.
- [ ] 🐺 Coordinator: This is the point of no return! We will not execute a rollback after this point! Proceed wisely!
  - [ ] Get agreement of peers and consent of the 🎩 Head Honcho to proceed
- [ ] 🐺 Coordinator: Wait for the hourly Write TPS spike to finish (around 18m past the hour)
  - [ ] Only proceed with the R/W traffic primary switchover when logical replication lag < 500 MiB
- [ ] 🔪 Playbook-Runner: Third "y": R/W traffic, primary switchover.
- [ ] 🐬 SRE: After the 3rd YES, the master service master.patroni.service.consul. should be pointing only to the new v14 Leader/Writer node:
  ```
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🔪 Playbook-Runner: If the first "y" fails, repeat in "forced" mode:
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover.yml -e "force_mode=true" 2>&1 \
    | ts | tee -a ansible_switchover_gprd_main_$(date +%Y%m%d)_FORCE_MODE.log
  ```
  - switchover.yml asks to confirm steps 3 times (type "y"). After each step, the DBRE has to verify traffic to the proper nodes, lack of errors, and DB latencies, and confirm the decision to continue.
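While waiting between the three confirmations above, it can help to keep the Consul answers refreshing on screen — a minimal sketch using watch:

```bash
# Re-run the two SRV lookups every 5 seconds while traffic is being moved.
watch -n 5 '
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
'
```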
Post Postgres Switchover verification
You can execute the following steps as soon as their respective upgrades in the previous step have finished executing.

- [ ] 🐘 DBRE: Ensure the main cluster is in the desired state.
  - [ ] Log in to node 01: `ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal`
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐘 DBRE: On the v14 leader, stop the monitoring-terminate loop for autovacuum workers - in psql, press Ctrl-C.
- [ ] 🏆 Quality: Trigger the Smoke E2E suite against the environment that was upgraded. Production: "Four hourly smoke tests"
Metrics sanity check after switchover to v14
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > GitLab.com planned maintenance for the database layer is almost complete. We're continuing to verify that all systems are functioning correctly. Thank you for your patience.
- [ ] 🐬 SRE: Merge the MR that updates the main Teleport DB endpoint, the MR that updates the console config endpoint, and the MR that updates the source snapshot of the main-data-analytics DB. First confirm there are no errors in the merge pipeline. If an MR was previously merged and reverted, merge it again.
  - [ ] MR for main teleport DB endpoint: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2790 (merged)
  - [ ] MR for main console config endpoint: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3916
  - [ ] MR for main snapshot image: https://gitlab.com/gitlab-com/gl-infra/data-server-rebuild-ansible/-/merge_requests/54
- [ ] 🐘 DBRE: Ensure writes are happening on the Postgres/Patroni nodes in gprd-main: Thanos
- [ ] 🐘 DBRE: Check the Prometheus sanity-check metrics, verifying reads are all going to the correct hosts:
  - Index reads
    - Expected result: all queries going to the patroni-main-v14 cluster.
  - Sequential scans
    - Expected result: all queries going to the patroni-main-v14 cluster.
- [ ] 🐘 DBRE: Check Sentry for errors that might indicate database problems: Production Sentry
- [ ] 🐬 SRE: Merge the MR that updates Consul and Prometheus. First confirm there are no errors in the merge pipeline. If the MR was previously merged and reverted, merge it again.
  - MR for gprd-patroni-main: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3904
- [ ] 🐬 SRE: Ensure that the changes merged in the previous step have been deployed to the Chef server before re-enabling Chef, by confirming that the linked master pipeline for ops.gitlab.net completed successfully.
- [ ] 🐬 SRE: Start cron.service on all gprd-main nodes:
  ```
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl is-active cron.service"
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl start cron.service"
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl is-active cron.service"
  ```
- [ ] 🐬 SRE: Re-enable Chef on all gprd-main nodes: `knife ssh "role:gprd-base-db-patroni-main-2004 OR role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"`
- [ ] 🐬 SRE: Confirm Chef is enabled on all nodes: thanos link
- [ ] 🐬 SRE: Run chef-client on the Patroni nodes: `knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client"`
- [ ] 🐬 SRE: Confirm:
  - [ ] No errors while running chef-client (thanos link) and we still have the v14 binary:
    `knife ssh "roles:gprd-base-db-patroni-main-v14" "sudo /usr/lib/postgresql/14/bin/postgres --version"`
  - [ ] The service db-replica.service.consul. should be pointing only to the new v14 replica nodes, and the master service master.patroni.service.consul. should be pointing only to the new v14 Leader/Writer node:
    ```
    dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
    ```
Communicate
- [ ] 🐺 Coordinator: TODO Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - [ ] Click "Finish Maintenance" and send the following:
    - Message:
      > GitLab.com planned maintenance for the database layer is complete. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: In the same thread from the earlier post, post the following message and tick the checkbox "Also send to X channel" so the threaded message is published to the channel:
  - Message:
    > :done: *GitLab.com database layer maintenance upgrade is complete now.* :celebrate: We'll continue to monitor the platform to ensure all systems are functioning correctly.
  - [ ] #whats-happening-at-gitlab
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send a message to #social_media_action to unpin the maintenance tweet on @gitlab Twitter:
  - Message:
    > Hi team :wave:, the maintenance upgrade is complete now, you may unpin the maintenance tweet on GitLab Twitter.
Complete the Upgrade (T plus ~290 mins)

Verification

- [ ] Start Post Switchover to v14 QA
- [ ] 🏆 Quality: Trigger the Smoke and Full E2E suites against the environment that was upgraded. Production: "Four hourly smoke tests" and "Twice daily full run"
Wrapping up

- [ ] PRODUCTION ONLY 🔪 Playbook-Runner: If the scheduled maintenance is still active in PagerDuty, click on Update then End Now.
- [ ] 🔪 Playbook-Runner: Remove the silences on fqdn=~"patroni-main-v14.*" that we created during this process from https://alerts.gitlab.net (Important: don't remove the silence on the SOURCE nodes)
- [ ] 🔪 Playbook-Runner: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-2004 for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `336h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-2004"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🐘 DBRE: Initiate a full backup (using wal-g) and trigger a GCS snapshot on the new v14 Patroni main cluster:
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a wal-g backup:
    ```
    sudo su - gitlab-psql
    tmux new -s PGBasebackup
    nohup /opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1 &
    ```
  - [ ] Open another SSH session to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a manual GCS snapshot:
    ```
    sudo su - gitlab-psql
    tmux new -s GCSSnapshot
    /usr/local/bin/gcs-snapshot.sh
    ```
- [ ] 🏆 Quality: Quality team (after an hour): Check that the Smoke and Full E2E suites have passed.
- [ ] 🏆 Quality: Trigger the smoke tests one more time, now that Chef has had time to run.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Notify our customers over the Slack channel that the Postgres upgrade finished, and request that they validate GitLab.com.
- [ ] PRODUCTION ONLY 🐘 DBRE: Ping @NikolayS on the gitlab.com CR (production#16266 (closed)) (and/or in #database-lab) that the work is complete so he can update the DLE environment.
- [ ] PRODUCTION ONLY 🐬 SRE: Confirm that Teleport access works properly and we're hitting the right clusters:
  ```
  # Log in to Teleport
  tsh login --add-keys-to-agent=no --proxy=production.teleport.gitlab.net --request-roles=database-ro-gprd --request-reason="Testing DB connectivity after PG14 Upgrade - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"

  query="select setting from pg_settings where name='cluster_name'"

  # Check MAIN
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-main-primary-gprd --add-keys-to-agent=no
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-main-replica-gprd --add-keys-to-agent=no

  # Check CI
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-ci-primary-gprd --add-keys-to-agent=no
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-ci-replica-gprd --add-keys-to-agent=no
  ```
  You should see v14 as part of the cluster name.
- [ ] 🐘 DBRE: Update the wal-g daily restore schedule for the [gprd] - [main] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - [ ] Change the following variables:
    - `PSQL_VERSION = 14`
    - `BACKUP_PATH = ?` (? = use the "directory" from the new v14 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gprd-postgres-backup/pitr-walg-main-v14)
- [ ] 🐘 DBRE: Enable the feature flags by typing the following into #production:
  - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags false`
- [ ] 🐘 DBRE: Inform the database team that the CR is completed and that the background migrations and reindexing feature flags have been re-enabled. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  > Hi @gl-database, Please note that we have completed the work for this CR in the `gprd` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in `PRODUCTION`. Could you please confirm that they have been re-enabled correctly? Thanks!
- [ ] 🐬 SRE: Check that the underlying DDL features were ENABLED again by disabling disallow_database_ddl_feature_flags:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED
  - [ ] On the Rails console:
    - [ ] Open a new Rails console
      - PRODUCTION: URL production.teleport.gitlab.net or tsh:
        ```
        tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
        tsh ssh rails-ro@console-ro-01-sv-gprd
        ```
    - [ ] Paste the same feature-flag check script from the "Pre Postgres upgrade checks" section above into the console, then run `check`
    - [ ] Check the output - all workers/tasks should be enabled, for example:
      ```
      Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
      Database::BatchedBackgroundMigrationWorker is enabled.
      Gitlab::Database::Reindexing is enabled.
      BackgroundMigration::CiDatabaseWorker is enabled.
      BackgroundMigrationWorker is enabled.
      rake gitlab:db:execute_async_index_operations is enabled.
      rake gitlab:db:validate_async_constraints is enabled.
      Gitlab::Database::AsyncConstraints is enabled.
      Gitlab::Database::AsyncIndexes is enabled.
      Gitlab::Database::Partitioning#sync_partitions is enabled.
      Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
      ```
- [ ] **🐬 SRE**: We have a separate issue to rebuild each cluster's DR Archive and Delayed replicas. We will use the following issue link to track rebuilding the main cluster's DR Archive and Delayed replicas from the most recent v14 database backup of the main cluster. This should be completed within the next couple of working days. TODO add links for gprd
- [ ] **🐘 DBRE**: Logical replication from `SOURCE` to `TARGET` should have been destroyed by the switchover; check, and destroy it if it was not (see the note after this step on why the drop is done in three statements):
  - [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the subscription (if it still exists) for logical replication:
    - [ ] Check if the subscription still exists:

      ```shell
      gitlab-psql \
        -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
      ```

    - [ ] Drop the logical replication subscription:

      ```shell
      gitlab-psql \
        -Xc "alter subscription logical_subscription disable" \
        -Xc "alter subscription logical_subscription set (slot_name = none)" \
        -Xc "drop subscription logical_subscription"
      ```
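> Note: the three-statement sequence (disable, detach the slot with `slot_name = none`, then drop) is deliberate. A plain `DROP SUBSCRIPTION` also tries to drop the associated replication slot on the publisher, which fails if the publisher is unreachable or already decommissioned. Detaching the slot first lets the subscription be dropped locally; the slot itself is then removed on the publisher side in the next step.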
  - [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the publication and logical_replication_slot:
    - [ ] Check if the publication and replication slots still exist:

      ```shell
      gitlab-psql \
        -Xc "select pubname from pg_publication" \
        -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
      ```

    - [ ] Drop the publication, the replication slot, and the test tables:

      ```shell
      gitlab-psql \
        -Xc "drop publication logical_replication" \
        -Xc "select pg_drop_replication_slot('logical_replication_slot') from pg_replication_slots where slot_name = 'logical_replication_slot'" \
        -Xc "drop table if exists test_publication" \
        -Xc "drop table if exists test_replication"
      ```
### Reverse replication validation

- [ ] **🐺 Coordinator**: Coordinate with **🐘 DBRE** to make sure we stay in the current state for an hour and continue to run reverse replication from `v14` to `v12`, per https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23986#note_1524934967
- [ ] **🐘 DBRE**: TODO @NikolayS: add steps to validate that reverse replication is working as expected (an interim sketch follows this step).
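Until the validation steps above are added, one possible interim sanity check is to watch the lag of the logical slot feeding v12 on the v14 Leader/Writer. This is a sketch only; it assumes the reverse publication uses the `reverse_logical_replication_slot` slot created earlier in this rollout:

```shell
# On the TARGET (v14) Leader/Writer: lag of the logical slot replicating to v12.
# The slot should be active, and the lag should stay small and keep draining.
gitlab-psql \
  -Xc "select slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag from pg_replication_slots where slot_type = 'logical'"
```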
- [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the subscription (if it still exists) for reverse replication:
  - [ ] Check if the subscription still exists:

    ```shell
    gitlab-psql \
      -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
    ```

  - [ ] Drop the reverse logical replication subscription:

    ```shell
    gitlab-psql \
      -Xc "alter subscription reverse_logical_subscription disable" \
      -Xc "alter subscription reverse_logical_subscription set (slot_name = none)" \
      -Xc "drop subscription reverse_logical_subscription"
    ```
- [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the publication and reverse_logical_replication_slot for reverse replication:
  - [ ] Check if the publication and replication slots still exist:

    ```shell
    gitlab-psql \
      -Xc "select pubname from pg_publication" \
      -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
    ```

  - [ ] Drop the publication, the replication slot, and the test tables:

    ```shell
    gitlab-psql \
      -Xc "drop publication reverse_logical_replication" \
      -Xc "select pg_drop_replication_slot('reverse_logical_replication_slot') from pg_replication_slots where slot_name = 'reverse_logical_replication_slot'" \
      -Xc "drop table if exists test_publication" \
      -Xc "drop table if exists test_replication"
    ```
- [ ] **🐘 DBRE**: Shut down the SOURCE gprd-base-db-patroni-main-2004 cluster to avoid any risk of split-brain (a verification sketch follows):

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl stop patroni"
  ```
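To confirm the cluster is actually down, a hedged sketch using the same knife pattern as above (`systemctl is-active` exits non-zero for a stopped unit, hence the `|| true` so knife does not flag the command as failed):

```shell
# Every node should report "inactive".
knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl is-active patroni || true"
```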
- [ ] **🐺 Coordinator**: Check that gitlab_maintenance_mode is DISABLED for gprd (Thanos link)
  - If it is not disabled, ask **🔪 Playbook-Runner** to manually disable it (a verification sketch follows this list):
    - SSH to a console VM in `gprd` (eg. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set `gitlab_maintenance_mode=0` on the node exporter:

      ```shell
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
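To verify the metric was picked up before it shows in Thanos, one option (a sketch, assuming node_exporter on the console VM listens on its default port 9100) is to scrape it locally:

```shell
# The textfile collector should now expose the updated value.
curl -s http://localhost:9100/metrics | grep gitlab_maintenance_mode
# Expected: gitlab_maintenance_mode 0
```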
- [ ] **🐺 Coordinator**: Mark the change request as complete: `/label ~"change::complete"`
## Rollback (if required)
- [ ] **PRODUCTION ONLY** **📣 CMOC**: Post an update via the Status.io maintenance site and publish on `@gitlabstatus`. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:

    > Due to an issue during the planned maintenance for the database layer, we have initiated a rollback of the changes. We will provide an update once the rollback process is completed.
- [ ] **PRODUCTION ONLY** **☎ Comms-Handler**: Ask the IMOC or the Head Honcho if this message should be sent to any Slack rooms:
  - [ ] `#whats-happening-at-gitlab`
  - [ ] `#infrastructure-lounge` (cc `@sre-oncall`)
  - [ ] `#g_delivery` (cc `@release-managers`)
  - [ ] `#community-relations`
### Health check

- [ ] **🐺 Coordinator**: Ensure that there are no active critical alerts or open incidents:
  - PRODUCTION:
- [ ] **🔪 Playbook-Runner**: Verify that the Ansible inventory is up to date and reflects the real state of the cluster.
### Rollback Postgres Upgrade

- After the switchover there is NO reverse replication replicating data from PG14 to PG12!
- After enabling site traffic on the new cluster, all new changes to the database will exist only on the new cluster.
- There will be no rollback after the switchover!
- [ ] **🐘 DBRE**: Check if rollback is possible (a verification sketch follows this step):
  - [ ] We have not switched over
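A quick way to confirm which side is currently writable, sketched using the console conventions from this issue (`pg_is_in_recovery()` returns `f` only on a writable primary):

```shell
# On a node of each cluster: list the members, then check the recovery state locally.
sudo gitlab-patronictl list
sudo gitlab-psql -Xc "select pg_is_in_recovery()"
```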
- [ ] **🐘 DBRE**: Monitor which pgbouncer pool has connections (Thanos)
- [ ] **🐘 DBRE**: Start patroni on the source gprd-base-db-patroni-main-2004 cluster nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl start patroni"
  ```
- [ ] **🐘 DBRE**: TODO @vitabaks @NikolayS Disable "read only" flag on v12 cluster:
- [ ] **🐘 DBRE**: Monitor the v14 and v12 PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:

    ```shell
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```

  - [ ] Get the leader of each cluster:

    ```shell
    sudo gitlab-patronictl list
    ```

  - [ ] Connect via SSH to the previously identified leaders.
  - [ ] Tail the Postgres logs (an error-only filter is sketched after this list):

    ```shell
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
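When log volume is high, it may help to narrow the tail to problems only. A sketch, relying on the severity strings that Postgres writes into its CSV log lines:

```shell
sudo tail -f /var/log/gitlab/postgresql/postgresql.csv | grep -iE "error|fatal|panic"
```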
### ROLLBACK – execute!

Goal: Set the gprd-main v12 cluster as the primary cluster
- [ ] **🔪 Playbook-Runner**: Execute the `switchover_rollback.yml` playbook to roll back to the v12 cluster (a non-destructive preview is sketched below):

  ```shell
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover_rollback.yml -e "force_mode=true" 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gprd_main_$(date +%Y%m%d).log
  ```
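As a sanity check (which can also be run before the real execution), the play's scope can be previewed without making any changes. A sketch assuming the same checkout and inventory as above:

```shell
cd ~/src/db-migration/pg-upgrade-logical
# List the hosts and tasks the play would touch, without executing anything.
ansible-playbook -i inventory/gprd-main.yml switchover_rollback.yml --list-hosts --list-tasks
```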
- [ ] **🐘 DBRE**: Ensure writes are happening on Postgres/Patroni nodes in patroni-main-2004: Thanos
- [ ] **🐘 DBRE**: Check the Prometheus sanity-check metrics, confirming reads are all going to the correct hosts (a local cross-check is sketched after this list):
  - [ ] Index reads
    - Expected result: all queries going to the `patroni-main-2004` cluster.
  - [ ] Sequential scans
    - Expected result: all queries going to the `patroni-main-2004` cluster.
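As a local cross-check of the dashboard numbers, a sketch using the cumulative counters in `pg_stat_user_tables`; run it twice a minute apart on a replica in each cluster, and only the patroni-main-2004 cluster should show the values increasing:

```shell
sudo gitlab-psql \
  -Xc "select sum(idx_tup_fetch) as index_reads, sum(seq_tup_read) as seq_reads from pg_stat_user_tables"
```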
### Complete the rollback

- [ ] **🏆 Quality**: Confirm that our smoke tests are still passing (continue the rollback, as this might take an hour...)
- [ ] **🐬 SRE**: Revert the MR that changes Consul and Prometheus, if it was merged.
- [ ] **🐬 SRE**: Re-enable Chef on all gprd-main nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004 OR role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"
  ```
- [ ] **🐬 SRE**: Confirm chef-client is enabled on all nodes (thanos link)
- [ ] **🐬 SRE**: Run chef-client on the Patroni nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client"
  ```

- [ ] **🐬 SRE**: Confirm there are no errors while running chef-client (thanos link)
- [ ] **PRODUCTION ONLY** **📣 CMOC**: Post an update via the Status.io maintenance site and publish on `@gitlabstatus`. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - [ ] Click "Finish Maintenance" and send the following:
    - Message:

      > GitLab.com rollback for the database layer is complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- [ ] **PRODUCTION ONLY** **☎ Comms-Handler**: Send the following message to the Slack rooms listed below:

  > GitLab.com rollback for the database layer is complete and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.

  - [ ] `#whats-happening-at-gitlab`
  - [ ] `#infrastructure-lounge` (cc `@sre-oncall`)
  - [ ] `#g_delivery` (cc `@release-managers`)
- [ ] **🐘 DBRE**: Enable feature flags by typing the following into `#production`:
  - PRODUCTION:
    - [ ] `/chatops run feature set disallow_database_ddl_feature_flags false`
- [ ] **🐘 DBRE**: Inform the database team that the CR was rolled back and that the background migration and reindexing feature flags have been re-enabled. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  - PRODUCTION:

    > Hi @gl-database, please note that we have rolled back the work for this CR in the `gprd` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in `PRODUCTION`. Could you please confirm that they have been re-enabled correctly? Thanks!
- [ ] **🐬 SRE**: Check that the underlying DDL features came back after disabling `disallow_database_ddl_feature_flags`:
  - [ ] On Slack, `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED.
  - [ ] On a Rails console:
    - [ ] Open a new Rails console.
      - PRODUCTION: URL production.teleport.gitlab.net, or via tsh:

        ```shell
        tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
        tsh ssh rails-ro@console-ro-01-sv-gprd
        ```

    - [ ] Paste the script into the console:

      ```ruby
      def output(name, value)
        color = value ? '31' : '32'
        result = value ? 'enabled' : 'disabled'
        puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
      end

      def check
        ActiveRecord::Base.logger = nil
        output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
        output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
        output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
        output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
        output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

        is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
        output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
        output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

        is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
        output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

        is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
        output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
        output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

        is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
        output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

        is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
        output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
        output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
      end

      check # run the checks and print one line per worker/task
      ```

    - [ ] Check the output. All workers/tasks should be enabled, for example:

      ```
      Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
      Database::BatchedBackgroundMigrationWorker is enabled.
      Gitlab::Database::Reindexing is enabled.
      BackgroundMigration::CiDatabaseWorker is enabled.
      BackgroundMigrationWorker is enabled.
      rake gitlab:db:execute_async_index_operations is enabled.
      rake gitlab:db:validate_async_constraints is enabled.
      Gitlab::Database::AsyncConstraints is enabled.
      Gitlab::Database::AsyncIndexes is enabled.
      Gitlab::Database::Partitioning#sync_partitions is enabled.
      Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
      ```
- [ ] **🔪 Playbook-Runner**: On two nodes, console and target leader, remove the private keys temporarily placed in `~dbupgrade/.ssh`:

  ```shell
  rm ~dbupgrade/.ssh/id_rsa
  rm ~dbupgrade/.ssh/id_dbupgrade
  ```
- [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the subscription (if it still exists) for logical replication:
  - [ ] Check if the subscription still exists:

    ```shell
    gitlab-psql \
      -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
    ```

  - [ ] Drop the logical replication subscription:

    ```shell
    gitlab-psql \
      -Xc "alter subscription logical_subscription disable" \
      -Xc "alter subscription logical_subscription set (slot_name = none)" \
      -Xc "drop subscription logical_subscription"
    ```
- [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the publication and logical_replication_slot:
  - [ ] Check if the publication and replication slots still exist:

    ```shell
    gitlab-psql \
      -Xc "select pubname from pg_publication" \
      -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
    ```

  - [ ] Drop the publication, the replication slot, and the test tables:

    ```shell
    gitlab-psql \
      -Xc "drop publication logical_replication" \
      -Xc "select pg_drop_replication_slot('logical_replication_slot') from pg_replication_slots where slot_name = 'logical_replication_slot'" \
      -Xc "drop table if exists test_publication" \
      -Xc "drop table if exists test_replication"
    ```
- [ ] **🐘 DBRE**: ADD the following silences at https://alerts.gitlab.net to silence `WALGBaseBackup` alerts in patroni-main-v14 for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `336h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-v14"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
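If the alerts.gitlab.net UI is unavailable, the same silence can in principle be created with `amtool`. A sketch only; it assumes amtool is installed and authorized against the Alertmanager behind alerts.gitlab.net (the exact API URL may differ), and the start time defaults to now:

```shell
amtool silence add \
  --alertmanager.url=https://alerts.gitlab.net \
  --duration=336h \
  --comment="PG14 upgrade: silence WAL-G base backup alerts on patroni-main-v14 (production#16266)" \
  env="gprd" type="gprd-patroni-main-v14" 'alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"'
```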
- [ ] **🐘 DBRE**: UPDATE the following silence at https://alerts.gitlab.net to silence alerts on v14 nodes for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `341h`
  - Matchers:
    - PRODUCTION:
      - `env="gprd"`
      - `fqdn=~"patroni-main-v14.*"`
- [ ] **🐘 DBRE**: DELETE the following silences at https://alerts.gitlab.net:
  - Matchers:
    - PRODUCTION:
      - `env="gprd"`
      - `fqdn=~"patroni-main-2004.*"`
- [ ] **🐺 Coordinator**: Check that gitlab_maintenance_mode is DISABLED for gprd (Thanos link)
  - If it is not disabled, ask **🔪 Playbook-Runner** to manually disable it:
    - SSH to a console VM in `gprd` (eg. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set `gitlab_maintenance_mode=0` on the node exporter:

      ```shell
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] **🐺 Coordinator**: Mark the production#16266 (closed) change request as aborted: `/label ~"change::aborted"`
## Extra details

### In case the Playbook-Runner is disconnected

As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in several ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user (a note on the tmux environment steps follows the list).
1. `ssh host`
2. Add your public SSH key to `/home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys`
3. `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266` so that we don't override the above
4. `ssh -A PREVIOUS_PLAYBOOK_USERNAME@host`
5. `echo $SSH_AUTH_SOCK`
6. `tmux attach -t 0`
7. `export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>`
8. `<ctrl-b> :` then `set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>`
9. `export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME`
10. `<ctrl-b> :` then `set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>`
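The two `set-environment -g` steps matter because `export` only fixes the shell in the current pane: `set-environment -g` updates tmux's global session environment, so any new window or pane opened during the takeover also inherits the recovered `SSH_AUTH_SOCK` and `ANSIBLE_REMOTE_USER` values.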