Postgres Upgrade Rollout Team
| Role | Assigned To |
|---|---|
|  | @rhenchen |
|  | @bshah |
|  | @kwanyangu |
|  | @alexander-sosna |
|  | @anganga |
|  | @acunskis @ddavison |
|  | @kwanyangu |
|  | @cmarais @alejguer |
|  | @anganga @msmiley |
|  | @kwanyangu |
Link to gitlab.com CR: production#16266 (closed)
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
| App | Direct Link |
|---|---|
| Slack | #g_infra_database_reliability |
| Video Call | https://gitlab.zoom.us/j/97279198952?pwd=eStENnFtK3UxRFFoNU5wT0xFR2JHdz09 |
Immediately
Perform these steps when the issue is created.
- [ ] 🐺 Coordinator: Fill out the names of the rollout team in the table above.
Support Options
| Provider | Plan | Details | Create Ticket |
|---|---|---|---|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Entry points
| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes |
|---|---|---|---|---|---|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
Production
- rails: `ssh $USER-rails@console-01-sv-gprd.c.gitlab-production.internal`
- main db replica: `ssh $USER-db@console-01-sv-gprd.c.gitlab-production.internal`
- main db primary: `ssh $USER-db-primary@console-01-sv-gprd.c.gitlab-production.internal`
- ci db replica: `ssh $USER-db-ci@console-01-sv-gprd.c.gitlab-production.internal`
- ci db primary: `ssh $USER-db-ci-primary@console-01-sv-gprd.c.gitlab-production.internal`
- main db psql: `ssh -t patroni-main-2004-04-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
- ci db psql: `ssh -t patroni-ci-2004-05-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
- registry db psql: `ssh -t patroni-v12-registry-01-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
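For example, to confirm which host and role a psql console session landed on — a minimal sketch; the SQL is standard PostgreSQL, nothing here is specific to this runbook:

```bash
# Open psql on the main db psql entry point from the list above...
ssh -t patroni-main-2004-04-db-gprd.c.gitlab-production.internal sudo gitlab-psql
# ...then, inside psql, confirm the server address and whether it is a replica:
#   select inet_server_addr(), pg_is_in_recovery();
```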
Dashboards and debugging
These dashboards might be useful during the rollout:
Production
- PostgreSQL replication overview
- Triage overview
- Sidekiq overview
- Sentry (includes application errors):
- Workhorse: https://sentry.gitlab.net/gitlab/gitlab-workhorse-gitlabcom/
- Rails (backend): https://sentry.gitlab.net/gitlab/gitlabcom/
- Rails (frontend): https://sentry.gitlab.net/gitlab/gitlabcom-clientside/
- Gitaly (golang): https://sentry.gitlab.net/gitlab/gitaly-production/
- Gitaly (ruby): https://sentry.gitlab.net/gitlab/gitlabcom-gitaly-ruby/
- Logs (Kibana)
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Upgrade, Switchover & Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
High level overview
This section gives a high-level overview of the procedure.
Upgrade Flowchart
```mermaid
flowchart TB
  subgraph Prepare new environment
    A[Create new cluster $TARGET as a carbon copy of the one to upgrade, $SOURCE.] --> B
    B[Attach $TARGET as a standby-only cluster to $SOURCE via physical replication.] --> C
  end
  C[Make sure both clusters are in sync.] --> D1
  subgraph Upgrade["Upgrade: ansible-playbook upgrade.yml"]
    D1[Disable Chef] --> D
    D[Change from physical replication to logical.] --> E
    E[Make sure both clusters are in sync again.] --> G
  end
  G[Upgrade $TARGET to new version via pg_upgrade.] --> H
  subgraph Prepare switchover
    H[Make sure both clusters are in sync again.] --> I
    I[Merge Chef MRs so $TARGET uses roles for new PostgreSQL version] --> K
    K[Enable Chef, run chef-client] --> L
    L[Make sure Chef finished successfully and cluster is still operational] --> M
    M[Disable Chef again] --> N
  end
  N[Check metrics and sanity checks are as expected] --> O
  subgraph Switchover["Switchover: ansible-playbook switchover.yml"]
    O[Redirect RO traffic to $TARGET standbys in addition to $SOURCE] --> P
    P[Check if cluster is operational and metrics are normal] --"Normal"--> Q
    P --"Abnormal"--> GR
    Q[Redirect RO only to $TARGET] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[Quality team verify their tests run as expected] --"Normal"--> T
    S --"Abnormal"--> GR
  end
  T["Switchover: Redirect RW traffic to $TARGET"] --> U1
  subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run chef-client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
  end
  subgraph GR[Graceful Rollback - no data loss]
    GR1[Start graceful rollback]
  end
  subgraph LR[Fix forward]
    LR1[Fix all issues] --> LR2
    LR2[Return to last failed step]
  end
```
Sketches of the upgrade.yml actions can be found here: upgrade.yml.pdf
Prep Tasks
T minus 1 week (2023-09-02 14:00 UTC)

- [ ] ☎ Comms-Handler: Discuss scheduling of this CR and assess impact on deployments and releases with the Release Managers (@release-managers in Slack). Ask them to comment with approval on this issue.
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > Next week, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
T minus 3 days (2023-09-06 14:00 UTC)
- [ ] ☎ Comms-Handler: Coordinate with @release-managers to make sure deployments have been paused until the maintenance window ends. Kindly ask them to comment with approval on this issue.
  - Message:
    > Hi @release-managers :wave:, we would like to make sure that deployments have been stopped for the affected environments until 2023-09-09 19:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind as to comment with approval on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266. Be aware that on the first working day after the upgrade, performance is closely monitored and performance tuning might be required. Therefore, please continue to hold database migrations until the following Tuesday. :bow:
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > In 3 days, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send the tweet link from @gitlabstatus to the #social_media_action channel on Slack.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send on Slack #whats-happening-at-gitlab:
  - Message:
    > :loudspeaker: *Postgres upgrade for our main database clusters is scheduled for 2023-09-09 between 14:00 UTC and 19:00 UTC* :rocket: Taking place in 3 days' time! See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266> :hammer_and_wrench: *What to expect?* GitLab.com will be available but users may experience degraded performance during the maintenance window. If you experience any issues likely related to the upgrade in the next few days after the upgrade, please open an issue and reach the upgrade team in the Slack channel `#pg_upgrade`
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Share the message from #whats-happening-at-gitlab to the following channels:
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
  - [ ] #community-relations (inform the Marketing team)
  - [ ] #support_gitlab-com (inform the Support SaaS team)
  - [ ] Share with each team a link to the change request regarding the maintenance
- [ ] 🐘 DBRE: Create a C1 change request in the production repo and link to this issue. Example: production#8448 (closed)
  - [ ] Ensure the CR is reviewed by the 🚑 EOC
- [ ] 🐘 DBRE: Ensure this issue has been created on https://ops.gitlab.net/gitlab-com/gl-infra/db-migration, since gitlab.com could potentially be unavailable during the rollout!
- [ ] 🐬 SRE: Create a merge request that may be needed in case of rollback and link it in the rollback section below
- [ ] 🏆 Quality: Check that you have Maintainer or Owner permission in https://ops.gitlab.net/gitlab-org/quality to be able to trigger the Smoke QA pipeline in schedules (Staging, Production)
T minus 2 days (2023-09-07 14:00 UTC)
- [ ] 🐘 DBRE: Disable the DDL-related feature flags:
  1. [ ] Disable feature flags by typing the following into #production:
     - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags true`
- [ ] 🐘 DBRE: Inform the database team that the DDL feature flags have been disabled until the CR is complete. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  > Hi @gl-database, Please note that `execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions tasks will be disabled in the `PRODUCTION` environment, as we are carrying out Postgres upgrades to the database layer between `2023-09-09 14:00 UTC` and `2023-09-09 19:00 UTC`. We will re-enable the feature flags once work is complete. Thanks!
T minus 1 day (2023-09-08 14:00 UTC)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > Reminder: Tomorrow, we will be undergoing scheduled maintenance to our main database layer. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send a message to #social_media_action to retweet from the @gitlabstatus Twitter account.
  - Message:
    > Hi team, please retweet this from our status page to GitLab Twitter about the scheduled maintenance that is taking place tomorrow: {TWEET_LINK}
  - [ ] @gitlab retweeted from @gitlabstatus
- [ ] 🏆 Quality: Confirm that our smoke tests are passing on the current cluster
- [ ] 🐘 DBRE: Clean up the destination GCS backup location to avoid conflicts in wal-g (IMPORTANT: perform this action pairing with another DBRE/SRE to make sure that you are deleting the right location)
- [ ] 🐘 DBRE: Initiate a full backup (using wal-g) on the new v14 Patroni main cluster (to verify progress, see the log-tail sketch below):
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a wal-g backup:
    ```
    sudo su - gitlab-psql
    tmux new -s PGBasebackup
    nohup /opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1 &
    ```
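To confirm the backup is progressing, one option is to follow its log — the path comes from the command above; the exact completion message depends on the wal-g version:

```bash
# Follow the wal-g backup log; a finished run ends with a backup-push completion entry.
sudo tail -f /var/log/wal-g/wal-g_backup_push.log
```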
T minus 14 hours (2023-09-09 00:00 UTC)
Prepare the environment
- [ ] 🔪 Playbook-Runner: Get the console VM ready for action
  - [ ] SSH to the console VM in gprd
  - [ ] Configure the dbupgrade user
    - [ ] Disable screen sharing to reduce the risk of exposing the private key
    - [ ] Change to user dbupgrade: `sudo su - dbupgrade`
    - [ ] Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - [ ] `chmod 600 ~/.ssh/id_dbupgrade`
    - [ ] Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - [ ] Repeat the same steps on the target leader (it also has to have the private key)
    - [ ] Re-enable screen sharing if beneficial
  - [ ] Start or resume the tmux session: `tmux a -t pg14 || tmux new -s pg14`
  - [ ] Create an access token with at least `read_repository` scope for the next step
  - [ ] Clone repos:
    ```
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout latest_stable
    ```
  - [ ] Ensure you have the prerequisites installed: `sudo apt install ansible`
  - [ ] Ensure that Ansible can talk to all the hosts in gprd-main:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
    ```
  - [ ] In advance, run pre-checks and upgrade-check, and pre-install packages, to ensure that everything is ready for the upgrade:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-main.yml \
      upgrade.yml -e "pg_old_version=12 pg_new_version=14" \
      --tags "pre-checks, packages, upgrade-check" 2>&1 \
      | ts | tee -a ansible_upgrade_pre_checks_gprd_main_$(date +%Y%m%d).log
    ```
  - [ ] Refresh tmux command and shortcut knowledge: https://tmuxcheatsheet.com/ (a quick reference follows this list). To detach from tmux without stopping it, press Ctrl-b, then d.

You shouldn't see any failed hosts!
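For reference, the tmux operations used throughout this runbook (standard default key bindings):

```bash
tmux new -s pg14        # create a named session
tmux a -t pg14          # attach to / resume the session
# Detach (leave tmux running):  Ctrl-b, then d
# Split pane horizontally:      Ctrl-b, then "
# Switch between panes:         Ctrl-b, then an arrow key
```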
Postgres Upgrade rollout
Pre Postgres upgrade checks
- [ ] 🐺 Coordinator: Check if disallow_database_ddl_feature_flags is ENABLED:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags`
- [ ] 🐺 Coordinator: Check that the underlying DDL migration, partitioning and reindexing features were disabled by disallow_database_ddl_feature_flags:
  - [ ] Open a new Rails console
    - PRODUCTION: URL production.teleport.gitlab.net or tsh:
      ```
      tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
      tsh ssh rails-ro@console-ro-01-sv-gprd
      ```
  - [ ] Paste the script in the console, then run `check`:
    ```ruby
    def output(name, value)
      color = value ? '31' : '32'
      result = value ? 'enabled' : 'disabled'
      puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
    end

    def check
      ActiveRecord::Base.logger = nil
      output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
      output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
      output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
      output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
      output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

      is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
      output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
      output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

      is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
      output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

      is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
      output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
      output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

      is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
      output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

      is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
      output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
      output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
    end

    check # run the checks
    ```
  - [ ] Check the output - all workers/tasks should be disabled, for example:
    ```
    Database::BatchedBackgroundMigration::MainExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiDatabaseWorker is disabled.
    Database::BatchedBackgroundMigrationWorker is disabled.
    Gitlab::Database::Reindexing is disabled.
    BackgroundMigration::CiDatabaseWorker is disabled.
    BackgroundMigrationWorker is disabled.
    rake gitlab:db:execute_async_index_operations is disabled.
    rake gitlab:db:validate_async_constraints is disabled.
    Gitlab::Database::AsyncConstraints is disabled.
    Gitlab::Database::AsyncIndexes is disabled.
    Gitlab::Database::Partitioning#sync_partitions is disabled.
    Gitlab::Database::Partitioning#drop_detached_partitions is disabled.
    ```
- [ ] 🐘 DBRE: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-2004 until the end of the maintenance:
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `25h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-2004"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🐘 DBRE: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-v14 until the end of the maintenance:
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `25h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-v14"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
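If you prefer a CLI over the alerts.gitlab.net UI, the same silence can be expressed with amtool — a hedged sketch, assuming amtool is installed and can reach the Alertmanager behind alerts.gitlab.net:

```bash
# Same matchers as the first silence above, as an amtool invocation.
amtool silence add \
  --alertmanager.url=https://alerts.gitlab.net \
  --duration=25h \
  --comment="PG14 upgrade: production#16266" \
  'env="gprd"' 'type="gprd-patroni-main-2004"' \
  'alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"'
```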
- [ ] 🐘 DBRE: Monitor which pgbouncer pool has connections: Thanos
- [ ] 🐘 DBRE: Check if anyone except the application is connected to the source primary, and interrupt them:
  - [ ] Log in to the source primary: `ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal`
  - [ ] Check all connections that are not gitlab:
    ```
    gitlab-psql -c "
      select pid, client_addr, usename, application_name, backend_type,
             clock_timestamp() - backend_start as connected_ago, state,
             left(query, 200) as query
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and usename <> 'gitlab'
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
  - [ ] If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.
  - [ ] Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```
    gitlab-psql -c "
      select pg_terminate_backend(pid)
      from pg_stat_activity
      where pid <> pg_backend_pid()
        and usename <> 'gitlab'
        and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
        and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
        and application_name <> 'Patroni'
    "
    ```
- [ ] 🐘 DBRE: Monitor the Primary Leader and Standby Leader PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:
    ```
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
  - [ ] On the v14 leader, start a loop to terminate autovacuum workers, to unblock concurrent vacuumdb workers attempting to ANALYZE tables after pg_upgrade:
    ```
    while sleep 10; do
      gitlab-psql -XAtc "
        select query, pid, pg_terminate_backend(pid)
        from pg_stat_activity
        where query like 'autovacuum: VACUUM % (to prevent wraparound)'" 2>&1 \
      | ts | sudo tee -a /var/opt/gitlab/autovacuum_terminator_$(date +%Y%m%d).log
    done
    ```
Postgres Upgrade
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
For this part, since each cluster takes 30-40 minutes, we will trigger the upgrades in parallel to save time.
UPGRADE – execute!
- [ ] 🔪 Playbook-Runner: Press Ctrl-b then up to go to the first terminal.
- [ ] 🔪 Playbook-Runner: Run the Ansible playbook for upgrading the gprd-main cluster:
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=12 pg_new_version=14" 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_main_$(date +%Y%m%d).log
  ```
Post Postgres upgrades verification
You can execute the following steps as soon as their respective upgrades in the previous step have finished executing.
- [ ] 🐘 DBRE: Check logical replication lag, and wait for the clusters to get in sync: PG14 Upgrade Dashboard
- [ ] 🐘 DBRE: Ensure the gprd-main cluster is in the desired state.
  - [ ] Log in to node 01: `ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal`
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐘 DBRE: On the v14 leader, stop the monitoring-terminate loop for autovacuum workers - in psql, press Ctrl-C
- [ ] 🐘 DBRE: Trigger a GCS snapshot on the new v14 Patroni main cluster:
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a manual GCS snapshot:
    ```
    sudo su - gitlab-psql
    tmux new -s GCSSnapshot
    /usr/local/bin/gcs-snapshot.sh
    ```
- [ ] 🐬 SRE: Merge the MR that updates the PostgreSQL dirs and binaries references in Chef for patroni-main-v14. First confirm there are no errors in the merge pipeline. If the MR was previously merged and reverted, merge it again.
  - MR for patroni-main-v14: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3939
- [ ] 🐬 SRE: Ensure that the changes merged in the previous step have been deployed to the Chef server before re-enabling Chef, by confirming that the linked master pipeline for ops.gitlab.net completed successfully.
- [ ] 🐬 SRE: Re-enable Chef on all gprd-main nodes: `knife ssh "role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"`
- [ ] 🐬 SRE: Confirm chef-client is enabled on all nodes: thanos link
- [ ] 🐬 SRE: Run chef-client on the Patroni nodes: `knife ssh "role:gprd-base-db-patroni-main-v14" "sudo chef-client"`
  - [ ] Confirm that chef-client ran on all nodes: thanos link
- [ ] 🐬 SRE: Confirm:
  - [ ] No errors while running chef-client (thanos link) and we still have the v14 binary:
    `knife ssh "roles:gprd-base-db-patroni-main-v14" "sudo /usr/lib/postgresql/14/bin/postgres --version"`
    - Output should show `postgres (PostgreSQL) 14.x (Ubuntu 14.x-x.pgdg20.04+1)` for all nodes
  - [ ] The Consul service endpoint db-replica-v14.service.consul. points to the v14 nodes, and the Consul service endpoint db-replica.service.consul. points to the v12 replica nodes (example output below):
    ```
    dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 db-replica-v14.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni-v14.service.consul. SRV +short
    ```
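For reference, each `dig ... SRV +short` call prints one line per backing node in Consul's standard SRV answer format; a hedged illustration (hostnames and datacenter suffix are illustrative only):

```bash
$ dig @127.0.0.1 -p 8600 db-replica-v14.service.consul. SRV +short
# priority weight port target
1 1 5432 patroni-main-v14-103-db-gprd.node.gprd.consul.
1 1 5432 patroni-main-v14-104-db-gprd.node.gprd.consul.
```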
- [ ] 🐬 SRE: Stop Chef on both old and new clusters, on all nodes, before we execute the switchover: `knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"`
- [ ] 🐬 SRE: Confirm chef-client is disabled: thanos link
- [ ] 🐘 DBRE: A KNOWN ISSUE (TODO to improve) - at this point, it is very likely that logical replication is broken, because .pgpass on the target leader has only 1 line again (Chef removed the 2nd line, which is needed to connect to the source leader). Restore it manually by copying the existing line and replacing the leading `localhost` with `*`.
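A minimal sketch of that manual fix, assuming the file lives at `~/.pgpass` of the gitlab-psql user and currently contains exactly one `localhost` line — verify the resulting file by eye before continuing:

```bash
# On the TARGET (v14) leader:
sudo su - gitlab-psql
cp ~/.pgpass ~/.pgpass.bak                # keep a backup of the current file
line=$(grep -m1 '^localhost:' ~/.pgpass)  # the single remaining line
echo "${line/#localhost/*}" >> ~/.pgpass  # same line, host field replaced with *
chmod 600 ~/.pgpass                       # .pgpass must not be group/world readable
```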
- [ ] 🐘 DBRE: Restart the target Patroni cluster nodes if the `gitlab-patronictl list` command shows `Pending restart` is required. A successful cluster restart will display `Success: restart on member` for each cluster member, and a subsequent `gitlab-patronictl list` will no longer show `Pending restart` required:
  ```
  sudo gitlab-patronictl list
  sudo gitlab-patronictl restart gprd-patroni-main-v14 --force
  sudo gitlab-patronictl list
  ```
- [ ] 🐘 DBRE: Check logical replication lag, and wait for the clusters to get in sync: PG14 Upgrade Dashboard
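If the dashboard is unavailable, the lag can also be read directly on the SOURCE (v12) leader — a minimal sketch, assuming the playbook's logical slot is the only logical slot on the primary:

```bash
# WAL bytes the logical subscriber has not yet confirmed, per logical slot.
gitlab-psql -Xc "
  select slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
  where slot_type = 'logical'"
```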
Start data corruption check - pg_amcheck
- [ ] 🐘 DBRE: On the v14 replica nodes only, run pg_amcheck (inside tmux and as a nohup command):
  - [ ] On each replica: `sudo su - gitlab-psql`, then start or resume the tmux session: `tmux a -t pg_amcheck || tmux new -s pg_amcheck`
    ```
    export PGOPTIONS="-c statement_timeout=30min"
    cd /tmp
    nohup time /usr/lib/postgresql/14/bin/pg_amcheck -p 5432 -h localhost -U gitlab-superuser -d gitlabhq_production -j 96 --verbose -P --heapallindexed 2>&1 | tee -a /var/tmp/pg_amcheck.$(date "+%F-%H-%M").log &
    tail -f /var/tmp/pg_amcheck.$(date "+%F-%H-%M").log
    ```
  - [ ] Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck and start it again with a smaller value for -j
  - [ ] IMPORTANT: make sure you are not running pg_amcheck on the v14 Writer/Primary node, as this would cause logical replication lag on the target and spikes of rollbacks and errors
T minus 3 hours (2023-09-09 11:00 UTC)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > We will be undergoing scheduled maintenance to our main database layer in 3 hours. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] ☎ Comms-Handler: Send on Slack #whats-happening-at-gitlab:
  - Message:
    > :loudspeaker: *Postgres upgrade for our database clusters is scheduled for 2023-09-09 between 14:00 UTC and 19:00 UTC* :rocket: This is taking place in 3 hours :hourglass_flowing_sand: :hammer_and_wrench: *What to expect?* GitLab.com will be available but users may experience degraded performance during the maintenance window. If you experience any issues likely related to the upgrade in the next few days after the upgrade, please open an issue and reach the upgrade team in the Slack channel #pg_upgrade. You can follow our issue link on ops.gitlab.net for the upgrade.
- [ ] ☎ Comms-Handler: Share the message from #whats-happening-at-gitlab to the following channels:
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
- [ ] 🐘 DBRE: Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck
T minus 1 hour (2023-09-09 13:00 UTC)
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence alerts on v14 nodes for the duration of the change + 1 hour:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `6h`
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-main-v14.*"`
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence alerts on v12 nodes for 2 weeks:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `341h`
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-main-2004.*"`
- [ ] 🔪 Playbook-Runner: Add the following silence at https://alerts.gitlab.net to silence Sidekiq alerts for the duration of the change + 1 hour:
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `6h`
  - Matchers:
    - `env="gprd"`
    - `alertname="SidekiqServiceSidekiqExecutionErrorSLOViolationSingleShard"`
    - `component="sidekiq_execution"`
- [ ] 🔪 Playbook-Runner: Schedule a job to set gitlab_maintenance_mode in a node exporter for the duration of the upgrade window:
  - [ ] SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
  - [ ] Schedule the jobs (`at` reads the command to schedule from stdin):
    ```
    sudo su -
    echo 'echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom' | at -t 202309091400
    echo 'echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom' | at -t 202309091900
    cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
    atq
    ```
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > We will be undergoing scheduled maintenance to our main database layer in 1 hour. The maintenance will take up to 5 hours starting from 14:00 UTC to 19:00 UTC. GitLab.com will be available but users may experience degraded performance during the maintenance window. We apologize in advance for any inconvenience this may cause. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266>
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Post to #announcements on Slack:
  - Message:
    > Scheduled maintenance to our main database layer starts in an hour, lasting up to 5 hours, from 14:00 UTC to 19:00 UTC.
- [ ] PRODUCTION ONLY ☁ 🔪 Playbook-Runner: Create a maintenance window in PagerDuty with the following:
  - Which services are affected?
  - Why is this maintenance happening? Performing Postgres cluster upgrades, so silencing the pager.
  - Select "Start at a scheduled time":
    - Timezone: (UTC+00:00) UTC
    - Start: 09/09/2023 | 02:00 PM
    - End: 09/09/2023 | 07:00 PM
- [ ] 🐬 SRE: Check that all needed Chef MRs are rebased and contain the proper changes.
  - [ ] Post-upgrade MR, to change the cluster to use PG14 roles:
    - MR for patroni-main-v14: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3939
  - [ ] Post-switchover MR, to configure Consul and Prometheus:
    - MR for gprd-patroni-main: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3904
- [ ] 🐘 DBRE: Ensure that we have a successful full WAL-G backup that has taken place in the last 24 hours for each cluster: Thanos Graph
  - If you see 2 rows (one for each cluster: MAIN + CI + REGISTRY), then a gap, then another 2 rows, you have a recent (< 24 hours) successful backup. The gap is the period of time when the backups were executing. You can use a timestamp converter to turn the timestamps into human-readable date/time if you want to check when the backup finished.
  - If you currently see no lines (an empty result), it's possible that the backups are still running OR that they have failed, so check the following:
    - Check this Thanos graph to see the start time of the backup job - you should be able to see an increase of the value every time the backup starts (around midnight). If the last increase was more than 24 hours ago, the last backup hasn't started as it should have, and you'll need to investigate why the job failed to start.
    - If the backup job should have finished by now, check this Thanos graph to see the job-failed value for the last time the backup job ran. If the value is > 0 for any time in the past 24 hours, you'll need to investigate why the job failed.
    - The backup job is triggered by crond (user: gitlab-psql); any replica is eligible to run the job, but it only runs on the one that acquires the Consul lock.
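To see how the job is defined on a given replica, one hedged check (the exact crontab entry name may differ from what `grep` matches here):

```bash
# On any replica: list the gitlab-psql crontab and look for the wal-g backup entry.
sudo -u gitlab-psql crontab -l | grep -i backup
```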
- [ ] 🐘 DBRE: On the v14 replica nodes, review the pg_amcheck log files created in the previous steps to find any data corruption errors, and check amcheck progress:
  ```
  egrep 'ERROR:|DETAIL:|LOCATION:' /var/tmp/pg_amcheck.*.log
  cat /var/tmp/pg_amcheck.*.log | grep relations | tail -1
  ```
- [ ] 🐘 DBRE: Monitor logical replication lag; if the logical replication seems to be throttling, kill pg_amcheck
Postgres Upgrade Call
These steps will be run in a video call. Changes are made one at a time, and verified before moving on to the next step. All the steps will be executed from a console VM, and we should keep the session shared (tmux, screen, ...).
Whoever is performing a change should share their screen and explain their actions as they work through them. Everyone else should watch closely for mistakes or errors! A few things to keep an especially sharp eye out for:
- Exposed credentials (except short-lived items like 2FA codes)
- Running commands against the wrong hosts
- Navigating to the wrong pages in web browsers (gstg vs. gprd, etc)
Remember that the call will be recorded. We will consider making it public after confirming that no SAFE data was leaked. If you see something happening that shouldn't be public, mention it.
Roll call
- [ ] 🐺 Coordinator: Mark the change request as /label ~"change::in-progress"
- [ ] 🐺 Coordinator: Ensure everyone mentioned above is on the call
- [ ] 🐺 Coordinator: Ensure the video call room host is on the call
Data Corruption Checks
- [ ] 🐘 DBRE: On the v14 replica nodes, review the pg_amcheck log files created in the previous steps to find any data corruption errors and to get the last status of the progress:
  ```
  egrep 'ERROR:|DETAIL:|LOCATION:' /var/tmp/pg_amcheck.*.log
  cat /var/tmp/pg_amcheck.*.log | grep relations | tail -1
  ```
- [ ] 🐺 Coordinator: If there are any errors that indicate possible data corruption, abort the maintenance and proceed with the partial rollback of the steps already performed
- [ ] 🐘 DBRE: On the v14 replica nodes, kill the pg_amcheck processes:
  ```
  sudo killall pg_amcheck
  ps -ef | grep pg_amcheck
  ```
- [ ] 🐘 DBRE: On the v14 replica nodes, terminate any existing backend processes:
  ```
  sudo gitlab-psql -c "
    select pg_terminate_backend(pid)
    from pg_stat_activity
    where pid <> pg_backend_pid()
      and usename <> 'gitlab'
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
  "
  ```
- [ ] 🐺 Coordinator: [optional] Double-check that no pg_amcheck processes nor queries are running on the v14 replica nodes:
  ```
  ps -ef | grep pg_amcheck
  sudo gitlab-psql -c "
    select pid, usename, application_name, client_addr, substr(query,1,120) as query
    from pg_stat_activity
    where usename <> 'gitlab'
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
  "
  ```
Pre-maintenance Health Checks
- [ ] 🐺 Coordinator: Check if gitlab_maintenance_mode is enabled for gprd (Thanos link)
  - If it is not enabled, ask the 🔪 Playbook-Runner to enable it manually:
    - SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set gitlab_maintenance_mode=1 on the node exporter:
      ```
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] 🐺 Coordinator: Check if disallow_database_ddl_feature_flags is ENABLED:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags`
- [ ] 🐺 Coordinator: Check that the underlying DDL migration, partitioning and reindexing features were disabled by disallow_database_ddl_feature_flags:
  - [ ] Open a new Rails console
    - PRODUCTION: URL production.teleport.gitlab.net or tsh:
      ```
      tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
      tsh ssh rails-ro@console-ro-01-sv-gprd
      ```
  - [ ] Paste the same feature-flag check script from the "Pre Postgres upgrade checks" section above into the console, then run `check`
  - [ ] Check the output - all workers/tasks should be disabled, for example:
    ```
    Database::BatchedBackgroundMigration::MainExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiExecutionWorker is disabled.
    Database::BatchedBackgroundMigration::CiDatabaseWorker is disabled.
    Database::BatchedBackgroundMigrationWorker is disabled.
    Gitlab::Database::Reindexing is disabled.
    BackgroundMigration::CiDatabaseWorker is disabled.
    BackgroundMigrationWorker is disabled.
    rake gitlab:db:execute_async_index_operations is disabled.
    rake gitlab:db:validate_async_constraints is disabled.
    Gitlab::Database::AsyncConstraints is disabled.
    Gitlab::Database::AsyncIndexes is disabled.
    Gitlab::Database::Partitioning#sync_partitions is disabled.
    Gitlab::Database::Partitioning#drop_detached_partitions is disabled.
    ```
- [ ] 🐺 Coordinator: Ensure that there are no active critical alerts (S1) or open incidents:
  - PRODUCTION:
- [ ] 🐺 Coordinator: Check Sentry for errors that might indicate database problems: Production Sentry
- [ ] 🐘 DBRE: Ensure writes are happening on the Postgres/Patroni nodes in gprd-main: Thanos
- [ ] 🐘 DBRE: Check the Prometheus sanity-check metrics, verifying reads are all going to the correct hosts:
  - Index reads
    - Expected result: all queries going to the patroni-main-2004 cluster.
  - Sequential scans
    - Expected result: all queries going to the patroni-main-2004 cluster.
Terminals
You should already be in a tmux session. Only if you are planning to upgrade two clusters at the same time, open a second pane so that we have one terminal for each cluster.

- Press Ctrl-b then " in your existing tmux terminal to open a new pane, split horizontally.
- You can move between the panes by pressing Ctrl-b then the up or down arrow.
- [ ] 🔪 Playbook-Runner: Verify that the Ansible inventory is up to date and reflects the real state of the cluster.
  - [ ] Once again, ensure that Ansible can talk to all the hosts in gprd-main:
    ```
    cd ~/src/db-migration/pg-upgrade-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
    ```
    You shouldn't see any failed hosts!
T minus zero (2023-09-09 14:00 UTC)
We expect the maintenance window to last for up to 5 hours, starting from now.
Pre Switchover checks (T plus 0 min)
- [ ] 🐘 DBRE: Monitor which pgbouncer pool has connections: Thanos
- [ ] 🐘 DBRE: Monitor the Primary Leader and Standby Leader PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:
    ```
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐬 SRE: Confirm chef-client is disabled: thanos link
Evaluation of QA/Validations results - Commitment

If QA/Validations have succeeded, we can continue to "Complete the Upgrade and Switchover to v14". If any QA/Validation has failed, the team evaluates against the following commitment criteria:

Goals:

- The top priority is to maintain data integrity. Rolling back after the maintenance window has ended is very difficult, and will result in any changes made in the interim being lost.
- Failures with an unknown cause should be investigated further. If we can't determine the root cause within the maintenance window, we should roll back.

Postgres Upgrade - Switchover (T plus ~30 mins)

Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-upgrade-logical
SWITCHOVER – execute!
- [ ] 🔪 Playbook-Runner: Press Ctrl-b then up to go to the first terminal.
- [ ] 🔪 Playbook-Runner: Run the Ansible playbook to switch over the gprd-main cluster (it is interactive; reply "y" three times):
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover.yml 2>&1 \
    | ts | tee -a ansible_switchover_gprd_main_$(date +%Y%m%d).log
  ```
  - switchover.yml asks to confirm steps 3 times (type "y"). After each step, the DBRE has to verify traffic to the proper nodes, lack of errors, and DB latencies, and confirm the decision to continue.
- [ ] 🔪 Playbook-Runner: First "y": start R/O traffic to the new v14 replicas
- [ ] 🐬 SRE: After the 1st YES, the service db-replica.service.consul. should be pointing to both v12 and v14 replica nodes:
  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🐘 DBRE: Check read-only activity metrics for 15 minutes: PG14 Upgrade Dashboard
  - [ ] Compare the volume of standby TPS (commits) between Target and Source (the workload should split 50/50 between old and new replicas)
  - [ ] Compare the volume of rollback TPS - ERRORS
- [ ] 🐺 Coordinator: Wait for the hourly Write TPS spike to finish (around 18m past the hour)
- [ ] 🔪 Playbook-Runner: Second "y": stop R/O traffic to the old replicas
- [ ] 🐬 SRE: After the 2nd YES, the service db-replica.service.consul. should be pointing only to the new v14 replica nodes:
  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🐘 DBRE: Check the metrics for as long as we observe connections to the SOURCE standbys, for a minimum of 15 minutes. (This does not block the 🏆 Quality tests.)
- [ ] 🏆 Quality: Trigger the Smoke E2E suite against the environment that was upgraded. Production: "Four hourly smoke tests". This has an estimated duration of 15 minutes.
- [ ] 🏆 Quality: If the smoke tests fail, Quality should re-run the failed job to see if it is reproducible. In parallel, a 15-minute window to do an initial triage of the failure will be allotted. If Quality cannot determine that the failure is 'unrelated' within that period - stop and reschedule the whole procedure.
- [ ] 🐺 Coordinator: This is the point of no return! We will not execute a rollback after this point! Proceed wisely!
  - [ ] Get agreement of peers and consent of the 🎩 Head Honcho to proceed
- [ ] 🐺 Coordinator: Wait for the hourly Write TPS spike to finish (around 18m past the hour)
  - [ ] Only proceed with the R/W traffic primary switchover when logical replication lag < 500 MiB
- [ ] 🔪 Playbook-Runner: Third "y": R/W traffic, primary switchover.
- [ ] 🐬 SRE: After the 3rd YES, the master service master.patroni.service.consul. should be pointing only to the new v14 Leader/Writer node:
  ```
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  ```
- [ ] 🔪 Playbook-Runner: If the first "y" fails, repeat in "forced" mode:
  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover.yml -e "force_mode=true" 2>&1 \
    | ts | tee -a ansible_switchover_gprd_main_$(date +%Y%m%d)_FORCE_MODE.log
  ```
  - switchover.yml asks to confirm steps 3 times (type "y"). After each step, the DBRE has to verify traffic to the proper nodes, lack of errors, and DB latencies, and confirm the decision to continue.
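While waiting between the three confirmations above, it can help to keep the Consul answers refreshing on screen — a minimal sketch using watch:

```bash
# Re-run the two SRV lookups every 5 seconds while traffic is being moved.
watch -n 5 '
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
'
```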
Post Postgres Switchover verification
You can execute the following steps as soon as their respective upgrades in the previous step have finished executing.

- [ ] 🐘 DBRE: Ensure the main cluster is in the desired state.
  - [ ] Log in to node 01: `ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal`
  - [ ] Get the leader for each cluster: `sudo gitlab-patronictl list`
  - [ ] Connect via SSH to the previously identified leaders and tail the Postgres logs:
    ```
    ssh ...  # the leader host here
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
- [ ] 🐘 DBRE: On the v14 leader, stop the monitoring-terminate loop for autovacuum workers - in psql, press Ctrl-C.
- [ ] 🏆 Quality: Trigger the Smoke E2E suite against the environment that was upgraded. Production: "Four hourly smoke tests"
Metrics sanity check after switchover to v14
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > GitLab.com planned maintenance for the database layer is almost complete. We're continuing to verify that all systems are functioning correctly. Thank you for your patience.
- [ ] 🐬 SRE: Merge the MR that updates the main Teleport DB endpoint, the MR that updates the console config endpoint, and the MR that updates the source snapshot of the main-data-analytics DB. First confirm there are no errors in the merge pipeline. If an MR was previously merged and reverted, merge it again.
  - [ ] MR for main teleport DB endpoint: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!2790 (merged)
  - [ ] MR for main console config endpoint: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3916
  - [ ] MR for main snapshot image: https://gitlab.com/gitlab-com/gl-infra/data-server-rebuild-ansible/-/merge_requests/54
- [ ] 🐘 DBRE: Ensure writes are happening on the Postgres/Patroni nodes in gprd-main: Thanos
- [ ] 🐘 DBRE: Check the Prometheus sanity-check metrics, verifying reads are all going to the correct hosts:
  - Index reads
    - Expected result: all queries going to the patroni-main-v14 cluster.
  - Sequential scans
    - Expected result: all queries going to the patroni-main-v14 cluster.
- [ ] 🐘 DBRE: Check Sentry for errors that might indicate database problems: Production Sentry
- [ ] 🐬 SRE: Merge the MR that updates Consul and Prometheus. First confirm there are no errors in the merge pipeline. If the MR was previously merged and reverted, merge it again.
  - MR for gprd-patroni-main: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3904
- [ ] 🐬 SRE: Ensure that the changes merged in the previous step have been deployed to the Chef server before re-enabling Chef, by confirming that the linked master pipeline for ops.gitlab.net completed successfully.
- [ ] 🐬 SRE: Start cron.service on all gprd-main nodes:
  ```
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl is-active cron.service"
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl start cron.service"
  knife ssh "role:gprd-base-db-patroni-main-v14" "sudo systemctl is-active cron.service"
  ```
- [ ] 🐬 SRE: Re-enable Chef on all gprd-main nodes: `knife ssh "role:gprd-base-db-patroni-main-2004 OR role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"`
- [ ] 🐬 SRE: Confirm Chef is enabled on all nodes: thanos link
- [ ] 🐬 SRE: Run chef-client on the Patroni nodes: `knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client"`
- [ ] 🐬 SRE: Confirm:
  - [ ] No errors while running chef-client (thanos link) and we still have the v14 binary:
    `knife ssh "roles:gprd-base-db-patroni-main-v14" "sudo /usr/lib/postgresql/14/bin/postgres --version"`
  - [ ] The service db-replica.service.consul. should be pointing only to the new v14 replica nodes, and the master service master.patroni.service.consul. should be pointing only to the new v14 Leader/Writer node:
    ```
    dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
    dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
    ```
Communicate
- [ ] 🐺 Coordinator: TODO Remove the broadcast message (if it's after the initial window, it has probably expired automatically)
- [ ] PRODUCTION ONLY 📣 CMOC: Post update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - [ ] Click "Finish Maintenance" and send the following:
    - Message:
      > GitLab.com planned maintenance for the database layer is complete. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: In the same thread from the earlier post, post the following message and tick the checkbox "Also send to X channel" so the threaded message is published to the channel:
  - Message:
    > :done: *GitLab.com database layer maintenance upgrade is complete now.* :celebrate: We'll continue to monitor the platform to ensure all systems are functioning correctly.
  - [ ] #whats-happening-at-gitlab
  - [ ] #infrastructure-lounge (cc @sre-oncall)
  - [ ] #g_delivery (cc @release-managers)
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Send a message to #social_media_action to unpin the maintenance tweet on @gitlab Twitter:
  - Message:
    > Hi team :wave:, the maintenance upgrade is complete now, you may unpin the maintenance tweet on GitLab Twitter.
Complete the Upgrade (T plus ~290 mins)

Verification

- [ ] Start Post Switchover to v14 QA
- [ ] 🏆 Quality: Trigger the Smoke and Full E2E suites against the environment that was upgraded. Production: "Four hourly smoke tests" and "Twice daily full run"
Wrapping up

- [ ] PRODUCTION ONLY 🔪 Playbook-Runner: If the scheduled maintenance is still active in PagerDuty, click on Update then End Now.
- [ ] 🔪 Playbook-Runner: Remove the silences on fqdn=~"patroni-main-v14.*" that we created during this process from https://alerts.gitlab.net (Important: don't remove the silence on the SOURCE nodes)
- [ ] 🔪 Playbook-Runner: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-2004 for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `336h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-2004"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🐘 DBRE: Initiate a full backup (using wal-g) and trigger a GCS snapshot on the new v14 Patroni main cluster:
  - [ ] SSH to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a wal-g backup:
    ```
    sudo su - gitlab-psql
    tmux new -s PGBasebackup
    nohup /opt/wal-g/bin/backup.sh >> /var/log/wal-g/wal-g_backup_push.log 2>&1 &
    ```
  - [ ] Open another SSH session to patroni-main-v14-102-db-gprd.c.gitlab-production.internal
  - [ ] Run a manual GCS snapshot:
    ```
    sudo su - gitlab-psql
    tmux new -s GCSSnapshot
    /usr/local/bin/gcs-snapshot.sh
    ```
- [ ] 🏆 Quality: Quality team (after an hour): Check that the Smoke and Full E2E suites have passed.
- [ ] 🏆 Quality: Trigger the smoke tests one more time, now that Chef has had time to run.
- [ ] PRODUCTION ONLY ☎ Comms-Handler: Notify our customers over the Slack channel that the Postgres upgrade finished, and request that they validate GitLab.com.
- [ ] PRODUCTION ONLY 🐘 DBRE: Ping @NikolayS on the gitlab.com CR (production#16266 (closed)) (and/or in #database-lab) that the work is complete so he can update the DLE environment.
- [ ] PRODUCTION ONLY 🐬 SRE: Confirm that Teleport access works properly and we're hitting the right clusters:
  ```
  # Log in to Teleport
  tsh login --add-keys-to-agent=no --proxy=production.teleport.gitlab.net --request-roles=database-ro-gprd --request-reason="Testing DB connectivity after PG14 Upgrade - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"

  query="select setting from pg_settings where name='cluster_name'"

  # Check MAIN
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-main-primary-gprd --add-keys-to-agent=no
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-main-replica-gprd --add-keys-to-agent=no

  # Check CI
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-ci-primary-gprd --add-keys-to-agent=no
  echo "${query}" | tsh db connect --db-user=console-ro --db-name=gitlabhq_production db-ci-replica-gprd --add-keys-to-agent=no
  ```
  You should see v14 as part of the cluster name.
- [ ] 🐘 DBRE: Update the wal-g daily restore schedule for the [gprd] - [main] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - [ ] Change the following variables:
    - `PSQL_VERSION = 14`
    - `BACKUP_PATH = ?` (? = use the "directory" from the new v14 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gprd-postgres-backup/pitr-walg-main-v14)
- [ ] 🐘 DBRE: Enable the feature flags by typing the following into #production:
  - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags false`
- [ ] 🐘 DBRE: Inform the database team that the CR is completed and that the background migrations and reindexing feature flags have been re-enabled. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  > Hi @gl-database, Please note that we have completed the work for this CR in the `gprd` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in `PRODUCTION`. Could you please confirm that they have been re-enabled correctly? Thanks!
- [ ] 🐬 SRE: Check that the underlying DDL features were ENABLED again by disabling disallow_database_ddl_feature_flags:
  - [ ] On Slack: `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED
  - [ ] On the Rails console:
    - [ ] Open a new Rails console
      - PRODUCTION: URL production.teleport.gitlab.net or tsh:
        ```
        tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
        tsh ssh rails-ro@console-ro-01-sv-gprd
        ```
    - [ ] Paste the same feature-flag check script from the "Pre Postgres upgrade checks" section above into the console, then run `check`
    - [ ] Check the output - all workers/tasks should be enabled, for example:
      ```
      Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
      Database::BatchedBackgroundMigrationWorker is enabled.
      Gitlab::Database::Reindexing is enabled.
      BackgroundMigration::CiDatabaseWorker is enabled.
      BackgroundMigrationWorker is enabled.
      rake gitlab:db:execute_async_index_operations is enabled.
      rake gitlab:db:validate_async_constraints is enabled.
      Gitlab::Database::AsyncConstraints is enabled.
      Gitlab::Database::AsyncIndexes is enabled.
      Gitlab::Database::Partitioning#sync_partitions is enabled.
      Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
      ```
- [ ] **🐬 SRE**: We have a separate issue to rebuild each cluster's DR Archive and Delayed replicas. We will use the following issue link to track rebuilding the main cluster's DR Archive and Delayed replicas from the most recent v14 database backup of the main cluster. This should be completed within the next couple of working days. TODO add links for gprd
- [ ] **🐘 DBRE**: Logical replication from `SOURCE` to `TARGET` should have been destroyed by the switchover; check, and destroy it if it was not (see the note after this step on why the drop is done in three statements):
  - [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the subscription (if it still exists) for logical replication:
    - [ ] Check if the subscription still exists:

      ```shell
      gitlab-psql \
        -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
      ```

    - [ ] Drop the logical replication subscription:

      ```shell
      gitlab-psql \
        -Xc "alter subscription logical_subscription disable" \
        -Xc "alter subscription logical_subscription set (slot_name = none)" \
        -Xc "drop subscription logical_subscription"
      ```
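> Note: the three-statement sequence (disable, detach the slot with `slot_name = none`, then drop) is deliberate. A plain `DROP SUBSCRIPTION` also tries to drop the associated replication slot on the publisher, which fails if the publisher is unreachable or already decommissioned. Detaching the slot first lets the subscription be dropped locally; the slot itself is then removed on the publisher side in the next step.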
  - [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the publication and logical_replication_slot:
    - [ ] Check if the publication and replication slots still exist:

      ```shell
      gitlab-psql \
        -Xc "select pubname from pg_publication" \
        -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
      ```

    - [ ] Drop the publication, the replication slot, and the test tables:

      ```shell
      gitlab-psql \
        -Xc "drop publication logical_replication" \
        -Xc "select pg_drop_replication_slot('logical_replication_slot') from pg_replication_slots where slot_name = 'logical_replication_slot'" \
        -Xc "drop table if exists test_publication" \
        -Xc "drop table if exists test_replication"
      ```
### Reverse replication validation

- [ ] **🐺 Coordinator**: Coordinate with **🐘 DBRE** to make sure we stay in the current state for an hour and continue to run reverse replication from `v14` to `v12`, per https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23986#note_1524934967
- [ ] **🐘 DBRE**: TODO @NikolayS: add steps to validate that reverse replication is working as expected (an interim sketch follows this step).
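Until the validation steps above are added, one possible interim sanity check is to watch the lag of the logical slot feeding v12 on the v14 Leader/Writer. This is a sketch only; it assumes the reverse publication uses the `reverse_logical_replication_slot` slot created earlier in this rollout:

```shell
# On the TARGET (v14) Leader/Writer: lag of the logical slot replicating to v12.
# The slot should be active, and the lag should stay small and keep draining.
gitlab-psql \
  -Xc "select slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag from pg_replication_slots where slot_type = 'logical'"
```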
- [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the subscription (if it still exists) for reverse replication:
  - [ ] Check if the subscription still exists:

    ```shell
    gitlab-psql \
      -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
    ```

  - [ ] Drop the reverse logical replication subscription:

    ```shell
    gitlab-psql \
      -Xc "alter subscription reverse_logical_subscription disable" \
      -Xc "alter subscription reverse_logical_subscription set (slot_name = none)" \
      -Xc "drop subscription reverse_logical_subscription"
    ```
- [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the publication and reverse_logical_replication_slot for reverse replication:
  - [ ] Check if the publication and replication slots still exist:

    ```shell
    gitlab-psql \
      -Xc "select pubname from pg_publication" \
      -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
    ```

  - [ ] Drop the publication, the replication slot, and the test tables:

    ```shell
    gitlab-psql \
      -Xc "drop publication reverse_logical_replication" \
      -Xc "select pg_drop_replication_slot('reverse_logical_replication_slot') from pg_replication_slots where slot_name = 'reverse_logical_replication_slot'" \
      -Xc "drop table if exists test_publication" \
      -Xc "drop table if exists test_replication"
    ```
- [ ] **🐘 DBRE**: Shut down the SOURCE gprd-base-db-patroni-main-2004 cluster to avoid any risk of split-brain (a verification sketch follows):

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl stop patroni"
  ```
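To confirm the cluster is actually down, a hedged sketch using the same knife pattern as above (`systemctl is-active` exits non-zero for a stopped unit, hence the `|| true` so knife does not flag the command as failed):

```shell
# Every node should report "inactive".
knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl is-active patroni || true"
```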
- [ ] **🐺 Coordinator**: Check that gitlab_maintenance_mode is DISABLED for gprd (Thanos link)
  - If it is not disabled, ask **🔪 Playbook-Runner** to manually disable it (a verification sketch follows this list):
    - SSH to a console VM in `gprd` (eg. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set `gitlab_maintenance_mode=0` on the node exporter:

      ```shell
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
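To verify the metric was picked up before it shows in Thanos, one option (a sketch, assuming node_exporter on the console VM listens on its default port 9100) is to scrape it locally:

```shell
# The textfile collector should now expose the updated value.
curl -s http://localhost:9100/metrics | grep gitlab_maintenance_mode
# Expected: gitlab_maintenance_mode 0
```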
- [ ] **🐺 Coordinator**: Mark the change request as complete: `/label ~"change::complete"`
## Rollback (if required)
- [ ] **PRODUCTION ONLY** **📣 CMOC**: Post an update via the Status.io maintenance site and publish on `@gitlabstatus`. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:

    > Due to an issue during the planned maintenance for the database layer, we have initiated a rollback of the changes. We will provide an update once the rollback process is completed.
- [ ] **PRODUCTION ONLY** **☎ Comms-Handler**: Ask the IMOC or the Head Honcho if this message should be sent to any Slack rooms:
  - [ ] `#whats-happening-at-gitlab`
  - [ ] `#infrastructure-lounge` (cc `@sre-oncall`)
  - [ ] `#g_delivery` (cc `@release-managers`)
  - [ ] `#community-relations`
### Health check

- [ ] **🐺 Coordinator**: Ensure that there are no active critical alerts or open incidents:
  - PRODUCTION:
- [ ] **🔪 Playbook-Runner**: Verify that the Ansible inventory is up to date and reflects the real state of the cluster.
### Rollback Postgres Upgrade

- After the switchover there is NO reverse replication replicating data from PG14 to PG12!
- After enabling site traffic on the new cluster, all new changes to the database will exist only on the new cluster.
- There will be no rollback after the switchover!
- [ ] **🐘 DBRE**: Check if rollback is possible (a verification sketch follows this step):
  - [ ] We have not switched over
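A quick way to confirm which side is currently writable, sketched using the console conventions from this issue (`pg_is_in_recovery()` returns `f` only on a writable primary):

```shell
# On a node of each cluster: list the members, then check the recovery state locally.
sudo gitlab-patronictl list
sudo gitlab-psql -Xc "select pg_is_in_recovery()"
```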
- [ ] **🐘 DBRE**: Monitor which pgbouncer pool has connections (Thanos)
- [ ] **🐘 DBRE**: Start patroni on the source gprd-base-db-patroni-main-2004 cluster nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004" "sudo systemctl start patroni"
  ```
- [ ] **🐘 DBRE**: TODO @vitabaks @NikolayS Disable "read only" flag on v12 cluster:
- [ ] **🐘 DBRE**: Monitor the v14 and v12 PostgreSQL log files:
  - [ ] Log in to node 01 of each cluster:

    ```shell
    ssh patroni-main-2004-101-db-gprd.c.gitlab-production.internal
    ssh patroni-main-v14-101-db-gprd.c.gitlab-production.internal
    ```

  - [ ] Get the leader of each cluster:

    ```shell
    sudo gitlab-patronictl list
    ```

  - [ ] Connect via SSH to the previously identified leaders.
  - [ ] Tail the Postgres logs (an error-only filter is sketched after this list):

    ```shell
    sudo tail -f /var/log/gitlab/postgresql/postgresql.csv
    ```
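When log volume is high, it may help to narrow the tail to problems only. A sketch, relying on the severity strings that Postgres writes into its CSV log lines:

```shell
sudo tail -f /var/log/gitlab/postgresql/postgresql.csv | grep -iE "error|fatal|panic"
```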
### ROLLBACK – execute!

Goal: Set the gprd-main v12 cluster as the primary cluster
- [ ] **🔪 Playbook-Runner**: Execute the `switchover_rollback.yml` playbook to roll back to the v12 cluster (a non-destructive preview is sketched below):

  ```shell
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    switchover_rollback.yml -e "force_mode=true" 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gprd_main_$(date +%Y%m%d).log
  ```
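As a sanity check (which can also be run before the real execution), the play's scope can be previewed without making any changes. A sketch assuming the same checkout and inventory as above:

```shell
cd ~/src/db-migration/pg-upgrade-logical
# List the hosts and tasks the play would touch, without executing anything.
ansible-playbook -i inventory/gprd-main.yml switchover_rollback.yml --list-hosts --list-tasks
```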
- [ ] **🐘 DBRE**: Ensure writes are happening on Postgres/Patroni nodes in patroni-main-2004: Thanos
- [ ] **🐘 DBRE**: Check the Prometheus sanity-check metrics, confirming reads are all going to the correct hosts (a local cross-check is sketched after this list):
  - [ ] Index reads
    - Expected result: all queries going to the `patroni-main-2004` cluster.
  - [ ] Sequential scans
    - Expected result: all queries going to the `patroni-main-2004` cluster.
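As a local cross-check of the dashboard numbers, a sketch using the cumulative counters in `pg_stat_user_tables`; run it twice a minute apart on a replica in each cluster, and only the patroni-main-2004 cluster should show the values increasing:

```shell
sudo gitlab-psql \
  -Xc "select sum(idx_tup_fetch) as index_reads, sum(seq_tup_read) as seq_reads from pg_stat_user_tables"
```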
### Complete the rollback

- [ ] **🏆 Quality**: Confirm that our smoke tests are still passing (continue the rollback, as this might take an hour...)
- [ ] **🐬 SRE**: Revert the MR that changes Consul and Prometheus, if it was merged.
- [ ] **🐬 SRE**: Re-enable Chef on all gprd-main nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-2004 OR role:gprd-base-db-patroni-main-v14" "sudo chef-client-enable"
  ```
- [ ] **🐬 SRE**: Confirm chef-client is enabled on all nodes (thanos link)
- [ ] **🐬 SRE**: Run chef-client on the Patroni nodes:

  ```shell
  knife ssh "role:gprd-base-db-patroni-main-v14 OR role:gprd-base-db-patroni-main-2004" "sudo chef-client"
  ```

- [ ] **🐬 SRE**: Confirm there are no errors while running chef-client (thanos link)
- [ ] **PRODUCTION ONLY** **📣 CMOC**: Post an update via the Status.io maintenance site and publish on `@gitlabstatus`. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - [ ] Click "Finish Maintenance" and send the following:
    - Message:

      > GitLab.com rollback for the database layer is complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- [ ] **PRODUCTION ONLY** **☎ Comms-Handler**: Send the following message to the Slack rooms listed below:

  > GitLab.com rollback for the database layer is complete and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.

  - [ ] `#whats-happening-at-gitlab`
  - [ ] `#infrastructure-lounge` (cc `@sre-oncall`)
  - [ ] `#g_delivery` (cc `@release-managers`)
- [ ] **🐘 DBRE**: Enable feature flags by typing the following into `#production`:
  - PRODUCTION:
    - [ ] `/chatops run feature set disallow_database_ddl_feature_flags false`
- [ ] **🐘 DBRE**: Inform the database team that the CR was rolled back and that the background migration and reindexing feature flags have been re-enabled. Post the following comment on the gitlab.com CR (production#16266 (closed)):
  - PRODUCTION:

    > Hi @gl-database, please note that we have rolled back the work for this CR in the `gprd` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in `PRODUCTION`. Could you please confirm that they have been re-enabled correctly? Thanks!
- [ ] **🐬 SRE**: Check that the underlying DDL features came back after disabling `disallow_database_ddl_feature_flags`:
  - [ ] On Slack, `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED.
  - [ ] On a Rails console:
    - [ ] Open a new Rails console.
      - PRODUCTION: URL production.teleport.gitlab.net, or via tsh:

        ```shell
        tsh login --proxy=production.teleport.gitlab.net --request-roles=rails-ro --request-reason="Validate if Database Migration/Reindex Workers are disabled during PG14 upgrade: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266"
        tsh ssh rails-ro@console-ro-01-sv-gprd
        ```

    - [ ] Paste the script into the console:

      ```ruby
      def output(name, value)
        color = value ? '31' : '32'
        result = value ? 'enabled' : 'disabled'
        puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
      end

      def check
        ActiveRecord::Base.logger = nil
        output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
        output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
        output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
        output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
        output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

        is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
        output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
        output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

        is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
        output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

        is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
        output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
        output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

        is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
        output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

        is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
        output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
        output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
      end

      check # run the checks and print one line per worker/task
      ```

    - [ ] Check the output. All workers/tasks should be enabled, for example:

      ```
      Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
      Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
      Database::BatchedBackgroundMigrationWorker is enabled.
      Gitlab::Database::Reindexing is enabled.
      BackgroundMigration::CiDatabaseWorker is enabled.
      BackgroundMigrationWorker is enabled.
      rake gitlab:db:execute_async_index_operations is enabled.
      rake gitlab:db:validate_async_constraints is enabled.
      Gitlab::Database::AsyncConstraints is enabled.
      Gitlab::Database::AsyncIndexes is enabled.
      Gitlab::Database::Partitioning#sync_partitions is enabled.
      Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
      ```
- [ ] **🔪 Playbook-Runner**: On two nodes, console and target leader, remove the private keys temporarily placed in `~dbupgrade/.ssh`:

  ```shell
  rm ~dbupgrade/.ssh/id_rsa
  rm ~dbupgrade/.ssh/id_dbupgrade
  ```
- [ ] **🐘 DBRE**: On the TARGET cluster patroni-main-v14 Leader/Writer, drop the subscription (if it still exists) for logical replication:
  - [ ] Check if the subscription still exists:

    ```shell
    gitlab-psql \
      -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
    ```

  - [ ] Drop the logical replication subscription:

    ```shell
    gitlab-psql \
      -Xc "alter subscription logical_subscription disable" \
      -Xc "alter subscription logical_subscription set (slot_name = none)" \
      -Xc "drop subscription logical_subscription"
    ```
- [ ] **🐘 DBRE**: On the SOURCE cluster patroni-main-2004 Leader/Writer, drop the publication and logical_replication_slot:
  - [ ] Check if the publication and replication slots still exist:

    ```shell
    gitlab-psql \
      -Xc "select pubname from pg_publication" \
      -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
    ```

  - [ ] Drop the publication, the replication slot, and the test tables:

    ```shell
    gitlab-psql \
      -Xc "drop publication logical_replication" \
      -Xc "select pg_drop_replication_slot('logical_replication_slot') from pg_replication_slots where slot_name = 'logical_replication_slot'" \
      -Xc "drop table if exists test_publication" \
      -Xc "drop table if exists test_replication"
    ```
- [ ] **🐘 DBRE**: ADD the following silences at https://alerts.gitlab.net to silence `WALGBaseBackup` alerts in patroni-main-v14 for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-07T16:52:09.000Z`
  - Duration: `336h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-v14"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
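If the alerts.gitlab.net UI is unavailable, the same silence can in principle be created with `amtool`. A sketch only; it assumes amtool is installed and authorized against the Alertmanager behind alerts.gitlab.net (the exact API URL may differ), and the start time defaults to now:

```shell
amtool silence add \
  --alertmanager.url=https://alerts.gitlab.net \
  --duration=336h \
  --comment="PG14 upgrade: silence WAL-G base backup alerts on patroni-main-v14 (production#16266)" \
  env="gprd" type="gprd-patroni-main-v14" 'alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"'
```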
- [ ] **🐘 DBRE**: UPDATE the following silence at https://alerts.gitlab.net to silence alerts on v14 nodes for 2 weeks (14 days = 336 hours):
  - Start time: `2023-09-09T14:00:00.000Z`
  - Duration: `341h`
  - Matchers:
    - PRODUCTION:
      - `env="gprd"`
      - `fqdn=~"patroni-main-v14.*"`
- [ ] **🐘 DBRE**: DELETE the following silences at https://alerts.gitlab.net:
  - Matchers:
    - PRODUCTION:
      - `env="gprd"`
      - `fqdn=~"patroni-main-2004.*"`
- [ ] **🐺 Coordinator**: Check that gitlab_maintenance_mode is DISABLED for gprd (Thanos link)
  - If it is not disabled, ask **🔪 Playbook-Runner** to manually disable it:
    - SSH to a console VM in `gprd` (eg. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set `gitlab_maintenance_mode=0` on the node exporter:

      ```shell
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] **🐺 Coordinator**: Mark the production#16266 (closed) change request as aborted: `/label ~"change::aborted"`
## Extra details

### In case the Playbook-Runner is disconnected

As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in several ways. We tested the following approach to recovering the tmux session, updating the SSH agent, and taking over as a new Ansible user (a note on the tmux environment steps follows the list).
1. `ssh host`
2. Add your public SSH key to `/home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys`
3. `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16266` so that we don't override the above
4. `ssh -A PREVIOUS_PLAYBOOK_USERNAME@host`
5. `echo $SSH_AUTH_SOCK`
6. `tmux attach -t 0`
7. `export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>`
8. `<ctrl-b> :` then `set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>`
9. `export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME`
10. `<ctrl-b> :` then `set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>`
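The two `set-environment -g` steps matter because `export` only fixes the shell in the current pane: `set-environment -g` updates tmux's global session environment, so any new window or pane opened during the takeover also inherits the recovered `SSH_AUTH_SOCK` and `ANSIBLE_REMOTE_USER` values.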