[2024-04-11] GSTG - Patroni Switchover - Main and CI
Staging Change
Change Summary
We have built an Ansible playbook, part of the DBRE Toolkit, that switches the Patroni leader over to one of the available Patroni replicas with zero downtime and zero data loss.
In this change, we would like to validate the Patroni switchover playbook (switchover_patroni_leader.yml) for the gstg Main and CI clusters. We will validate the playbook and capture the results, as they will help us plan an identical Change Request for the Patroni Leader Switchover in the gprd environment.
Reference:
- DBRE Toolkit
- Patroni leader node Switchover
- Implement Postgres Data Checksums - Phase 2
- db-migration!511 (merged)
Change Details
- Services Impacted - Database, Service::Patroni, Service::PatroniCI
- Change Technician - @bshah11
- Change Reviewer - @cmcfarland @rhenchen.gitlab @alexander-sosna @NikolayS
- Time tracking - 90 minutes
- Downtime Component - No downtime is expected during the patroni switchover
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 90 minutes
-
Set label change::in-progress: /label ~change::in-progress
-
Prepare -
Create an alert silence for the gstg Main and CI patroni cluster nodes - https://alerts.gitlab.net/#/silences -
Get the console VM console-01-sv-gstg.c.gitlab-staging-1.internal ready for action -
SSH to the console VM (ssh -A console-01-sv-gstg.c.gitlab-staging-1.internal) -
Start or resume the tmux session: tmux a -t pg_switchover || tmux new -s pg_switchover
-
Create an access_token with at least read_repository scope for the next step -
Clone repos:
rm -rf ~/src && mkdir ~/src
cd ~/src
git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
-
Ensure you have the prerequisites installed: sudo apt install ansible
-
Ensure that Ansible can talk to all the hosts listed in the inventory file:
cd ~/src/db-migration/dbre-toolkit
ansible -i inventory/gstg-main.yml all -m ping
ansible -i inventory/gstg-ci.yml all -m ping
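The preparation steps install only ansible, while the later commands also rely on git, tmux and ts (ts ships in the moreutils package on Debian/Ubuntu). A minimal pre-flight sketch, assuming a POSIX shell on the console VM; the function name missing_tools is hypothetical and not part of the DBRE Toolkit:

```shell
# Hypothetical helper (not part of the DBRE Toolkit): print every
# command from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call for the tools used in this change:
#   missing_tools git tmux ansible ansible-playbook ts
# Empty output means every tool was found.
```

Running the check before the change window starts avoids discovering a missing ts halfway through the playbook run.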
-
-
Disable the DDL-related feature flags -
Disable feature flags by typing the following into #production -
/chatops run feature set disallow_database_ddl_feature_flags true --staging
-
-
Inform the database team that DDL feature flags have been disabled until the CR is complete. Post the following comment in the CR:
Hi @gl-database, Please note that the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, async_index and partition_manager_sync_partitions features and tasks will be disabled in the `STAGING` environment while we carry out the Patroni Leader Switchover. We will re-enable the feature flags once the CR is complete. Thanks!
-
Open a new rails console. SSH to the rails console VM (ssh -A console-ro-01-sv-gstg.c.gitlab-staging-1.internal) and run gitlab-rails console:
sudo su - gitlab-rails console
-
Paste the script in the console:
def output(name, value)
  color = value ? '31' : '32'
  result = value ? 'enabled' : 'disabled'
  puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
end

def check
  ActiveRecord::Base.logger = nil

  output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
  output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
  output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

  is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
  output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
  output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

  is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
  output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

  is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
  output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
  output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

  is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
  output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

  is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
  output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
  output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
end

check
-
Check the output - All workers/tasks should be disabled, for example:
Database::BatchedBackgroundMigration::MainExecutionWorker is disabled.
Database::BatchedBackgroundMigration::CiExecutionWorker is disabled.
Database::BatchedBackgroundMigration::CiDatabaseWorker is disabled.
Database::BatchedBackgroundMigrationWorker is disabled.
Gitlab::Database::Reindexing is disabled.
BackgroundMigration::CiDatabaseWorker is disabled.
BackgroundMigrationWorker is disabled.
rake gitlab:db:execute_async_index_operations is disabled.
rake gitlab:db:validate_async_constraints is disabled.
Gitlab::Database::AsyncConstraints is disabled.
Gitlab::Database::AsyncIndexes is disabled.
Gitlab::Database::Partitioning#sync_partitions is disabled.
Gitlab::Database::Partitioning#drop_detached_partitions is disabled.
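The inline expressions in the check script come in two spellings, !(disallow || flag_disabled) and disallow_disabled && flag_enabled, which are equivalent by De Morgan's law: a task counts as enabled only when disallow_database_ddl_feature_flags is off and the task's own flag is on. The sketch below reproduces that rule in shell as a truth table. The function names are hypothetical, and it covers only the expressions written out in the script; the worker enabled? methods are not shown in this document and are only assumed to follow the same rule.

```shell
# Hypothetical illustration of the rule used by the inline expressions
# in the check script. Arguments are the literal strings "true"/"false":
#   $1 = state of disallow_database_ddl_feature_flags
#   $2 = state of the task's own feature flag
ddl_task_state() {
  if [ "$1" = "false" ] && [ "$2" = "true" ]; then
    echo enabled
  else
    echo disabled
  fi
}

# Print the full truth table for the two flags.
ddl_truth_table() {
  for d in true false; do
    for f in true false; do
      printf 'disallow=%s flag=%s -> %s\n' "$d" "$f" "$(ddl_task_state "$d" "$f")"
    done
  done
}
```

Calling ddl_truth_table shows that both rows with disallow=true report disabled, which is why a single chatops command is enough to switch every listed task off.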
-
-
-
Main Cluster -
Validate the Patroni Leader Switchover playbook without performing the actual switchover, and capture all the logs and the patroni Service Apdex from the Patroni Dashboard.
cd ~/src/db-migration/dbre-toolkit
ansible -i inventory/gstg-main.yml all -m ping
ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml --skip-tags "patroni-switchover" 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
-
Perform the Patroni Leader Switchover, validate the options of the switchover playbook, and capture all the logs and the patroni Service Apdex from the Patroni Dashboard.
cd ~/src/db-migration/dbre-toolkit
ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
-
-
CI Cluster -
Validate the Patroni Leader Switchover playbook without performing the actual switchover, and capture all the logs and the patroni-ci Service Apdex from the Patroni Dashboard.
cd ~/src/db-migration/dbre-toolkit
ansible -i inventory/gstg-ci.yml all -m ping
ssh patroni-ci-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml --skip-tags "patroni-switchover" 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
ssh patroni-ci-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
-
Perform the Patroni Leader Switchover, validate the options of the switchover playbook, and capture all the logs and the patroni-ci Service Apdex from the Patroni Dashboard.
cd ~/src/db-migration/dbre-toolkit
ssh patroni-ci-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
ssh patroni-ci-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
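The Main and CI steps differ only in the cluster name and in whether --skip-tags "patroni-switchover" is passed. As a sketch, the two helpers below build the log file name and the playbook command from those two inputs, which can reduce copy-and-paste mistakes. The helper names and the "validate" keyword are hypothetical; the inventory paths, playbook name, tag and log name pattern are the ones used in the steps above.

```shell
# Hypothetical helpers; they only print strings and run nothing.

# $1 = cluster ("main" or "ci"), $2 = optional date stamp (default: today).
switchover_log_name() {
  printf 'ansible_switchover_patroni_leader_gstg-%s_%s.log\n' \
    "$1" "${2:-$(date +%Y%m%d)}"
}

# $1 = cluster, $2 = "validate" to skip the actual switchover.
switchover_cmd() {
  cmd="ansible-playbook -i inventory/gstg-$1.yml switchover_patroni_leader.yml"
  if [ "${2:-}" = "validate" ]; then
    cmd="$cmd --skip-tags \"patroni-switchover\""
  fi
  printf '%s\n' "$cmd"
}

# Example: print the validation command and log name for the CI cluster.
#   switchover_cmd ci validate
#   switchover_log_name ci
```

One design point worth noting: because the playbook output is piped through ts and tee, the shell reports the exit status of tee rather than that of ansible-playbook. In bash, running set -o pipefail beforehand makes a failed playbook run visible in $?.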
-
-
Enable the DDL-related feature flags -
Enable feature flags by typing the following into #production -
/chatops run feature set disallow_database_ddl_feature_flags false --staging
-
-
Inform the database team that the CR is completed and that the background migrations and reindexing feature flags have been re-enabled. Post the following comment in the CR:
Hi @gl-database, Please note that we have completed the work for this CR in the `gstg` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, async_index and partition_manager_sync_partitions features and tasks in `gstg`. Could you please confirm that they have been re-enabled correctly? Thanks!
-
Open a new rails console. SSH to the rails console VM (ssh -A console-ro-01-sv-gstg.c.gitlab-staging-1.internal) and run gitlab-rails console:
sudo su - gitlab-rails console
-
Paste the script in the console:
def output(name, value)
  color = value ? '31' : '32'
  result = value ? 'enabled' : 'disabled'
  puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
end

def check
  ActiveRecord::Base.logger = nil

  output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
  output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
  output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

  is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
  output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
  output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

  is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
  output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

  is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
  output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
  output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

  is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
  output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

  is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
  output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
  output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
end

check
-
Check the output - All workers/tasks should be enabled, for example:
Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
Database::BatchedBackgroundMigrationWorker is enabled.
Gitlab::Database::Reindexing is enabled.
BackgroundMigration::CiDatabaseWorker is enabled.
BackgroundMigrationWorker is enabled.
rake gitlab:db:execute_async_index_operations is enabled.
rake gitlab:db:validate_async_constraints is enabled.
Gitlab::Database::AsyncConstraints is enabled.
Gitlab::Database::AsyncIndexes is enabled.
Gitlab::Database::Partitioning#sync_partitions is enabled.
Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
-
-
Expire the alert silence created earlier for the gstg Main and CI patroni cluster nodes - https://alerts.gitlab.net/#/silences -
Set label change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) -
-
Expire the alert silence created earlier for the gstg Main and CI patroni cluster nodes - https://alerts.gitlab.net/#/silences -
No rollback steps are necessary, as the Patroni Leader Switchover playbook only performs the switchover. Just validate the health of the Patroni Main and CI clusters:
ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
ssh patroni-ci-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"
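When validating cluster health, one quick check is that each cluster reports exactly one leader. The sketch below extracts the leader from the list output. It assumes the pipe-delimited table that patronictl list normally prints, with Member in the first column and Role in the third; the function name patroni_leaders is hypothetical, and the column positions should be confirmed against real output before relying on it.

```shell
# Hypothetical helper: read `gitlab-patronictl list` output on stdin and
# print the Member column of every row whose Role column is "Leader".
# Assumed row layout: | Member | Host | Role | State | ...
patroni_leaders() {
  awk -F'|' '
    NF >= 4 {
      member = $2
      role = $4
      # Strip the padding around each cell.
      gsub(/^[ \t]+|[ \t]+$/, "", member)
      gsub(/^[ \t]+|[ \t]+$/, "", role)
      if (role == "Leader") print member
    }'
}

# Example:
#   ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal \
#     "sudo gitlab-patronictl list" | patroni_leaders
```

Comparing the printed member before and after the playbook run also confirms that the leader actually moved to a different node.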
-
Enable the DDL-related feature flags -
Enable feature flags by typing the following into #production -
/chatops run feature set disallow_database_ddl_feature_flags false --staging
-
-
Inform the database team that the CR is completed and that the background migrations and reindexing feature flags have been re-enabled. Post the following comment in the CR:
Hi @gl-database, Please note that we have completed the work for this CR in the `gstg` environment. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, async_index and partition_manager_sync_partitions features and tasks in `gstg`. Could you please confirm that they have been re-enabled correctly? Thanks!
-
Open a new rails console. SSH to the rails console VM (ssh -A console-ro-01-sv-gstg.c.gitlab-staging-1.internal) and run gitlab-rails console:
sudo su - gitlab-rails console
-
Paste the script in the console:
def output(name, value)
  color = value ? '31' : '32'
  result = value ? 'enabled' : 'disabled'
  puts "\e[33m#{name} is\e[0m \e[#{color}m#{result}.\e[0m"
end

def check
  ActiveRecord::Base.logger = nil

  output('Database::BatchedBackgroundMigration::MainExecutionWorker', Database::BatchedBackgroundMigration::MainExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiExecutionWorker', Database::BatchedBackgroundMigration::CiExecutionWorker.new.send(:enabled?))
  output('Database::BatchedBackgroundMigration::CiDatabaseWorker', Database::BatchedBackgroundMigration::CiDatabaseWorker.enabled?)
  output('Database::BatchedBackgroundMigrationWorker', Database::BatchedBackgroundMigrationWorker.enabled?)
  output('Gitlab::Database::Reindexing', Gitlab::Database::Reindexing.enabled?)

  is_execute_background_migrations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:execute_background_migrations, type: :ops))
  output('BackgroundMigration::CiDatabaseWorker', is_execute_background_migrations_enabled)
  output('BackgroundMigrationWorker', is_execute_background_migrations_enabled)

  is_database_async_index_operations_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:database_async_index_operations, type: :ops))
  output('rake gitlab:db:execute_async_index_operations', is_database_async_index_operations_enabled)

  is_database_async_foreign_key_validation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_foreign_key_validation, type: :ops)
  output('rake gitlab:db:validate_async_constraints', is_database_async_foreign_key_validation_enabled)
  output('Gitlab::Database::AsyncConstraints', is_database_async_foreign_key_validation_enabled)

  is_database_async_index_creation_enabled = Feature.disabled?(:disallow_database_ddl_feature_flags, type: :ops) && Feature.enabled?(:database_async_index_creation, type: :ops)
  output('Gitlab::Database::AsyncIndexes', is_database_async_index_creation_enabled)

  is_partition_manager_sync_partitions_enabled = !(Feature.enabled?(:disallow_database_ddl_feature_flags, type: :ops) || Feature.disabled?(:partition_manager_sync_partitions, type: :ops))
  output('Gitlab::Database::Partitioning#sync_partitions', is_partition_manager_sync_partitions_enabled)
  output('Gitlab::Database::Partitioning#drop_detached_partitions', is_partition_manager_sync_partitions_enabled)
end

check
-
Check the output - All workers/tasks should be enabled, for example:
Database::BatchedBackgroundMigration::MainExecutionWorker is enabled.
Database::BatchedBackgroundMigration::CiExecutionWorker is enabled.
Database::BatchedBackgroundMigration::CiDatabaseWorker is enabled.
Database::BatchedBackgroundMigrationWorker is enabled.
Gitlab::Database::Reindexing is enabled.
BackgroundMigration::CiDatabaseWorker is enabled.
BackgroundMigrationWorker is enabled.
rake gitlab:db:execute_async_index_operations is enabled.
rake gitlab:db:validate_async_constraints is enabled.
Gitlab::Database::AsyncConstraints is enabled.
Gitlab::Database::AsyncIndexes is enabled.
Gitlab::Database::Partitioning#sync_partitions is enabled.
Gitlab::Database::Partitioning#drop_detached_partitions is enabled.
-
-
Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Patroni Overview dashboard
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg
- What changes to this metric should prompt a rollback: Any deviation from the normal state, especially Error Ratio and Saturation (CPU Utilization and Disk sustained write).
- Metric: Patroni CI Overview dashboard
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg
- What changes to this metric should prompt a rollback: Any deviation from the normal state, especially Error Ratio and Saturation (CPU Utilization and Disk sustained write).
Change Reviewer checklist
-
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
-
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary
Change Technician checklist
-
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity::1 or severity::2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.