[GPRD][Sec Decomp] 2025-04-19 05:00 UTC: Decompose GitLab.com's PostgreSQL Database into Main and Sec
Production Change
NOTE: This issue has been copied to: https://ops.gitlab.net/gitlab-com/gl-infra/production/-/issues/12 to ensure availability in the event of an unplanned outage of gitlab.com. Execution of the CR will take place via the ops CR.
Change Summary
The Sec DB Decomposition Working Group aims to move Sec tables to a separate database, in the same fashion as Decompose GitLab.com's database to improve scal... (gitlab-org&6168 - closed) did for CI tables (Related CR).
Approximately 25% of all writes are caused by Sec-related features. To scale GitLab's database capacity, we are decomposing the PostgreSQL main cluster into two clusters: a Sec cluster (`sec`) for high-write Sec-related features and a Main cluster (`main`) for all other features. By functionally decomposing the database, we increase GitLab's database capacity by roughly 2x.
Further details available in Rollout Epic and most recent status update.
Phases
For IMOC
Timing
Should this maintenance appear on our Status Page?
⚠️ If Yes, add the CMOC Required label to this issue ⚠️

- [ ] Yes
- [ ] No

Will the CMOC need to be actively engaged during the maintenance window?

- [ ] Yes
- [ ] No

Will it require downtime?

- [ ] Yes
- [ ] No
Change Details
- Services Impacted - Service::Patroni, Service::PatroniSec, Service::Web, Service::Sidekiq, Service::API
- Change Technician - @jjsisson
- Change Reviewers:
  - @rhenchen.gitlab
  - @bprescott\_
  - @zbraddock
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-04-19 05:00 UTC
- Time tracking - 120 minutes
- Downtime Component - NONE
Staffing
| Role | Assigned To |
|---|---|
| | @theoretick |
| | @jjsisson |
| | TBD |
| | @jay_mccure |
| | Can be from PD schedule |
| | Can be from PD schedule |
| | Can be from PD schedule |
| | @ghavenga |
| | @ghavenga |
Communications Plan
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
Detailed steps for the change
Note: This CR will be copied to ops.gitlab.net, where it will be utilized in the event of an unexpected downtime for gitlab.com. Link to gitlab.com CR: TBD
Collaboration
During the change window, the rollout team will collaborate using the following communications channels:
| App | Direct Link |
|---|---|
| Slack | #g_database_operations |
| Video Call | TBD |
Immediately
Perform these steps when the issue is created.
- [ ] 🐺 Coordinator: Fill out the names of the rollout team in the table above.
Support Options
| Provider | Plan | Details | Create Ticket |
|---|---|---|---|
| Google Cloud Platform | Gold Support | 24x7, email & phone, 1hr response on critical issues | Create GCP Support Ticket |
Entry points
| Entry point | Before | Blocking mechanism | Allowlist | QA needs | Notes |
|---|---|---|---|---|---|
| Pages | Available via *.gitlab.io, and various custom domains | Unavailable if GitLab.com goes down for a brief time. There is a cache but it will expire in gitlab_cache_expiry minutes | N/A | N/A | |
Database hosts
Accessing the rails and database consoles
- rails: `ssh $USER-rails@console-01-sv-gprd.c.gitlab-production.internal`
- main db replica: `ssh $USER-db@console-01-sv-gprd.c.gitlab-production.internal`
- main db primary: `ssh $USER-db-primary@console-01-sv-gprd.c.gitlab-production.internal`
- main db psql: `ssh -t patroni-main-v16-103-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
- sec db replica: `ssh $USER-db-sec@console-01-sv-gprd.c.gitlab-production.internal`
- sec db primary: `ssh $USER-db-sec-primary@console-01-sv-gprd.c.gitlab-production.internal`
- sec db psql: `ssh -t patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-psql`
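Before the window it can help to confirm the current topology of both clusters from the console node, a quick sketch using `gitlab-patronictl` (the same command this CR uses for verification later):

```bash
# Prints the members and roles of each cluster; prior to the switchover the
# sec cluster is expected to show a Standby Leader.
ssh patroni-main-v16-103-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
```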
Dashboards and debugging
These dashboards might be useful during the rollout: postgresql: Database Decomposition using logical
Production
- PostgreSQL replication overview
- Triage overview
- Sidekiq overview
- Sentry - includes application errors
- Logs (Kibana)
Destination db: sec
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Source db: main
- monitoring_pgbouncer_gitlab_user_conns
- monitoring_chef_client_enabled
- monitoring_chef_client_last_run
- monitoring_chef_client_error
- monitoring_snapshot_last_run
- monitoring_user_tables_writes
- monitoring_user_tables_reads
- monitoring_gitlab_maintenance_mode
Repos used during the rollout
The following Ansible playbooks are referenced throughout this issue:
- Postgres Physical-to-Logical Replication, Decomposition, and Rollback: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
High level overview
This gives a high-level overview of the procedure.
Decomposition Flowchart

```mermaid
flowchart TB
    subgraph Prepare new environment
    A[Create new cluster sec as a carbon copy of main] --> B
    B[Attach sec as a standby-only-cluster to main via physical replication] --> C
    end
    C[Make sure both clusters are in sync] --> D1
    subgraph BreakPhysical["Break Physical Replication: ansible-playbook physical-to-logical.yml"]
    D1[Disable Chef] --> D2
    D2[Perform clean shutdown of sec] --> D3
    D3["On main, create a replication slot and publication FOR ALL main TABLES; remember its LSN"] --> D4
    D4[Configure recovery_target_lsn on sec] --> D5
    D5[Start sec] --> D6
    D6[Let sec reach the slot's LSN, still using physical replication] --> D7
    D7[Once slot's LSN is reached, promote sec leader] --> D9
    D9[Create logical subscription with copy_data=false] --> D10
    D10[Let sec catch up using logical replication] --> H
    end
    subgraph Redirect RO to sec
    H[Redirect RO only to sec] --> R
    R[Check if cluster is operational and metrics are normal] --"Normal"--> S
    R --"Abnormal"--> GR
    S[DBRE verify E2E tests run as expected with DevEx help] --"Normal"--> T
    S --"Abnormal"--> GR
    end
    T["Switchover: Redirect RW traffic to sec"] --> U1
    subgraph Post Switchover Verification
    U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
    U1 --"Abnormal"--> LR
    U2[Enable Chef, run Chef-Client] --"Normal"--> U3
    U2 --"Abnormal"--> LR
    U3[Check if cluster is operational and metrics are normal] --"Normal"--> Success
    U3 --"Abnormal"--> LR
    Success[Success!]
    end
    subgraph GR[Graceful Rollback - no data loss]
    GR1[Start graceful rollback]
    end
    subgraph LR[Fix forward]
    LR1[Fix all issues] --> LR2
    LR2[Return to last failed step]
    end
```
Playbook source: https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical
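For orientation, the core of the conversion (flowchart steps D3 and D9) boils down to a few statements. This is an illustrative sketch only; the authoritative steps live in the playbook above, and the publication/subscription names, the `MAIN_PRIMARY` host placeholder, and the `gitlabhq_production` database name are assumptions:

```bash
# On the main (source) primary: create the publication and the logical slot,
# noting the returned LSN (flowchart step D3).
sudo gitlab-psql -c "CREATE PUBLICATION decomposition_pub FOR ALL TABLES"
sudo gitlab-psql -c "SELECT lsn FROM pg_create_logical_replication_slot('logical_replication_slot_1', 'pgoutput')"

# On the sec (target) primary, after it reaches that LSN and is promoted
# (steps D4-D7): attach to the existing slot without copying data (step D9).
sudo gitlab-psql -c "CREATE SUBSCRIPTION decomposition_sub CONNECTION 'host=MAIN_PRIMARY dbname=gitlabhq_production' PUBLICATION decomposition_pub WITH (copy_data = false, create_slot = false, slot_name = 'logical_replication_slot_1')"
```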
Prep Tasks
- [-] SWITCHOVER minus 1 week (2025-04-12 13:00 UTC)
  - [ ] ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery.
    - Message:
      > Hi @release-managers :waves:, We will be undergoing scheduled maintenance to our MAIN and SEC database layers in `PRODUCTION`. The operational lock and PCL will start at 2025-04-19 05:00 UTC and should finish at 2025-04-19 17:00 UTC (including the performance regression observability period). We would like to confirm that deployments that affect the MAIN and SEC database clusters will need to be stopped during the window. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
- [ ] SWITCHOVER minus 3 days (2025-04-16 13:00 UTC)
  - [ ] PRODUCTION ONLY ☎️ Comms-Handler: Share the message from #whats-happening-at-gitlab to the following channels:
    - #infrastructure-lounge (cc @sre-oncall)
    - #g_delivery (cc @release-managers)
    - #support_gitlab-com (inform the Support SaaS team)
    - Share with the team a link to the change request regarding the maintenance
  - [ ] 🏆 DevEx On-Call: Check that you have Maintainer or Owner permission in https://ops.gitlab.net/gitlab-org/quality to be able to trigger the Smoke QA pipeline in schedules (Staging, Production). Reach out to Test Platform to get access if you don't have permission to trigger scheduled pipelines in the linked projects.
- [ ] PCL Start time (2025-04-19 05:00 UTC) - DECOMPOSITION minus 4 hours
  - [ ] 🔪 Playbook-Runner: Ensure the CR is reviewed by the 🚑 EOC
  - [ ] ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery on the operational lock of the MAIN and SEC databases:
    > Hi @release-managers :waves:, As scheduled we started the Deployment Hard PCL and enabled the DDL-block feature flag for the Decomposition in the MAIN and SEC databases in the GPRD environment, until 2025-04-19 17:00 UTC. If there's any incident with potential necessity to revert/apply db-migrations, please reach out to @dbo members during the weekend as they are on-call and will evaluate if there will be impact on the upgrade or not. All details can be found in the CR - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
  - [ ] ☎️ Comms-Handler: Inform the database team at #g_database_operations and #g_database_frameworks:
    > Hi @dbo and @db-team, Please note that we started the operational block for the `MAIN` and `SEC` database decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the GPRD environment. We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-19 17:00 UTC. Thanks!
  - [ ] 🔪 Playbook-Runner: Disable the DDL-related feature flags:
    - Disable feature flags by typing the following into #production:
      - `/chatops run feature set disallow_database_ddl_feature_flags true`
  - [ ] 🏆 DevEx On-Call: Confirm that QA tests are passing as a pre-decomp sanity check
    - [ ] 🏆 DevEx On-Call: Confirm that smoke QA tests are passing on the current cluster by checking the latest status for Smoke type tests in the Production and Canary Allure reports listed in the QA pipelines.
    - [ ] 🏆 DevEx On-Call: Trigger the Smoke E2E suite against the environment that was decomposed: Production: Four hourly smoke tests. This has an estimated duration of 15 minutes.
    - [ ] 🏆 DevEx On-Call: If the smoke tests fail, re-run the failed job to see if the failure is reproducible.
    - [ ] 🏆 DevEx On-Call: In parallel, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
Prepare the environment

- [ ] 🔪 Playbook-Runner: Check that all needed MRs are rebased and contain the proper changes:
  - [ ] Separate gitlab-sec DB connection for teleport-ro nodes: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5870
  - [ ] GPRD-CNY MR, to add sec configuration to gprd-cny: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4342 (merged)
  - [ ] GPRD-SIDEKIQ MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4343 (merged)
  - [ ] GPRD WEB MR, to move sec read-only over to sec-db-replica: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4344 (merged)
  - [ ] GPRD-BASE MR, to move sec read-only over to sec-db-replica: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5871
  - [ ] Make configuration changes for pgbouncer{,-sidekiq}-sec permanent: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5872
  - [ ] GPRD-PATRONI-SEC MR, to remove standby configuration: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5873
  - [ ] K8s MR to set databasetasks: true: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4345 (merged)
  - [ ] Chef MR to set databasetasks: true: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5874
- [ ] 🔪 Playbook-Runner: Get the console VM ready for action
  - [ ] SSH to the console VM in gprd: `ssh console-01-sv-gprd.c.gitlab-production.internal`
  - [ ] Configure the dbupgrade user:
    - [ ] Disable screen sharing to reduce the risk of exposing the private key
    - [ ] Change to user dbupgrade: `sudo su - dbupgrade`
    - [ ] Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
    - [ ] `chmod 600 ~/.ssh/id_dbupgrade`
    - [ ] Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
    - [ ] Repeat the same steps on the target leader (it also has to have the private key)
    - [ ] Re-enable screen sharing
  - [ ] Create an access_token with at least `read_repository` scope for the next step
  - [ ] Clone repos:
    ```
    rm -rf ~/src \
      && mkdir ~/src \
      && cd ~/src \
      && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
      && cd db-migration \
      && git checkout master
    ```
  - [ ] Ensure you have Ansible installed:
    ```
    python3 -m venv ansible
    source ansible/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install ansible
    python3 -m pip install jmespath
    ansible --version
    ```
  - [ ] Ensure that Ansible can talk to all the hosts in gprd-main and gprd-sec:
    ```
    cd ~/src/db-migration/pg-physical-to-logical
    ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-sec-decomp.yml all -m ping
    ```
    You shouldn't see any failed hosts! (Expected output is sketched below.)
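Each host should reply in the standard Ansible ping format, along these lines (host name illustrative):

```
patroni-sec-v16-01-db-gprd.c.gitlab-production.internal | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
```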
- [ ] Ensure that Ansible is run via the modern Ansible from the venv on the VM, not the old system Ansible. This must hold true for the entire CR. You can tell you are running the correct Ansible because `(ansible)` appears at the beginning of the line in your terminal, or check with:
  ```
  which ansible
  ```
- [ ] 🔪 Playbook-Runner: Add the following silences at https://alerts.gitlab.net to silence alerts in main and sec nodes until 4 hours after the switchover time:
  - Start time: `2025-04-19T13:00:00.000Z`
  - Duration: `4h`
  - Matchers:
    - main: `env="gprd"`, `fqdn=~"patroni-main-v16.*"`
    - sec: `env="gprd"`, `fqdn=~"patroni-sec-v16.*"`
- [ ] 🐺 Coordinator: Get a green light from the 🚑 EOC
SEC Decomposition Prep Work

- [ ] Prepare Environment
  - [ ] ☎️ Comms-Handler: Coordinate with @release-managers at #g_delivery:
    > Hi @release-managers :waves:, We would like to make sure that deployments have been stopped for our `MAIN` and `SEC` database in the `PRODUCTION` environment, until 2025-04-19 17:00 UTC. Be aware that we are deactivating certain feature flags during this time. All details can be found in the CR. Please be so kind and comment the acknowledgement on https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581. :bow:
  - [ ] ☎️ Comms-Handler: Inform the database team at #g_database_frameworks and #g_database_operations:
    > Hi @dbo and @db-team, Please note that we started the operational block for the `MAIN` and `SEC` clusters for SEC Decomposition, therefore we are blocking database model/structure modifications, by disabling the following tasks (`execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features and partition_manager_sync_partitions) in the `PRODUCTION` environment. We will re-enable DDLs once the CR is finished and the rollback window is closed at 2025-04-19 17:00 UTC. Thanks!
  - [ ] 🔪 Playbook-Runner: Disable the DDL-related feature flags:
    - Disable feature flags by typing the following into #production:
      - `/chatops run feature set disallow_database_ddl_feature_flags true`
Prechecks

- [ ] 🐺 Coordinator: Check that disallow_database_ddl_feature_flags is ENABLED:
  - On Slack: `/chatops run feature get disallow_database_ddl_feature_flags`
- [ ] 🔪 Playbook-Runner: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-main-v16 until the end of the maintenance:
  - Start time: `2025-04-19T09:00:00.000Z`
  - Duration: `56h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-main-v16"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🔪 Playbook-Runner: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-sec-v16 until the end of the maintenance:
  - Start time: `2025-04-19T09:00:00.000Z`
  - Duration: `56h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-sec-v16"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]
- [ ] 🔪 Playbook-Runner: Disable chef on the main db cluster, the sec db cluster, and the sec pgbouncers:
  ```
  knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
  knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
  knife ssh "role:gprd*pgbouncer*sec*" "sudo /usr/local/bin/chef-client-disable 'GPRD Sec Decomp CR 19581'"
  ```
- [ ] 🔪 Playbook-Runner: Check if anyone except the application is connected to the source primary and interrupt them:
  - Confirm the source primary (note this will only run on 101, currently):
    ```
    knife ssh "role:gprd-base-db-patroni-main-v16" "sudo gitlab-patronictl list"
    ```
  - Log in to the source primary: `ssh patroni-main-v16-103-db-gprd.c.gitlab-production.internal`
  - Check all connections that are not `gitlab`:
    ```
    gitlab-psql -c "
    select pid, client_addr, usename, application_name, backend_type,
           clock_timestamp() - backend_start as connected_ago, state,
           left(query, 200) as query
    from pg_stat_activity
    where pid <> pg_backend_pid()
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
    "
    ```
  - If there are sessions that could potentially perform writes, spend up to 10 minutes attempting to find the actors and ask them to stop.
  - Finally, terminate all remaining sessions that are not coming from application/infra components and could potentially cause writes:
    ```
    gitlab-psql -c "
    select pg_terminate_backend(pid)
    from pg_stat_activity
    where pid <> pg_backend_pid()
      and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
      and usename not in ('gitlab', 'gitlab-registry', 'pgbouncer', 'postgres_exporter', 'gitlab-consul')
      and application_name <> 'Patroni'
    "
    ```
- [ ] 🔪 Playbook-Runner: Run the physical_prechecks playbook:
  ```
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-sec-decomp.yml physical_prechecks.yml 2>&1 \
    | ts | tee -a ansible_physical-to-logical_gprd_sec_$(date +%Y%m%d).log
  ```
- [ ] 🔪 Playbook-Runner: Check that pgpass and .pgpass are the same on both the source and target cluster primaries (a diff-based check is sketched below):
  ```
  ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass"
  ssh patroni-main-v16-03-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass"
  ```
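Rather than eyeballing the two outputs, you can diff them directly (a sketch composed from the same two commands above):

```bash
# Exit status 0 and "pgpass files match" means the files are identical on both primaries.
diff \
  <(ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass") \
  <(ssh patroni-main-v16-03-db-gprd.c.gitlab-production.internal "sudo cat /var/opt/gitlab/postgresql/.pgpass /var/opt/gitlab/postgresql/pgpass") \
  && echo "pgpass files match"
```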
- [ ] 🔪 Playbook-Runner: Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```
  knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul prior to switchover!
  ```
Break physical replication and configure logical replication

- [ ] Convert Physical Replication to Logical
  - [ ] 🔪 Playbook-Runner: Run the physical_to_logical playbook:
    ```
    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml physical_to_logical.yml 2>&1 \
      | ts | tee -a ansible_physical-to-logical_gprd_sec_$(date +%Y%m%d).log
    ```
  - [ ] 🔪 Playbook-Runner: Verify the sec cluster is no longer a Standby Leader:
    ```
    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    ```
  - [ ] 🔪 Playbook-Runner: Remove the standby_cluster configuration for sec in chef:
    - Merge the chef MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5873
    - Verify the chef MR pipeline completes on ops: https://ops.gitlab.net/gitlab-com/gl-infra/chef-repo/-/pipelines
    - Enable and run chef-client on the patroni-sec leader node:
      ```
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo chef-client-enable"
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo chef-client"
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
      # the last command should return no values! Stop if `standby_cluster` is in the output!
      ```
    - Enable and run chef-client on the remaining patroni-sec nodes:
      ```
      knife ssh "role:gprd*patroni*sec*" "sudo chef-client-enable"
      knife ssh "role:gprd*patroni*sec*" "sudo chef-client"
      knife ssh "role:gprd*patroni*sec*" "sudo grep -A2 standby_cluster /var/opt/gitlab/patroni/patroni.yml"
      # the last command should return no values! Stop if `standby_cluster` is in the output!
      ```
  - [ ] 🔪 Playbook-Runner: Verify the sec cluster is still healthy:
    ```
    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal "sudo gitlab-patronictl list"
    ```
  - [ ] 🔪 Playbook-Runner: Re-disable chef on patroni-sec:
    ```
    knife ssh "role:gprd*patroni*sec*" "sudo chef-client-disable 'SEC Decomp #19581'"
    ```
Read-Only Traffic Configs

- [ ] Read Only Traffic Switchover
  - [ ] Console Node Rollout
    - [ ] 🔪 Playbook-Runner: Switch over the gprd rails console (teleport) chef connection configuration to the new patroni-sec-v16 DB. Writes will go through the PGBouncer host to `main` and reads to `sec` replicas.
    - [ ] 🔪 Playbook-Runner: Simple checks that the application sees the proper configuration. Expected: the sec load balancer, and sec_replica for the read connection:
      ```
      [1] pry(main)> ApplicationRecord.load_balancer.name
      => :main
      [2] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.name
      => :sec
      [3] pry(main)> ApplicationRecord.connection.pool.db_config.name
      => "main"
      [4] pry(main)> Gitlab::Database::SecApplicationRecord.connection.pool.db_config.name
      => "sec"
      [5] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.pool.db_config.name }
      => "sec_replica"
      [6] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read_write { |connection| connection.pool.db_config.name }
      => "sec"
      ```
    - [ ] 🔪 Playbook-Runner: Simple checks that the application can still talk to the sec_replica database. Expected: db_config_name:sec_replica:
      ```
      [10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
      [11] pry(main)> Gitlab::Database::SecApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM vulnerability_user_mentions") }
        (20.3ms)  SELECT COUNT(*) FROM vulnerability_user_mentions /*application:console,db_config_name:main_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
      => #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>
      ```
- [ ] Web Node Canary Rollout
  - [ ] Verify connectivity, monitor pgbouncer connections
  - [ ] Observe `logs` and `prometheus` for errors

Observable Logs and Prometheus Metrics

All logs will split `db_*_count` metrics into separate buckets describing each used connection:

- Primary connection usage by state: `pg_stat_activity_count`, `pgbouncer_stats_queries_pooled_total`
- [ ] Sidekiq Node Rollout
  - [ ] 🔪 Playbook-Runner: Switch over the gprd sidekiq configuration to the new `pgbouncer-sec`
  - [ ] Verify connectivity, monitor pgbouncer connections
  - [ ] Observe `logs` and `prometheus` for errors

Observable Logs and Prometheus Metrics

All logs will split `db_*_count` metrics into separate buckets describing each used connection:

- Primary connection usage by state: `pg_stat_activity_count`, `pgbouncer_stats_queries_pooled_total`
- [ ] Web Node Rollout
  - [ ] 🔪 Playbook-Runner: Switch over the gprd web configuration to the new `pgbouncer-sec`
  - [ ] 🔪 Playbook-Runner: Verify connectivity, monitor pgbouncer connections
  - [ ] 🔪 Playbook-Runner: Observe `logs` and `prometheus` for errors
  - [ ] 🔪 Playbook-Runner: Cleanup: Remove overrides in each configuration node and promote the chef database connection configuration to gprd-base:
    - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5871
    - Run chef-client on the console node
- [ ] Revert MR for the GPRD-CNY configuration
  - [ ] 🔪 Playbook-Runner: Revert the MR for the GPRD-CNY configuration so it uses the global config
    - MR for k8s-workload: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4342 (merged)
Observable Logs and Prometheus Metrics

Observable logs: all logs will split `db_*_count` metrics into separate buckets describing each used connection.

Observable prometheus metrics:

- Primary connection usage by state: `pg_stat_activity_count`, `pgbouncer_stats_queries_pooled_total`
- [ ] Verify Read Traffic to patroni-sec
  - [ ] 🔪 Playbook-Runner: Ensure traffic is now being seen at [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]
Phase 7 – execute!

- [ ] Phase 7 - switchover
  - [ ] 🔪 Playbook-Runner: Schedule a job to enable gitlab_maintenance_mode in the node exporter during the upgrade window:
    - SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Schedule the jobs:
      ```
      sudo su -
      # Pipe the write commands into `at` so the flag flips at the window
      # boundaries (a bare redirect would set the flag immediately instead).
      echo "printf '# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 1\n' > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom" | at -t 202504191300
      echo "printf '# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n' > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom" | at -t 202504191700
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] PRODUCTION ONLY ☁️ 🔪 Playbook-Runner: Create a maintenance window in PagerDuty with the following:
  - Which services are affected?
  - Why is this maintenance happening? "Decomposing sec data from patroni-main-v16 to patroni-sec-v16 so silencing the pager."
  - Select "Start at a scheduled time":
    - Timezone: (UTC+00:00) UTC
    - Start: 04/19/2025 | 01:00 PM
    - End: 04/19/2025 | 05:00 PM
- [ ] 🔪 Playbook-Runner: Run the Ansible playbook for Database Decomposition for the gprd-sec cluster:
  ```
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-sec-decomp.yml switchover.yml 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_sec_$(date +%Y%m%d).log
  ```
  Midway through, the playbook will ask "Are you sure you want to continue resuming on pgbouncer?" This is the time to verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec as below, before saying 'Yes' to resuming pgbouncers.
- [ ] 🔪 Playbook-Runner: Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec:
  ```
  knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- [ ] 🔪 Playbook-Runner: Edit the /var/opt/gitlab/gitlab-rails/etc/database.yml file on the console node to set database_tasks: true for the sec cluster
- [ ] 🔪 Playbook-Runner: Block writes to main-cluster tables in the sec cluster and sec-cluster tables in the main cluster by running this on the console node (a trigger spot check is sketched below):
  - single-threaded:
    ```
    gitlab-rake gitlab:db:lock_writes
    ```
  - multi-threaded:
    ```
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=false rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=true rake gitlab::database::lock_tables
    ```
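As an optional spot check, confirm that write-lock triggers now exist on a sec-owned table in the main cluster. This is a sketch: the `~ 'write'` trigger-name pattern is an assumption, and `vulnerability_user_mentions` is reused from the pry check earlier in this CR:

```bash
# A non-empty result means the lock_writes trigger is in place on that table.
ssh patroni-main-v16-103-db-gprd.c.gitlab-production.internal \
  "sudo gitlab-psql -c \"select tgname from pg_trigger where tgrelid = 'vulnerability_user_mentions'::regclass and tgname ~ 'write'\""
```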
- [ ] 🔪 Playbook-Runner: Verify reverse logical replication lag is low on the patroni-sec leader (a polling loop is sketched below):
  - `ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal`
  - `sudo gitlab-psql`
    ```
    select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    from pg_replication_slots
    where slot_name like 'logical_replication_slot%'
    order by 1 desc limit 1;
    ```
    The result is the lag in bytes; it should stay close to zero.
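To keep an eye on the lag from the console VM, a polling loop over the same query (a sketch; adjust the interval as needed):

```bash
# Prints the reverse-replication lag in bytes every 5 seconds.
while true; do
  ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal \
    "sudo gitlab-psql -Xtc \"select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1\""
  sleep 5
done
```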
Persist Correct configurations

- [ ] 🔪 Playbook-Runner: Merge the MR that reconfigures patroni/pgbouncer in Chef for patroni-sec-v16. First confirm there are no errors in the merge pipeline. If the MR was already merged ahead of this step, revert it and get it merged properly now.
  - MR for pgbouncer-sec: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5872
- [ ] 🔪 Playbook-Runner: Run chef-client on one pgbouncer host and verify the configuration was not changed (changes require a reload to migrate traffic, so check nothing changed; if needed, revert the MR and update to resolve):
  ```
  knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni-sec.service.consul after switchover!
  ```
- [ ] 🔪 Playbook-Runner: Check WRITES going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]
- [ ] 🔪 Playbook-Runner: Check READS going to the TARGET cluster, patroni-sec-v16: [monitoring_user_tables_reads][monitoring_user_tables_reads]
- [ ] 🔪 Playbook-Runner: Re-enable Chef on all nodes:
  ```
  knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client-enable"
  knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client-enable"
  knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client-enable"
  ```
- [ ] 🔪 Playbook-Runner: Confirm chef-client is ENABLED on all nodes: [monitoring_chef_client_enabled][monitoring_chef_client_enabled]
- [ ] Enable databaseTasks for k8s workloads
  - [ ] 🔪 Playbook-Runner: Merge the MR that enables db_database_tasks for k8s nodes
    - MR for k8s-workloads: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4345 (merged)
- [ ] Enable databaseTasks for deploy nodes
  - [ ] 🔪 Playbook-Runner: Merge the MR that enables db_database_tasks for deploy nodes (a verification sketch follows this list)
    - MR for chef-repo: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/5874
    - Run chef-client on a deploy node and check it worked (database_tasks should no longer be false in /var/opt/gitlab/gitlab-rails/etc/database.yml)
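A minimal verification sketch for the deploy node, assuming the same database.yml path used on the console node above:

```bash
# Each database stanza should show database_tasks: true (or omit the key entirely).
sudo grep -n "database_tasks" /var/opt/gitlab/gitlab-rails/etc/database.yml
```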
Post Switchover QA Tests

- [ ] Post Switchover QA Testing
  - [ ] Start Post Switchover QA
  - [ ] 🏆 DevEx On-Call: Run the full E2E suite against the environment that was decomposed: Production: Full run - manual
    - It will take 1+ hour to run these tests, so you can continue with the Wrapping Up of the upgrade and check the test results later.
Communication

- [ ] PRODUCTION ONLY 📣 CMOC: Post an update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Message:
    > Gitlab.com SEC database decomposition was performed. We'll continue to monitor for any performance issues until the end of the maintenance window. Thank you for your patience. See <https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581>
  - [ ] Check that @gitlab retweeted from @gitlabstatus
- [ ] PRODUCTION ONLY ☎️ Comms-Handler: In the same thread as the earlier post, post the following message and click the checkbox "Also send to X channel" so the threaded message is published to the channel:
  - Message:
    > :done: *GitLab.com database layer maintenance decomposition is complete now.* :celebrate: We'll continue to monitor the platform to ensure all systems are functioning correctly.
  - Channels:
    - #whats-happening-at-gitlab
    - #infrastructure-lounge (cc @sre-oncall)
Wrapping Up

- [ ] PRODUCTION ONLY 🔪 Playbook-Runner: If the scheduled maintenance is still active in PagerDuty, click Update, then End Now.
- [ ] 🔪 Playbook-Runner: Remove the silences on `fqdn=~"patroni-main-v16.*"` and `fqdn=~"patroni-sec-v16.*"` that we created during this process from https://alerts.gitlab.net
- [ ] 🔪 Playbook-Runner: Create the wal-g daily restore schedule for the [gprd] - [sec] cluster at https://ops.gitlab.net/gitlab-com/gl-infra/data-access/durability/gitlab-restore/postgres-gprd/-/pipeline_schedules
  - Change the following variables:
    - PSQL_VERSION = 16
    - BACKUP_PATH = ? (? = use the "directory" from the new v16 GCS backup location at: https://console.cloud.google.com/storage/browser/gitlab-gprd-postgres-backup/pitr-walg-sec-v16)
- [ ] 🐺 Coordinator: Check that gitlab_maintenance_mode is DISABLED for gprd: [monitoring_gitlab_maintenance_mode][monitoring_gitlab_maintenance_mode]
  - If it is not disabled, ask the 🔪 Playbook-Runner to manually disable it:
    - SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set gitlab_maintenance_mode=0 on the node exporter:
      ```
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] 🏆 DevEx On-Call (after an hour): Check that the Smoke run (as executed via the MR enabling db_database_tasks for k8s nodes) and the Full run - manual have passed. If there are failures, reach out to the on-call Test Platform DRI for help with the investigation. If there is no available on-call DRI, reach out to #test-platform and escalate with the management team.
  - [ ] 🏆 DevEx On-Call: If the Smoke or Full E2E tests fail, DevEx performs an initial triage of the failure. If DevEx cannot determine that the failure is 'unrelated', the team decides on declaring an incident and following the incident process.
Close Rollback Window

- [ ] SWITCHOVER plus 4 hours - Close PCL (2025-04-19 17:00 UTC)
  - [ ] 🔪 Playbook-Runner: Run the Ansible playbook to stop the reverse logical replication:
    ```
    cd ~/src/db-migration/pg-physical-to-logical
    ansible-playbook \
      -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
      -i inventory/gprd-sec-decomp.yml \
      stop_reverse_replication.yml 2>&1 \
      | ts | tee -a stop_reverse_replication_gprd_sec_$(date +%Y%m%d).log
    ```
  - [ ] 🔪 Playbook-Runner: On the SOURCE cluster patroni-main-v16 Leader/Writer, drop the subscription (if still existing) for logical replication (a drop sketch follows this item):
    - Check if the subscription still exists:
      ```
      gitlab-psql \
        -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
      ```
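If it does still exist, a drop sketch (the actual subscription name comes from the query above; `decomposition_sub` here is purely illustrative):

```bash
# Run on the patroni-main-v16 leader. DROP SUBSCRIPTION also removes its
# remote slot when it can still connect to the publisher.
gitlab-psql -Xc "DROP SUBSCRIPTION IF EXISTS decomposition_sub"
```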
  - [ ] 🔪 Playbook-Runner: On the TARGET cluster patroni-sec-v16 Leader/Writer, drop the publication and logical_replication_slot for reverse replication (a drop sketch follows this item):
    - Check if the publication and replication slots still exist:
      ```
      gitlab-psql \
        -Xc "select pubname from pg_publication" \
        -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
      ```
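Likewise, if the publication or slot still exists on the sec leader, a drop sketch (the publication name is illustrative; the slot-name pattern is reused from the lag check earlier in this CR):

```bash
gitlab-psql -Xc "DROP PUBLICATION IF EXISTS decomposition_pub"
# Only drops inactive slots matching the reverse-replication naming pattern.
gitlab-psql -Xc "SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name LIKE 'logical_replication_slot%' AND NOT active"
```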
  - [ ] 🔪 Playbook-Runner: Enable feature flags by typing the following into #production:
    - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags false`
  - [ ] 🐺 Coordinator: Check that the underlying DDL lock FF is DISABLED:
    - On Slack, `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED
  - [ ] ☎️ Comms-Handler: Inform the database team that the CR is completed at #g_database_operations and #g_database_frameworks:
    > Hi @dbo and @db-team, We are reaching out to inform that we have completed the work for the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in the `gprd` environment. Thanks!
  - [ ] PRODUCTION ONLY 📣 CMOC: End the maintenance from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
    - Click "Finish Maintenance" and send the following:
      - Message:
        > GitLab.com scheduled maintenance for the MAIN and SEC database layers is complete. We'll continue to monitor the platform to ensure all systems are functioning correctly. Thank you for your patience.
    - [ ] Check that @gitlab retweeted from @gitlabstatus
  - [ ] ☎️ Comms-Handler: Inform @release-managers at #g_delivery about the end of the operational lock:
    > Hi @release-managers :waves:, We are reaching out to inform that we have completed the work for the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR in our `gprd` SaaS environment. We are closing the operational block for deployments in the `MAIN` and `SEC` database, so regular deployment operations can be fully resumed.
  - [ ] 🔪 Playbook-Runner: Open a separate issue to create/rebuild the SEC DR Archive and Delayed replicas. It will be completed in the next couple of working days.
  - [ ] 🐺 Coordinator: Mark the change request with `/label ~"change::complete"`
Rollback

Estimated Time to Complete (mins) - 120

- [ ] Rollback (if required)
  - [ ] PRODUCTION ONLY 📣 CMOC: Post an update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
    - Message:
      > Due to an issue during the planned maintenance for the database layer, we have initiated a rollback of the MAIN and SEC database layers and some performance impact still might be expected. We will provide an update once the rollback process is completed.
  - [ ] PRODUCTION ONLY ☎️ Comms-Handler: Ask the IMOC or the Head Honcho if this message should be sent to any slack rooms:
    - #whats-happening-at-gitlab
    - #infrastructure-lounge (cc @sre-oncall)
    - #g_delivery (cc @release-managers)
  - There will be no rollback after closing the rollback window!
  - [ ] 🔪 Playbook-Runner: Monitor which pgbouncer pool has connections: [monitoring_pgbouncer_gitlab_user_conns][monitoring_pgbouncer_gitlab_user_conns]
ROLLBACK – execute!

Goal: Set gprd-main cluster as Primary cluster

- [ ] 🔪 Playbook-Runner: Verify reverse logical replication lag is low on the patroni-sec leader. This must be done using commands run on the database, not the graph. This must be done by a human. This must be done even if you have previously checked replication lag:
  ```
  ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal
  sudo gitlab-psql
  select pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) from pg_replication_slots where slot_name like 'logical_replication_slot%' order by 1 desc limit 1;
  ```
- [ ] 🔪 Playbook-Runner: Execute the switchover_rollback.yml playbook to roll back to the MAIN cluster:
  ```
  cd ~/src/db-migration/pg-physical-to-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-sec-decomp.yml \
    switchover_rollback.yml 2>&1 \
    | ts | tee -a ansible_switchover_rollback_gprd_sec_$(date +%Y%m%d).log
  ```
  Midway through, the playbook will ask "Are you sure you want to continue resuming on pgbouncer?" This is the time to verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec as below, before saying 'Yes' to resuming pgbouncers.
- [ ] 🔪 Playbook-Runner: Verify the configuration of pgbouncer-sec and pgbouncer-sidekiq-sec after rollback:
  ```
  knife ssh "role:gprd*pgbouncer*sec*" "sudo grep master.patroni /var/opt/gitlab/pgbouncer/databases.ini"
  # should return master.patroni.service.consul after rollback!
  ```
- [ ] 🔪 Playbook-Runner: Unlock writes to main-cluster tables in the sec cluster and sec-cluster tables in the main cluster by running this on the console node:
  - single-threaded:
    ```
    gitlab-rake gitlab:db:unlock_writes
    ```
  - multi-threaded:
    ```
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=false rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=sec INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=main INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
    SCOPE_TO_DATABASE=ci INCLUDE_PARTITIONS=true rake gitlab::database::unlock_tables
    ```
- [ ] 🔪 Playbook-Runner: Check WRITES going to the SOURCE cluster, patroni-main-v16: [monitoring_user_tables_writes][monitoring_user_tables_writes]
- [ ] 🔪 Playbook-Runner: Check READS going to the SOURCE cluster, patroni-main-v16: [monitoring_user_tables_reads][monitoring_user_tables_reads]
- [ ] 🔪 Playbook-Runner: On the TARGET cluster patroni-main-v16 Leader/Writer, drop the subscription (if still existing) for logical replication:
  - Check if the subscription still exists:
    ```
    gitlab-psql \
      -Xc "select subname, subenabled, subconninfo, subslotname, subpublications from pg_subscription"
    ```
- [ ] 🔪 Playbook-Runner: On the SOURCE cluster patroni-sec-v16 Leader/Writer, drop the publication and logical_replication_slot for reverse replication:
  - Check if the publication and replication slots still exist:
    ```
    gitlab-psql \
      -Xc "select pubname from pg_publication" \
      -Xc "select slot_name, plugin, slot_type, active from pg_replication_slots"
    ```
- [ ] Complete the rollback
  - [ ] 🏆 DevEx On-Call: Confirm that our smoke tests are still passing (continue the rollback, as this might take an hour...)
  - [ ] 🔪 Playbook-Runner: Revert all the applied MRs (the number of MRs varies depending on where the CR failed)
  - [ ] 🔪 Playbook-Runner: Re-enable Chef on all nodes:
    ```
    knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client-enable"
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client-enable"
    knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client-enable"
    ```
  - [ ] 🔪 Playbook-Runner: Confirm chef-client is ENABLED on all nodes: [monitoring_chef_client_enabled][monitoring_chef_client_enabled]
  - [ ] 🔪 Playbook-Runner: Run chef-client on the Patroni nodes:
    ```
    knife ssh "role:gprd-base-db-patroni-main-v16*" "sudo chef-client"
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo chef-client"
    knife ssh "role:gprd*pgbouncer*sec" "sudo chef-client"
    ```
  - [ ] 🔪 Playbook-Runner: Confirm no errors while running chef-client: [monitoring_chef_client_error][monitoring_chef_client_error]
  - [ ] 🔪 Playbook-Runner: Shut down the TARGET gprd-base-db-patroni-sec-v16 cluster to avoid any risk of split-brain (a verification sketch follows this list):
    ```
    knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo systemctl stop patroni"
    ```
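To confirm Patroni actually stopped on every sec node, a sketch using the same knife targeting as the stop command:

```bash
# "inactive" (or "failed") on every node means nothing is still running.
knife ssh "role:gprd-base-db-patroni-sec-v16" "sudo systemctl is-active patroni || true"
```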
- [ ] PRODUCTION ONLY 📣 CMOC: Post an update from the Status.io maintenance site, publish on @gitlabstatus. Workflow: https://about.gitlab.com/handbook/support/workflows/cmoc_workflows.html#sending-updates-about-maintenance-events
  - Click "Finish Maintenance" and send the following:
    - Message:
      > GitLab.com rollback for the database layer is complete, and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
- [ ] PRODUCTION ONLY ☎️ Comms-Handler: Send the following message to the slack rooms below:
  > GitLab.com rollback for the database layer is complete and we're back up and running. We'll be monitoring the platform to ensure all systems are functioning correctly. Thank you for your patience.
  - #whats-happening-at-gitlab
  - #infrastructure-lounge (cc @sre-oncall)
  - #g_delivery (cc @release-managers)
- [ ] 🔪 Playbook-Runner: Enable feature flags by typing the following into #production:
  - PRODUCTION: `/chatops run feature set disallow_database_ddl_feature_flags false`
- [ ] ☎️ Comms-Handler: Inform the database team that the CR has been aborted and rolled back at #g_database_operations and #g_database_frameworks:
  > Hi @dbo and @db-team, We are reaching out to inform that we have aborted and rolled back the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR. Therefore we have re-enabled the `execute_batched_migrations_on_schedule`, `execute_background_migrations`, reindexing, async_foreign_key, sync_index and partition_manager_sync_partitions features and tasks in the `gprd` environment. Thanks!
- [ ] ☎️ Comms-Handler: Inform @release-managers at #g_delivery about the end of the operational lock:
  > Hi @release-managers :waves:, We are reaching out to inform that we have aborted and rolled back the https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581 CR in our `gprd` SaaS environment. We are closing the operational block for deployments in the `MAIN` and `SEC` databases, so regular deployment operations can be fully resumed.
- [ ] 🔪 Playbook-Runner: Check that the underlying DDL lock FF is DISABLED:
  - On Slack, `/chatops run feature get disallow_database_ddl_feature_flags` should return DISABLED
- [ ] 🔪 Playbook-Runner: On two nodes, console and target leader, remove the private keys temporarily placed in `~dbupgrade/.ssh`:
  ```
  rm ~dbupgrade/.ssh/id_rsa
  rm ~dbupgrade/.ssh/id_dbupgrade
  ```
- [ ] 🔪 Playbook-Runner: ADD the following silence at https://alerts.gitlab.net to silence WALGBaseBackup alerts in patroni-sec-v16 for 2 weeks (14 days = 336 hours):
  - Start time: `2025-04-19T13:00:00.000Z`
  - Duration: `336h`
  - Matchers:
    - `env="gprd"`
    - `type="gprd-patroni-sec-v16"`
    - `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"`
- [ ] 🔪 Playbook-Runner: UPDATE the following silence at https://alerts.gitlab.net to silence alerts in v16 nodes for 2 weeks (14 days = 336 hours):
  - Start time: `2025-04-19T13:00:00.000Z`
  - Duration: `336h`
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-sec-v16.*"`
- [ ] 🔪 Playbook-Runner: DELETE the following silence at https://alerts.gitlab.net:
  - Matchers (PRODUCTION):
    - `env="gprd"`
    - `fqdn=~"patroni-main-v16.*"`
- [ ] 🐺 Coordinator: Check that gitlab_maintenance_mode is DISABLED for gprd: [monitoring_gitlab_maintenance_mode][monitoring_gitlab_maintenance_mode]
  - If it is not disabled, ask the 🔪 Playbook-Runner to manually disable it:
    - SSH to a console VM in gprd (e.g. `ssh console-01-sv-gprd.c.gitlab-production.internal`)
    - Set gitlab_maintenance_mode=0 on the node exporter:
      ```
      sudo su -
      echo -e "# HELP gitlab_maintenance_mode record maintenance window\n# TYPE gitlab_maintenance_mode untyped\ngitlab_maintenance_mode 0\n" > /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      cat /opt/prometheus/node_exporter/metrics/gitlab_maintenance_mode.prom
      atq
      ```
- [ ] 🐺 Coordinator: Mark the change request with `/label ~"change::aborted"`
Extra details

In case the Playbook-Runner is disconnected

As most of the steps are executed in a tmux session owned by the Playbook-Runner role, we need a safety net in case this person loses their internet connection or otherwise drops off halfway through. Since other SREs/DBREs also have root access on the console node where everything is running, they should be able to recover it in different ways. We tested the following approach to recovering the tmux session, updating the ssh agent, and taking over as a new ansible user:

- `ssh host`
- Add your public SSH key to `/home/PREVIOUS_PLAYBOOK_USERNAME/.ssh/authorized_keys`
- `sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581` so that we don't override the above
- `ssh -A PREVIOUS_PLAYBOOK_USERNAME@host`
- `echo $SSH_AUTH_SOCK`
- `tmux attach -t 0`
- `export SSH_AUTH_SOCK=<VALUE from previous SSH_AUTH_SOCK output>`
- `<ctrl-b> :` then `set-environment -g 'SSH_AUTH_SOCK' <VALUE from previous SSH_AUTH_SOCK output>`
- `export ANSIBLE_REMOTE_USER=NEW_PLAYBOOK_USERNAME`
- `<ctrl-b> :` then `set-environment -g 'ANSIBLE_REMOTE_USER' <your-user>`
Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
  - Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.