# Use new DB connection pool logic for production

Production Change - Criticality 3

| Change Component | Description |
| --- | --- |
| Change Objective | Use new DB connection pool logic for production |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, Service::Web, Service::API, Service::Git, Service::Postgres, Service::Pgbouncer |
| Change Team Members | @reprazent @cmiskell |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | Tested in staging #2449 (closed) |
| Dry-run output | N/A |
| Due Date | 2020-07-28 01:30 UTC (13:30 engineer time) |
| Time tracking | 1.5 hours |
## Detailed steps for the change

### Production
- Stop chef on the gprd sidekiq, web, api, and git nodes: `knife ssh 'roles:gprd-base-be-sidekiq OR roles:gprd-base-fe-web OR roles:gprd-base-fe-api OR roles:gprd-base-fe-git' "sudo systemctl stop chef-client"`
- Un-WIP and merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3945, verify the role diff in the `apply_to_staging` job, and execute the manual `apply_to_production` job
- Run chef on the gprd sidekiq nodes: `knife ssh -C1 'roles:gprd-base-be-sidekiq' "sudo chef-client"`
- Verify this has had the desired effect (see Monitoring)
- Run chef on the canary nodes first (so that the later full-fleet run is a no-op there), at reduced concurrency because of the small number of nodes:
  - gprd api canary nodes: `knife ssh -C1 'roles:gprd-base-fe-api-cny' "sudo chef-client"`
  - gprd git canary nodes: `knife ssh -C1 'roles:gprd-base-fe-git-cny' "sudo chef-client"`
  - gprd web canary nodes: `knife ssh -C2 'roles:gprd-base-fe-web-cny' "sudo chef-client"`
- Run chef on the main fleets in parallel (roughly 1/6th of each fleet at a time):
  - gprd api nodes: `knife ssh -C3 'roles:gprd-base-fe-api' "sudo chef-client"`
  - gprd git nodes: `knife ssh -C4 'roles:gprd-base-fe-git' "sudo chef-client"`
  - gprd web nodes: `knife ssh -C4 'roles:gprd-base-fe-web' "sudo chef-client"`
- Verify this has had the desired effect (see Monitoring)
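The staggered rollout above can be sketched as a small helper that derives each fleet's `knife ssh -C` concurrency from its node count. This is an illustrative sketch only: the fleet sizes below are assumptions chosen so the math reproduces the `-C3`/`-C4` values used in the steps, not the actual gprd node counts.

```python
import math

# Assumed node counts, for illustration only -- not the real gprd fleet sizes.
FLEETS = {
    "gprd-base-fe-api": 18,
    "gprd-base-fe-git": 24,
    "gprd-base-fe-web": 24,
}


def concurrency(node_count: int, fraction: int = 6) -> int:
    """Converge roughly 1/`fraction` of the fleet at a time, at least one node."""
    return max(1, math.ceil(node_count / fraction))


def knife_command(role: str, node_count: int) -> str:
    """Build the knife ssh invocation for one fleet."""
    return (
        f"knife ssh -C{concurrency(node_count)} "
        f"'roles:{role}' \"sudo chef-client\""
    )


for role, count in FLEETS.items():
    print(knife_command(role, count))
```

Capping the concurrency this way keeps most of each fleet serving traffic while chef restarts services on the remainder.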
## Rollback steps

Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3945 and apply it following the same steps as above.
## Monitoring

### Sidekiq
- Check the gprd sidekiq logs and ensure there are no ActiveRecord::ConnectionTimeoutError errors: https://log.gprd.gitlab.net/goto/0d6e478f5dc61f34591f8fb9282fe8fa
- Verify that jobs continue to be processed on the catchall shard: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&refresh=30s
  - There should be no-to-minimal change in the long-term behavior of all of the graphs there, with some leeway given for the fact that we're restarting sidekiq on the nodes.
- Verify that https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=sidekiq&var-stage=main shows reduced saturation, not increased saturation
### Web/API/Git

- Check that gitlab.com is still fundamentally functional
- Check that the rails logs show no ActiveRecord::ConnectionTimeoutError errors: https://log.gprd.gitlab.net/goto/d23fcf2b88fa62111db292ce9a628220
- Check on the graphs that saturation has been reduced, not increased:
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=api&var-stage=main
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=git&var-stage=main
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=web&var-stage=main
## Key metrics to observe

- Metric: Connection Pool usage
  - Location (sidekiq): https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=sidekiq&var-stage=main&from=now-1h&to=now
  - Location (api): https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=api&var-stage=main&from=now-1h&to=now
  - Location (git): https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=git&var-stage=main&from=now-1h&to=now
  - Location (web): https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=web&var-stage=main&from=now-1h&to=now
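The four dashboard links above differ only in the `var-type` query parameter, so they can be generated rather than copied by hand. A minimal sketch, using the base URL and parameters taken verbatim from the links in this issue:

```python
from urllib.parse import urlencode

# Base dashboard path copied from the monitoring links above.
BASE = (
    "https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/"
    "alerts-rails_db_connection_pool-saturation-detail"
)


def dashboard_url(service_type: str) -> str:
    """Build the connection-pool saturation dashboard URL for one service type."""
    params = {
        "orgId": 1,
        "var-PROMETHEUS_DS": "Global",
        "var-environment": "gprd",
        "var-type": service_type,
        "var-stage": "main",
        "from": "now-1h",
        "to": "now",
    }
    return f"{BASE}?{urlencode(params)}"


for svc in ("sidekiq", "api", "git", "web"):
    print(dashboard_url(svc))
```

This is purely a convenience for generating the links to check; the verification itself is still done by eye on the dashboards.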
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no active incidents