Use new DB connection pool logic for staging
C3
Production Change - Criticality 3

Change Component | Description |
---|---|
Change Objective | Use new DB connection pool logic for staging |
Change Type | ConfigurationChange |
Services Impacted | ServiceSidekiq ServiceWeb ServiceAPI ServiceGit ServicePostgres ServicePgbouncer |
Change Team Members | @reprazent @cmiskell |
Change Criticality | C3 |
Change Reviewer or tested in staging | This is the staging test |
Dry-run output | N/A |
Due Date | 2020-07-27 01:30 UTC (13:30 engineer time) |
Time tracking | 1 hour |
Detailed steps for the change
Staging
- Stop chef on the gstg sidekiq, web, api, and git nodes: `knife ssh 'roles:gstg-base-be-sidekiq OR roles:gstg-base-fe-web OR roles:gstg-base-fe-api OR roles:gstg-base-fe-git' "sudo systemctl stop chef-client"`
- Un-WIP and merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3938, then wait for apply_to_staging to complete.
- Run chef on the gstg sidekiq nodes: `knife ssh 'roles:gstg-base-be-sidekiq' "sudo chef-client"`
  - There are only two nodes (one is the catchall), so this can just run immediately.
- Verify this has had the desired effect (see Monitoring).
- In parallel:
  - Run chef on the gstg api nodes: `knife ssh -C1 'roles:gstg-base-fe-api' "sudo chef-client"`
  - Run chef on the gstg git nodes: `knife ssh -C1 'roles:gstg-base-fe-git' "sudo chef-client"`
  - Run chef on the gstg web nodes: `knife ssh -C1 'roles:gstg-base-fe-web' "sudo chef-client"`
- Verify this has had the desired effect (see Monitoring).
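The "In parallel" step above could be driven from a single shell session, roughly as sketched below. This is only an illustration, not part of the agreed procedure: the function names `chef_cmd` and `converge_in_parallel` and the `DRY_RUN` variable are hypothetical, and the sketch defaults to printing the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.

# Print the knife command that runs chef on one role, one node at a time (-C1).
chef_cmd() {
  printf "knife ssh -C1 'roles:%s' \"sudo chef-client\"\n" "$1"
}

# Start one knife run per role in the background, then wait for all of them.
converge_in_parallel() {
  local role pid status=0
  local pids=()
  for role in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      chef_cmd "$role"
    else
      knife ssh -C1 "roles:${role}" "sudo chef-client" &
      pids+=("$!")
    fi
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1   # remember any role whose run failed
  done
  return "$status"
}

converge_in_parallel gstg-base-fe-api gstg-base-fe-git gstg-base-fe-web
```

Setting `DRY_RUN=0` would launch the three runs concurrently, while `-C1` still limits each role to one node at a time.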
Rollback steps
Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3938, apply as above.
Monitoring
Sidekiq
- Check the gstg sidekiq logs, ensure jobs are still processing on the catchall and urgent-cpu-bound nodes: https://nonprod-log.gitlab.net/goto/dbfe3b19bcdfd713021a246d78a7df1a
  - Hard to see this on graphs in gstg due to low volume.
- Verify on https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-6h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-type=sidekiq&var-stage=main that we have reduced saturation and not increased it.
Web/API/GIT
- Check that staging is still fundamentally functional: open https://staging.gitlab.com/explore and click through to a few selected projects.
- Check the rails logs: https://nonprod-log.gitlab.net/goto/324a7e49b7dafbedd0f240a6aab3d4f6
  - There is a degree of *normal* errors, including e.g. `PG::QueryCanceled: ERROR: canceling statement due to statement timeout`. We're looking to see whether there are more errors than before once this is rolled out.
  - Specifically, we don't want to see errors like `ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use`
- Check on the graphs that we have reduced saturation and not increased it:
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-6h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-type=api&var-stage=main
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-6h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-type=git&var-stage=main
  - https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-6h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-type=web&var-stage=main
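If the rails logs are exported to plain-text files, the "more errors than before" check could be approximated with a small helper like the one below. This is a sketch under assumptions: the function names and the `before.log` / `after.log` files are hypothetical, and the procedure itself only points at the log search linked above.

```shell
#!/usr/bin/env bash
# Count log lines on stdin that report an exhausted Rails DB connection pool.
count_pool_timeouts() {
  # grep -c prints 0 but exits 1 when nothing matches, hence "|| true".
  grep -c 'ActiveRecord::ConnectionTimeoutError' || true
}

# Succeed only if the "after" file has no more pool timeouts than the "before" file.
# Usage (hypothetical exports): no_new_pool_timeouts before.log after.log
no_new_pool_timeouts() {
  local before after
  before=$(count_pool_timeouts < "$1")
  after=$(count_pool_timeouts < "$2")
  echo "before=${before} after=${after}"
  [ "$after" -le "$before" ]
}
```

Only the pool-timeout error is counted here; the normal background errors such as `PG::QueryCanceled` are deliberately ignored.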
Key metrics to observe
- Metric: Connection Pool usage
- Location: https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=sidekiq&var-stage=main&from=now-6h&to=now
- Location: https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=api&var-stage=main&from=now-6h&to=now
- Location: https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=git&var-stage=main&from=now-6h&to=now
- Location: https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=web&var-stage=main&from=now-6h&to=now
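The saturation dashboard links in this issue differ only in the `var-environment` and `var-type` query parameters, so they can be generated instead of copied by hand. A minimal sketch follows. The helper name `dashboard_url` is hypothetical, and it uses the parameter order of the staging links in the Monitoring section.

```shell
#!/usr/bin/env bash
# Print the connection pool saturation dashboard URL for an environment and service type.
dashboard_url() {
  local environment="$1" type="$2"
  printf 'https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1&from=now-6h&to=now&var-PROMETHEUS_DS=Global&var-environment=%s&var-type=%s&var-stage=main\n' "$environment" "$type"
}

# List the staging dashboards for every service type touched by this change.
for type in sidekiq api git web; do
  dashboard_url gstg "$type"
done
```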
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no active incidents
Edited by Craig Miskell