[production] Increase pgbouncer connection pool size for sidekiq read requests to database read replicas from `2` to `15` per replica node
Production Change
Change Summary
Increase pgbouncer connection pool size for sidekiq read requests to database read replicas from 2 to 15 per replica node.
Change Details
- Services Impacted - pgbouncer
- Change Technician - @nnelson
- Change Criticality - C1
- Change Type - changescheduled
- Change Reviewer - @Finotto
- Due Date - 2021-04-05 2100 UTC
- Time tracking - 30 minutes
- Downtime Component - no downtime expected/required
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Completed
- Prepare a merge request for the chef roles to increase the pgbouncer connection pool size for sidekiq read requests to database read replicas from `2` to `15` per replica node: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5262
- Have the merge request reviewed by a colleague.
- Verify that the pipeline stages have all passed.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 minute
- Click the `Trigger this manual action` button to apply the role changes to the `production` environment: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/jobs/3528422
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 10 minutes
- Execute the following commands, in order: `export GITLAB_ENVIRONMENT='gprd'` and then `bundle exec knife ssh "fqdn:patroni-*-db-${GITLAB_ENVIRONMENT}*" 'sudo grep "gitlabhq_production_sidekiq" /var/opt/gitlab/pgbouncer/databases.ini' --concurrency 1`
- Verify that the output for all hosts is: `gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 pool_size=15 auth_user=pgbouncer dbname=gitlabhq_production`
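Checking every host's output by eye is error-prone, so the verification could be scripted. The following is a minimal sketch, not part of the approved change steps. `check_pool_size` is a hypothetical helper, and it assumes that `knife ssh` prints one line per host, prefixed with the host name and followed by the matching `databases.ini` line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the change plan): reads `knife ssh` output
# on stdin and checks that every gitlabhq_production_sidekiq line carries the
# expected pool_size value.
# Assumption: each line is "<host> <matching databases.ini line>".
check_pool_size() {
  local expected="$1"
  awk -v expected="$expected" '
    /gitlabhq_production_sidekiq/ {
      seen++
      # Require pool_size=<expected> as a whole token, so 15 does not match 150.
      if ($0 !~ ("(^|[ \t])pool_size=" expected "([ \t]|$)")) {
        print "MISMATCH: " $0
        bad++
      }
    }
    END {
      if (seen == 0) { print "NO MATCHING LINES"; exit 2 }
      if (bad > 0) exit 1
      print "OK: " seen " host(s) report pool_size=" expected
    }
  '
}
```

Under those assumptions, the output of the `knife ssh` command could be piped into `check_pool_size 15` after the change, or `check_pool_size 2` after a rollback. A non-zero exit status means that at least one host does not report the expected value, or that no matching lines were found.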
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 minutes
- Revert the merge request above, and add a link to the reversion merge request here: Reversion merge request: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/0000
- Have the reversion MR reviewed by a colleague.
- Apply the reversion MR to production by repeating the change steps from above.
- Repeat the post-change steps from above, but ensure that the output for all hosts is instead: `gitlabhq_production_sidekiq = host=127.0.0.1 port=5432 pool_size=2 auth_user=pgbouncer dbname=gitlabhq_production`
Monitoring
Key metrics to observe
- Metrics:
  - Location:
    - pgbouncer_async_replica_pool: Saturation: https://dashboards.gitlab.net/d/alerts-sat_pgbouncer_async_pool_replica/alerts-pgbouncer_async_replica_pool-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=patroni&var-stage=main
    - Service Apdex: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=3543037459&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
    - Service Error Ratio: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=379598196&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
    - Requests Per Second: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?viewPanel=1598389883&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback:
    - Any prolonged (more than 1 minute) increase of pgbouncer_async_replica_pool component saturation
    - Any prolonged (more than 1 minute) reduction of Service Apdex
    - Any prolonged (more than 1 minute) increase in Service Error Ratio
    - Any prolonged (more than 1 minute) reduction in Requests Per Second
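The "prolonged (more than 1 minute)" criterion can be made concrete with a small check over sampled metric values. The following is a sketch under stated assumptions, not the team's alerting logic. `prolonged_breach` is a hypothetical helper, the `epoch_seconds value` input format is invented for illustration, and the baseline value has to be chosen by the operator from the dashboards above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "epoch_seconds value" samples on stdin (oldest
# first) and succeeds if the value stays above (or below) a baseline for one
# continuous run longer than max_seconds (default 60).
# Usage: prolonged_breach above|below <baseline> [max_seconds]
prolonged_breach() {
  local direction="$1" baseline="$2" max_seconds="${3:-60}"
  awk -v dir="$direction" -v base="$baseline" -v max="$max_seconds" '
    {
      # Force numeric comparison of the sample against the baseline.
      breach = (dir == "above") ? ($2 + 0 > base + 0) : ($2 + 0 < base + 0)
      if (breach) {
        if (!in_run) { in_run = 1; start = $1 }   # a new breaching run begins
        if ($1 - start > max) found = 1           # run has lasted too long
      } else {
        in_run = 0                                # run interrupted, reset
      }
    }
    END { exit(found ? 0 : 1) }
  '
}
```

For example, `prolonged_breach above 0.8` would succeed on saturation samples that remain above a hypothetical baseline of 0.8 for more than 60 seconds, while `prolonged_breach below <baseline>` covers the Apdex and Requests Per Second criteria. A brief spike that recovers within a minute does not trigger it.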
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Nels Nelson