[production] Enable replica database connection pool load balancing for asynchronous sidekiq
Production Change
Change Summary
Enable replica database connection pool load balancing for asynchronous sidekiq.
We should set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to true in order to enable load balancing for Sidekiq.
This environment variable only enables load balancing. All workers still default to :always data consistency, which requires the worker to use the primary.
We already enabled this on staging (gitlab-org/gitlab#325520 (closed)):
- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5258
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!760 (merged)
Also, the connection pool size has been increased in production to 15 per pgbouncer on the patroni read-only replica cluster participants: #4135 (comment 545162312)
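The interaction between the environment variable and the default :always data consistency described in this summary can be sketched as follows. This is only an illustration under stated assumptions, not GitLab's implementation: the function names, the set of accepted truthy values, and the treatment of every non-always setting as replica-eligible are all hypothetical simplifications.

```python
import os

# Hypothetical: the exact values the application accepts as "true" are an assumption.
TRUTHY = {"1", "true", "yes"}


def load_balancing_enabled(env=None):
    """Return True when ENABLE_LOAD_BALANCING_FOR_SIDEKIQ is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get("ENABLE_LOAD_BALANCING_FOR_SIDEKIQ", "")
    return value.strip().lower() in TRUTHY


def database_role(data_consistency="always", env=None):
    """Pick the database role a job's reads would be sent to (simplified).

    "always" is the default and keeps the job on the primary even when load
    balancing is enabled. Any other setting is treated here as eligible for a
    replica, which is a simplification made for this sketch.
    """
    if not load_balancing_enabled(env):
        return "primary"
    if data_consistency == "always":
        return "primary"
    return "replica"
```

With the variable set to true, `database_role()` still returns `"primary"` because the default consistency is "always"; only a worker that opts into another setting would be routed to a replica in this sketch.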
Change Details
- Services Impacted - ServiceSidekiq, ServicePatroni, ServicePostgres, ServiceCI Runners, ServiceAPI, ServiceWeb, ServiceGit, ServiceGitLab Rails
- Change Technician - @nnelson
- Change Criticality - C1
- Change Type - changescheduled
- Change Reviewer - @ahmadsherif
- Due Date - 2021-04-23 19:45 UTC
- Time tracking - 60-120 minutes
- Downtime Component - No downtime expected/required
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Finished
- Prepare for sidekiq read-replica traffic to pgbouncer
- Review, approve, and merge the merge request to Define postgresql load balancing in the consul for gprd in k8s: releases/gitlab/values/gprd.yaml.gotmpl.
- Have reviewed and approved the merge request to Set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to true in gprd for the memory-bound sidekiq shard in k8s: releases/gitlab/values/gprd.yaml.gotmpl.
- Have reviewed and approved subsequent merge requests to define the same environment variable in the remaining sidekiq shards in k8s: releases/gitlab/values/gprd.yaml.gotmpl.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 20-100 minutes
- Merge Set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to true in gprd for the memory-bound sidekiq shard in k8s: releases/gitlab/values/gprd.yaml.gotmpl.
- Wait 2-5 minutes, then run the Post-Change Steps below and monitor the enumerated metrics before proceeding.
- Iteratively merge subsequent merge requests to define the same environment variable in the remaining sidekiq shards in k8s: releases/gitlab/values/gprd.yaml.gotmpl, waiting 2-5 minutes per shard.
- Set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to true for gprd for VMs in the Chef role: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/blob/master/roles/gprd-base-be-sidekiq.json
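The shard-by-shard rollout in the change steps above amounts to a loop with a soak period and a health check between shards. The following is a minimal sketch of that control flow only. The `enable` and `healthy` callbacks are hypothetical stand-ins for merging a merge request and evaluating the post-change metrics; the shard names come from the merge requests referenced in this issue, but their order after memory-bound and the completeness of the list are assumptions.

```python
import time

# Shard names as mentioned in this issue; order and completeness are assumptions.
SHARDS = [
    "memory-bound",
    "catchall",
    "low-urgency-cpu-bound",
    "database-throttled",
    "urgent-other",
]


def roll_out(shards, enable, healthy, wait_seconds=300, sleep=time.sleep):
    """Enable load balancing one shard at a time, stopping at the first failed check.

    enable(shard) -- hypothetical callback that applies the change to one shard
    healthy()     -- hypothetical callback that evaluates the post-change metrics
    Returns the shards that were enabled before the loop finished or stopped.
    """
    enabled = []
    for shard in shards:
        enable(shard)
        enabled.append(shard)
        sleep(wait_seconds)  # the plan asks for a 2-5 minute wait per shard
        if not healthy():
            break  # stop here and hand over to the rollback steps
    return enabled
```

Returning the list of shards already enabled makes it clear which ones would need a revert if the loop stops early.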
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Confirm that the connection saturation metric values do not exceed tolerances. Also confirm that the pgbouncer_async_primary_pool component saturation backs away from the 100% ceiling.
- Confirm that the Patroni apdex is not significantly affected.
- Confirm that Postgres Async (Sidekiq) replica Connection Pool Utilization per Node increases by 15 as expected.
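The three verification checks above can be expressed as a single function that returns a list of failures. This is a sketch under assumptions: the thresholds for "tolerances" and "significantly affected" are not stated in this issue, so `max_saturation`, `max_apdex_drop`, and `pool_tolerance` are hypothetical defaults, and the input values are expected to be read manually from the dashboards.

```python
def verify_post_change(saturation, apdex_before, apdex_after,
                       pool_before, pool_after,
                       max_saturation=0.9, max_apdex_drop=0.01,
                       expected_pool_increase=15, pool_tolerance=0):
    """Return a list of failure messages; an empty list means every check passed.

    saturation -- mapping of component name to saturation ratio (0.0-1.0)
    All thresholds are hypothetical defaults, not values taken from the issue.
    """
    failures = []

    # Check 1: no component may exceed the saturation tolerance.
    for component, value in saturation.items():
        if value > max_saturation:
            failures.append(
                f"{component} saturation {value:.0%} exceeds {max_saturation:.0%}"
            )

    # Check 2: the Patroni apdex must not drop by more than the allowed amount.
    if apdex_before - apdex_after > max_apdex_drop:
        failures.append(
            f"patroni apdex dropped from {apdex_before} to {apdex_after}"
        )

    # Check 3: replica pool utilization per node should rise by the expected amount.
    increase = pool_after - pool_before
    if abs(increase - expected_pool_increase) > pool_tolerance:
        failures.append(
            f"replica pool utilization changed by {increase}, "
            f"expected {expected_pool_increase}"
        )

    return failures
```

For example, `verify_post_change({"pgbouncer_async_primary_pool": 0.7}, 0.999, 0.998, 0, 15)` returns an empty list under these default thresholds.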
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 20 minutes
- Rollback: Set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to false for gprd for k8s
- Set environment variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to 'false' for gprd for VMs in the Chef role
- Example Revert MR: Draft: Disables load balancing for catchall sidekiq: gitlab-com/gl-infra/k8s-workloads/gitlab-com!812 (closed)
- Note that this is one of the largest clusters, so in the event a rollback is required, please merge this Revert MR first.
- Other revert MRs will be added shortly.
- Revert MR: Draft: Disables use of load-balanced database read-replica pools for low-urgency-cpu-bound and database-throttled sidekiq shards: gitlab-com/gl-infra/k8s-workloads/gitlab-com!813 (closed)
- Revert MR: Draft: Disables use of load balanced database read-replica pools for urgent-other sidekiq shard: gitlab-com/gl-infra/k8s-workloads/gitlab-com!814 (closed)
Monitoring
Key metrics to observe
- Metric: Postgres Async (Sidekiq) replica Connection Pool Utilization per Node
- Metric: component saturation
- Metric: patroni Service apdex
- Metric: Postgres Async (Sidekiq) replica Connection Pool Utilization per Node
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.