Scale up web fleet and pgbouncer
C1
Production Change - Criticality 1

Change Objective | Scale up web fleet and pgbouncer |
---|---|
Change Type | Scale up web fleet and pgbouncer |
Services Impacted | web, pgbouncer |
Change Team Members | @aamarsanaa |
Change Severity | C1 |
Buddy check | @andrewn or @bjk-gitlab or @mwasilewski-gitlab |
Tested in staging | This was done this morning in #896 (closed) |
Schedule of the change | June 14th 17:50 UTC+8 |
Duration of the change | 15 mins |
Downtime Component | No |
Detailed steps for the change. Each step must include: | * pre-conditions for execution of the step, * execution commands for the step, * post-execution validation for the step , * rollback of the step |
Pre-conditions
- We currently have 32 web nodes in UP state AND 2 cny web nodes in UP state
- We currently have 4 web nodes in MAINT state
- Unicorn Queueing is in bad shape
Execution
Bump up pgbouncer max_client_conn
ssh patroni-04-db-gprd.c.gitlab-production.internal
sudo -u gitlab-psql psql -h /var/opt/gitlab/pgbouncer -p 6432 -d pgbouncer -U pgbouncer
show config; # and document the max_client_conn
show pools; # and document the current pools stats
set max_client_conn=4496;
Rationale for 4496: max_client_conn is currently 4296, and we are not hitting max_client_conn errors at that setting. We are adding 4 web nodes back, and each web node has 42 unicorn workers, so there can be up to 4 x 42 = 168 additional simultaneous connections. Rounding that up to 200 leaves some headroom: 4296 + 200 = 4496.
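The arithmetic behind that number can be double-checked with a small shell sketch. The variable names are illustrative only, and it assumes, as the rationale does, that each unicorn worker holds at most one pgbouncer client connection at a time.

```shell
#!/bin/sh
# Sanity check for the new max_client_conn value.
# All numbers come from the rationale above; variable names are illustrative.
current_max=4296       # current pgbouncer max_client_conn
nodes_added=4          # web-33 through web-36
workers_per_node=42    # unicorn workers per web node
allowance=200          # rounded-up increase chosen in the rationale

# Assumption: one client connection per unicorn worker in the worst case.
extra_conns=$((nodes_added * workers_per_node))
spare=$((allowance - extra_conns))
new_max=$((current_max + allowance))

echo "extra connections needed: $extra_conns"
echo "spare headroom:           $spare"
echo "new max_client_conn:      $new_max"
```

If the number of nodes or workers per node changes, only the first four variables need updating before re-running the check.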
Put 4 web nodes back to LB in UP state
# From local workstation, cd to the chef-repo directory
# The 4 new web nodes that are in MAINT state are: web-33, web-34, web-35, web-36
./bin/set-server-state gprd ready web-33
./bin/set-server-state gprd ready web-34
./bin/set-server-state gprd ready web-35
./bin/set-server-state gprd ready web-36
Each of the commands above asks for confirmation and shows that only the specific node we want is being targeted, which serves as the safety check.
Post-execution validation
- Run `./bin/get-server-state gprd web` and make sure all web nodes are in UP state (including the 4 nodes above)
- Watch the Unicorn Workers graph to make sure queueing stops and there is a gap between Workers and Max on the 2nd-to-the-right graph for Active Connections
- Watch the Latency Apdex SLO graph to make sure the latency improves
Rollback
For pgbouncer
set max_client_conn=4296;
For web nodes
./bin/set-server-state gprd maint web-33
./bin/set-server-state gprd maint web-34
./bin/set-server-state gprd maint web-35
./bin/set-server-state gprd maint web-36
Edited by Amarbayar Amarsanaa