Rolling restart of web fleet to propagate an env var change

Production Change - Criticality 2 C2

Change Objective	Propagate an env var change from #1170 (closed)
Services Impacted	Web
Change Team Members	@craigf
Change Severity	C2
Buddy check	@T4cC0re will review
Tested in staging	Will execute plan in staging first
Schedule of the change	2019-10-16 12:00 UTC
Duration of the change	1h

Steps

Execute this on staging first. The only differences are: replace "unicorn" with "puma", omit "--production" in the chatops commands, and replace "gprd" with "gstg" in set-server-state commands.
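
The three substitutions above can be applied mechanically. The sketch below is a hypothetical helper (to_staging is not part of chatops or chef-repo) that rewrites a production command line from this plan into its staging form. It applies the gprd to gstg substitution everywhere, including node-name patterns, which is an assumption beyond what the text states for set-server-state commands; review the output before running it.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a production command from this plan for staging.
# Assumption: gprd -> gstg is applied to every occurrence, not only to
# set-server-state arguments.
to_staging() {
  printf '%s\n' "$1" | sed \
    -e 's/unicorn/puma/g' \
    -e 's/ --production//g' \
    -e 's/gprd/gstg/g'
}

# Print the staging form of two commands from the plan.
to_staging '/chatops run canary --drain --backend canary_web --production'
to_staging '<chef-repo>/bin/set-server-state gprd drain web-0[1-4]'
```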

Drain the canary fleet
1. /chatops run canary --drain --backend canary_web --production in Slack
2. Look at the dashboard for web cny stage. Wait for the traffic to drop to 0.
Restart unicorn in the canary fleet
1. knife ssh -C3 'name:web-cny-*-sv-gprd*' -- sudo pkill -QUIT -f "unicorn master"
2. Wait for unicorn to exit and for runsv to bring it back up: run knife ssh -C3 'name:web-cny-*-sv-gprd*' -- sudo gitlab-ctl status unicorn and check the uptime.
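
Checking the uptime by eye can be backed by a small filter. This is a minimal sketch that assumes the status line has the runit-style shape "run: unicorn: (pid N) Ns; ..."; the function names and the threshold are hypothetical, and knife ssh normally prefixes each line with the host name, which would need stripping first.

```shell
#!/bin/sh
# Assumption: status lines look like
#   run: unicorn: (pid 1234) 42s; run: log: (pid 567) 86400s
# Extract the service uptime in seconds; prints nothing if the line does not match.
uptime_seconds() {
  printf '%s\n' "$1" | sed -n 's/^run: [^:]*: (pid [0-9]*) \([0-9]*\)s.*/\1/p'
}

# Succeeds when the service is running and has been up for fewer than $2 seconds.
recently_restarted() {
  up=$(uptime_seconds "$1")
  [ -n "$up" ] && [ "$up" -lt "$2" ]
}
```

For example, recently_restarted "$line" 300 would succeed only for a service that came back up within the last five minutes.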
Bring traffic back to the canary fleet
1. /chatops run canary --ready --backend canary_web --production in Slack
2. As the RPS rises on the dashboard, check the error ratio doesn't spike.
Repeat the steps below for the main fleet, in batches of 4, until the rollout is complete.
Drain a segment of the main fleet
1. Change the dashboard to the web main stage. While executing these steps, keep an eye on the error ratio.
2. <chef-repo>/bin/set-server-state gprd drain web-X[Y]. Using a batch size of 4, an example final parameter would be web-0[1-4].
3. Keep an eye on the saturation chart. The batch has been removed from load balancing, so the remaining nodes absorb its traffic. Because unicorn pre-forks its workers, CPU shouldn't climb much, but it's worth watching.
Restart unicorn on a segment of the main fleet
1. knife node list | grep -E 'web\-0[1-4]\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo pkill -QUIT -f "unicorn master"' (⚠ tested on GNU grep). Replace numbers in grep as appropriate.
2. Wait for unicorn to exit and for runsv to bring it back up: run knife node list | grep -E 'web\-0[1-4]\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo gitlab-ctl status unicorn' and check the uptime.
Bring traffic back to the segment of the main fleet
1. <chef-repo>/bin/set-server-state gprd ready web-X[Y]
2. Keep an eye on the error ratio before moving on to the next batch.
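
To reduce copy-paste mistakes across batches, the per-batch commands can be generated from the batch pattern. The following is a dry-run sketch: it only prints the commands from the steps above and executes nothing. batch_commands and the CHEF_REPO variable are hypothetical names, with CHEF_REPO standing in for the <chef-repo> placeholder.

```shell
#!/bin/sh
# Dry run: print (do not execute) the drain, restart, status, and ready
# commands for one batch. CHEF_REPO is a hypothetical variable; it defaults
# to the literal <chef-repo> placeholder used in the plan.
CHEF_REPO="${CHEF_REPO:-<chef-repo>}"

batch_commands() {
  batch="$1"   # e.g. 0[1-4]
  cat <<EOF
$CHEF_REPO/bin/set-server-state gprd drain web-$batch
knife node list | grep -E 'web\-$batch\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo pkill -QUIT -f "unicorn master"'
knife node list | grep -E 'web\-$batch\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo gitlab-ctl status unicorn'
$CHEF_REPO/bin/set-server-state gprd ready web-$batch
EOF
}

# Print the commands for the first batch of four nodes.
batch_commands '0[1-4]'
```

Printing rather than executing keeps the dashboard checks between steps in human hands, which the plan relies on.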

Rollback

There isn't really a rollback for this: the rollback of a restart is another restart. In the unlikely event that the new env vars themselves cause problems, execute the rollback described in #1170 (closed) (remove the env vars from GKMS), run chef on the web fleet, and restart again.

Edited Oct 16, 2019 by Craig Furman