Rolling restart of web fleet to propagate an env var change
| Change Objective | Propagate an env var change from #1170 (closed) |
| --- | --- |
| Change Team Members | @craigf |
| Buddy check | @T4cC0re will review |
| Tested in staging | Will execute plan in staging first |
| Schedule of the change | 2019-10-16 12:00 UTC |
| Duration of the change | 1h |
Execute this on staging first. The only differences are: replace `unicorn` with `puma`, omit `--production` in the chatops commands, and replace `gprd` with `gstg` in the `set-server-state` commands.
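For reference, a hedged sketch of what the first staging commands would look like after those substitutions; the `gstg` node-name pattern and the `puma master` process name are inferred from the substitution rule above and should be verified before running:

```shell
# In Slack (no --production for staging):
#   /chatops run canary --drain --backend canary_web

# Restart puma on the staging canary fleet. The gstg node-name pattern and the
# "puma master" process name follow the substitution rule above; confirm both
# against a staging node before relying on this.
knife ssh -C3 'name:web-cny-*-sv-gstg*' -- sudo pkill -QUIT -f "puma master"
```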
Drain the canary fleet
`/chatops run canary --drain --backend canary_web --production` in Slack
- Look at the dashboard for web cny stage. Wait for the traffic to drop to 0.
Restart unicorn in the canary fleet
`knife ssh -C3 'name:web-cny-*-sv-gprd*' -- sudo pkill -QUIT -f "unicorn master"`
- Wait for unicorn to exit and for runsv to bring it back up:
`knife ssh -C3 'name:web-cny-*-sv-gprd*' -- sudo gitlab-ctl status unicorn` and look at the uptime.
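If you would rather poll than re-run the check by hand, a minimal sketch using the same node query as above; stop it once every node shows a freshly reset uptime:

```shell
# Repeat the status check every 15s. gitlab-ctl prints lines like
# "run: unicorn: (pid N) Ns"; a small uptime means runsv has already
# brought unicorn back up after the QUIT.
while true; do
  knife ssh -C3 'name:web-cny-*-sv-gprd*' -- sudo gitlab-ctl status unicorn
  sleep 15
done
```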
Bring traffic back to the canary fleet
`/chatops run canary --ready --backend canary_web --production` in Slack
- As the RPS rises on the dashboard, check that the error ratio doesn't spike.
Repeat the steps below for the main fleet, in batches of 4, until the rollout is complete.
Drain a segment of the main fleet
- Change the dashboard to the web main stage. While executing these steps, keep an eye on the error ratio.
`<chef-repo>/bin/set-server-state gprd drain web-X[Y]`. Using a batch size of 4, an example final parameter would be `web-0[1-4]`.
- Keep an eye on the saturation chart. We have removed nodes from load balancing, but due to unicorn's pre-forking nature CPU shouldn't climb much; it's still worth watching.
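A minimal sketch of draining the first batch, assuming `set-server-state` takes one node name per invocation (adjust if it accepts the `web-0[1-4]` pattern directly); the chef-repo path is a placeholder:

```shell
# Hypothetical path to your chef-repo checkout; adjust as needed.
CHEF_REPO=~/src/chef-repo

# Drain the first batch of four web nodes.
for n in 01 02 03 04; do
  "${CHEF_REPO}/bin/set-server-state" gprd drain "web-${n}"
done
```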
Restart unicorn on a segment of the main fleet
`knife node list | grep -E 'web\-0[1-4]\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo pkill -QUIT -f "unicorn master"'` (⚠ tested with GNU grep). Replace the numbers in the grep pattern as appropriate for the batch.
- Wait for unicorn to exit and for runsv to bring it back up:
`knife node list | grep -E 'web\-0[1-4]\-sv\-gprd' | xargs -P0 -I% ssh % 'sudo gitlab-ctl status unicorn'` and check the uptime.
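To avoid editing the pipeline by hand for each batch, a sketch with the batch range pulled out into a variable (same commands as above, GNU grep assumed):

```shell
# BATCH holds the grep range for this batch, e.g. '0[1-4]' for web-01..web-04,
# '0[5-8]' for the next four, and so on.
BATCH='0[1-4]'

# Send QUIT to the unicorn master on every node in the batch.
knife node list \
  | grep -E "web\-${BATCH}\-sv\-gprd" \
  | xargs -P0 -I% ssh % 'sudo pkill -QUIT -f "unicorn master"'

# Confirm runsv has brought unicorn back up (uptime should have reset).
knife node list \
  | grep -E "web\-${BATCH}\-sv\-gprd" \
  | xargs -P0 -I% ssh % 'sudo gitlab-ctl status unicorn'
```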
Bring traffic back to the segment of the main fleet
`<chef-repo>/bin/set-server-state gprd ready web-X[Y]`
- Keep an eye on the error ratio before moving on to the next step.
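And the mirror image for re-enabling the batch, under the same assumption about how `set-server-state` takes node names:

```shell
# Hypothetical chef-repo path, as in the drain sketch above.
CHEF_REPO=~/src/chef-repo

# Mark the first batch of four web nodes ready again.
for n in 01 02 03 04; do
  "${CHEF_REPO}/bin/set-server-state" gprd ready "web-${n}"
done
```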
There isn't really a rollback for this change: the rollback of a restart is another restart. In the unlikely event that the new env vars themselves cause problems, execute the rollback described in #1170 (closed) (remove the env vars from GKMS), run Chef on the web fleet, and restart again.
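A rough sketch of that rollback path, assuming the whole web fleet can be targeted with the same knife query style used above; the GKMS change itself is covered in #1170:

```shell
# After removing the env vars from GKMS (see #1170), converge Chef across the
# web fleet so the previous configuration is written back out...
knife ssh -C3 'name:web-*-sv-gprd*' -- sudo chef-client

# ...then repeat the rolling restart procedure above (canary first, then the
# main fleet in batches of 4).
```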