Spin up urgent-other sidekiq shard on gprd
Context: scalability#27 (closed) and scalability#226 (closed)
Production Change - Criticality 2

| Change Objective | Add an urgent-other selector-based sidekiq shard to gprd |
| --- | --- |
| Change Type | Deployment, New Feature |
| Services Impacted | Sidekiq |
| Change Team Members | @cmiskell |
| Change Severity | C2 |
| Change Reviewer | @hphilipps |
| Tested in staging | scalability#226 (comment 307433745) |
| Dry-run output | |
| Due Date | 2020-04-01 00:30 UTC (13:30 engineer time) |
| Time tracking | 1hr |
Detailed steps for the change
Pre-conditions:

- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/2926 is approved and merged
Steps:

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1512
- Apply (probably manually via `tf plan -target module.sidekiq` if there are unclean changes in the gprd env); see the Terraform sketch after this list
- Follow the bootstrapping process: `gcloud compute --project=gitlab-production instances tail-serial-port-output sidekiq-urgent-other-01-sv-gprd --zone=us-east1-c`
- Once it has finished bootstrapping, ensure gitlab-ee is installed: `ssh sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal "dpkg -l gitlab-ee || sudo apt install gitlab-ee"`
  - It will not install automatically if a deploy is active at the time the node is bootstrapped, because of Chef override attributes set by the deployer.
- Monitor:
  - sidekiq-cluster logs: `sudo tail -f /var/log/gitlab/sidekiq-cluster/current`. We're looking for it to be picking up several tens of jobs per second; compare primarily to the `besteffort` nodes (roughly equivalent, although not a direct mapping). See the log spot-check sketch below.
  - Host performance: use `top` and https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd. The node should be using a noticeable amount of CPU, but should not be running at 100%. It is difficult to predict exactly how busy it will be, although this node should be fairly quiet; the goal is to ensure it is not saturated and causing blockages.
  - General sidekiq stats: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview - watching for queueing or latency changes (little change expected; if anything they should be lower, not higher).
  - https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-sidekiq&var-stage=main&var-component=single_threaded_cpu&fullscreen - there may be a small increase in single-CPU saturation, but it is expected to be within noise levels (<5% absolute). A large increase (>5% absolute) would be cause for re-evaluation and likely rollback.
- Monitoring should be continuous for the first 30 minutes, and periodic for another 2 hours.
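As a sketch of the targeted apply in the second step, assuming `tf` is a wrapper around `terraform` run from the gprd environment directory (the directory path here is hypothetical; the exact invocation may differ in our setup):

```shell
# A minimal sketch of a targeted plan/apply, limiting the blast radius to the
# sidekiq module when the wider gprd environment has unclean changes.
# Assumes `tf` wraps `terraform` with gprd state configured; adjust paths as needed.
cd environments/gprd            # hypothetical path to the gprd environment
tf plan -target module.sidekiq -out sidekiq.tfplan
# Review the plan: only the new sidekiq-urgent-other node should appear.
tf apply sidekiq.tfplan
```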
The change has succeeded if the new node is processing jobs, is not overloaded, and there is no negative effect on sidekiq throughput, apdex, saturation, or queue lengths that could be attributed to this change.
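For the log-based throughput check in the monitoring step, a rough spot-check can be run from the command line. This is a sketch only: it counts new log lines over a 10-second window as a proxy for job pickup, and assumes roughly one log line per job event, which may not exactly match the sidekiq-cluster log format:

```shell
# Rough jobs-per-second proxy: count new sidekiq-cluster log lines for 10s.
# Treat the result as indicative only; it assumes ~one line per job event.
ssh sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal \
  "sudo timeout 10 tail -n0 -f /var/log/gitlab/sidekiq-cluster/current | wc -l" \
  | awk '{printf "~%.1f lines/sec\n", $1 / 10}'
```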
Rollback steps

Minimum (should be enough to get it out of the way):

- Stop sidekiq-cluster on the new node: `sudo gitlab-ctl stop sidekiq-cluster`
- Stop chef-client to avoid any other activity that might restart services: `sudo systemctl stop chef-client`

Optional:

- Shut the new VM down entirely if a resolution to any problems is not imminent; this will de-register it from chef and prevent it from starting up again, e.g. from deploys. See the gcloud sketch below.
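A minimal sketch of shutting the VM down via gcloud, reusing the project and zone from the bootstrapping step above (verify the instance name before running):

```shell
# Stop the new node entirely so nothing (deploys, chef) can restart services on it.
# Project and zone taken from the tail-serial-port-output command in the Steps.
gcloud compute instances stop sidekiq-urgent-other-01-sv-gprd \
  --project=gitlab-production --zone=us-east1-c
```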
Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to the change being rolled out