Spin up memory-bound sidekiq shard on gprd
Context: scalability#27 (closed) and scalability#276 (closed)
## Production Change - Criticality 2

| | |
|---|---|
| Change Objective | Add a memory-bound selector-based sidekiq shard to gprd |
| Change Type | Deployment / New Feature |
| Services Impacted | Sidekiq |
| Change Team Members | @cmiskell |
| Change Severity | C2 |
| Change Reviewer | @hphilipps |
| Tested in staging | scalability#262 (comment 315889033) |
| Dry-run output | |
| Due Date | 2020-04-08 01:30 UTC (13:30 engineer time) |
| Time tracking | 1hr |
## Detailed steps for the change

Pre-conditions:

- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3078 is approved, merged, and the apply_to_prod pipeline job has run
Steps:

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1580
- Apply the Terraform change (probably manually via `tf plan -target module.sidekiq` if there are unclean changes in the gprd environment)
- Follow the bootstrapping process: `gcloud compute --project=gitlab-production instances tail-serial-port-output sidekiq-memory-bound-01-sv-gprd --zone=us-east1-c`
- Once it has finished bootstrapping, ensure gitlab-ee is installed: `ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal "dpkg -l gitlab-ee || sudo apt install gitlab-ee"`
  - It will not install automatically if a deploy is active at the time the node is bootstrapped, because of Chef override attributes set by the deployer.
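Because the package may not appear until an in-flight deploy finishes, the install check above may need re-running. A minimal sketch of a polling wrapper for that, assuming a hypothetical `retry` helper and an arbitrary 10-attempt/30-second cadence (neither is part of this runbook):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <sleep_seconds> <command...>
retry() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Assumed invocation against the new node (host name from the runbook):
# retry 10 30 ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal \
#   'dpkg -s gitlab-ee >/dev/null 2>&1'
```

`dpkg -s` is used in the sketch because it exits non-zero when the package is absent, which makes it suitable as a retry condition; the runbook's own `dpkg -l ... || sudo apt install ...` one-liner remains the authoritative step.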
- Monitor:
  - sidekiq-cluster logs: `sudo tail -F /var/log/gitlab/sidekiq-cluster/current`. It will listen to design_management_new_version, but the rate of jobs is exceptionally low (measured in 10s per day).
  - Host performance: use `top` and https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd. CPU is expected to be near zero; we will rebalance later.
  - General sidekiq stats: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview, watching for queueing or latency changes (little change expected; lower if anything, not higher).
  - Redis-sidekiq single-threaded CPU saturation: https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-sidekiq&var-stage=main&var-component=single_threaded_cpu&fullscreen. No substantial change is expected at all.
- Monitor for 30 minutes; no longer is required for this shard.
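For the log-watching step, one way to avoid eyeballing the raw stream is to filter it for likely-problem lines. A hedged sketch; the `flag_problems` name and the grep patterns are assumptions, not something the sidekiq-cluster logs are documented to emit:

```shell
# Hypothetical filter for sidekiq-cluster log lines; patterns are assumptions
# and should be adjusted to whatever the logs actually contain.
flag_problems() {
  grep --line-buffered -iE 'error|fail|exception' || true
}

# Usage on the node (log path from the runbook's monitoring step):
# sudo tail -F /var/log/gitlab/sidekiq-cluster/current | flag_problems
```

The `|| true` keeps the pipeline alive when a chunk of log contains no matches, which matters when following a mostly-quiet log like this shard's.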
The change has succeeded if the new node is listening to the right queue and there is no negative effect on sidekiq throughput, apdex, saturation, or queue lengths that could be attributed to this change.
## Rollback steps

Minimum (should be enough to get it out of the way):

- Stop sidekiq-cluster on the new node: `sudo gitlab-ctl stop sidekiq-cluster`
- Stop chef-client to avoid any other activity that might restart services: `sudo systemctl stop chef-client`

Optional:

- Shut the new VM down entirely if a resolution to any problems is not imminent; this will de-register it from Chef and prevent it from starting up again, e.g. from deploys.
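The rollback steps above can be sketched as a single operational fragment. The two stop commands are taken from this runbook; the `gcloud compute instances stop` invocation for the optional shutdown is an assumption, mirroring the project and zone used in the bootstrap command:

```shell
# Minimal rollback, run against the new node (commands from the runbook):
ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal <<'EOF'
sudo gitlab-ctl stop sidekiq-cluster
sudo systemctl stop chef-client
EOF

# Optional full shutdown if a fix is not imminent (assumed invocation,
# reusing the bootstrap command's project and zone):
# gcloud compute --project=gitlab-production instances stop \
#   sidekiq-memory-bound-01-sv-gprd --zone=us-east1-c
```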
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out
Edited by Craig Miskell