Spin up memory-bound sidekiq shard on gprd
Context: scalability#27 (closed) and scalability#276 (closed)
## Production Change - Criticality 2

| | |
|---|---|
| Change Objective | Add a memory-bound selector-based sidekiq shard to gprd |
| Change Type | Deployment / New Feature |
| Services Impacted | Sidekiq |
| Change Team Members | @cmiskell |
| Change Severity | C2 |
| Change Reviewer | @hphilipps |
| Tested in staging | scalability#262 (comment 315889033) |
| Dry-run output | |
| Due Date | 2020-04-08 01:30 UTC (13:30 engineer time) |
| Time tracking | 1hr |
## Detailed steps for the change

Pre-conditions:

- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3078 is approved, merged, and the apply_to_prod pipeline job has run
Steps:

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1580
- Apply the Terraform change (probably manually via `tf plan -target module.sidekiq` if there are unclean changes in the gprd environment)
- Follow the bootstrapping process: `gcloud compute --project=gitlab-production instances tail-serial-port-output sidekiq-memory-bound-01-sv-gprd --zone=us-east1-c`
- Once it has finished bootstrapping, ensure gitlab-ee is installed: `ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal "dpkg -l gitlab-ee || sudo apt install gitlab-ee"`
  - It will not install automatically if a deploy is active at the time the node is bootstrapped, because of Chef override attributes set by the deployer.
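Because the package may not appear until an in-flight deploy finishes, the install check above may need re-running. A minimal sketch of a polling wrapper for that, assuming a hypothetical `retry` helper and an arbitrary 10-attempt/30-second cadence (neither is part of this runbook):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <sleep_seconds> <command...>
retry() {
  local attempts=$1 delay=$2; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Assumed invocation against the new node (host name from the runbook):
# retry 10 30 ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal \
#   'dpkg -s gitlab-ee >/dev/null 2>&1'
```

`dpkg -s` is used in the sketch because it exits non-zero when the package is absent, which makes it suitable as a retry condition; the runbook's own `dpkg -l ... || sudo apt install ...` one-liner remains the authoritative step.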
- Monitor:
  - sidekiq-cluster logs: `sudo tail -F /var/log/gitlab/sidekiq-cluster/current`. It will listen to design_management_new_version, but the rate of jobs is exceptionally low (measured in 10s per day).
  - Host performance: use `top` and https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd. CPU is expected to be near zero; we will rebalance later.
  - General sidekiq stats: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview, watching for queueing or latency changes (little change expected; lower if anything, not higher).
  - Redis-sidekiq single-threaded CPU saturation: https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-sidekiq&var-stage=main&var-component=single_threaded_cpu&fullscreen. No substantial change is expected at all.
- Monitor for 30 minutes; no longer is required for this shard.
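For the log-watching step, one way to avoid eyeballing the raw stream is to filter it for likely-problem lines. A hedged sketch; the `flag_problems` name and the grep patterns are assumptions, not something the sidekiq-cluster logs are documented to emit:

```shell
# Hypothetical filter for sidekiq-cluster log lines; patterns are assumptions
# and should be adjusted to whatever the logs actually contain.
flag_problems() {
  grep --line-buffered -iE 'error|fail|exception' || true
}

# Usage on the node (log path from the runbook's monitoring step):
# sudo tail -F /var/log/gitlab/sidekiq-cluster/current | flag_problems
```

The `|| true` keeps the pipeline alive when a chunk of log contains no matches, which matters when following a mostly-quiet log like this shard's.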
The change has succeeded if the new node is listening to the right queue and there is no negative effect on sidekiq throughput, apdex, saturation, or queue lengths that could be attributed to this change.
## Rollback steps

Minimum (should be enough to get it out of the way):

- Stop sidekiq-cluster on the new node: `sudo gitlab-ctl stop sidekiq-cluster`
- Stop chef-client to avoid any other activity that might restart services: `sudo systemctl stop chef-client`

Optional:

- Shut the new VM down entirely if a resolution to any problems is not imminent; this will de-register it from Chef and prevent it from starting up again, e.g. from deploys.
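The rollback steps above can be sketched as a single operational fragment. The two stop commands are taken from this runbook; the `gcloud compute instances stop` invocation for the optional shutdown is an assumption, mirroring the project and zone used in the bootstrap command:

```shell
# Minimal rollback, run against the new node (commands from the runbook):
ssh sidekiq-memory-bound-01-sv-gprd.c.gitlab-production.internal <<'EOF'
sudo gitlab-ctl stop sidekiq-cluster
sudo systemctl stop chef-client
EOF

# Optional full shutdown if a fix is not imminent (assumed invocation,
# reusing the bootstrap command's project and zone):
# gcloud compute --project=gitlab-production instances stop \
#   sidekiq-memory-bound-01-sv-gprd --zone=us-east1-c
```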
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out
Edited by Craig Miskell