Spin up urgent-other sidekiq shard on gprd
Context: scalability#27 (closed) and scalability#226 (closed)
Production Change - Criticality 2

| Change Objective | Add an urgent-other selector-based sidekiq shard to gprd |
| --- | --- |
| Change Type | Deployment, New Feature |
| Services Impacted | Sidekiq |
| Change Team Members | @cmiskell |
| Change Severity | C2 |
| Change Reviewer | @hphilipps |
| Tested in staging | scalability#226 (comment 307433745) |
| Dry-run output | |
| Due Date | 2020-04-01 00:30 UTC (13:30 engineer time) |
| Time tracking | 1hr |
Detailed steps for the change
Pre-conditions:

- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/2926 is approved and merged
Steps:

- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1512
- Apply (probably manually via `tf plan -target module.sidekiq` if there are unclean changes in the gprd env); see the Terraform sketch after this list
- Follow the bootstrapping process: `gcloud compute --project=gitlab-production instances tail-serial-port-output sidekiq-urgent-other-01-sv-gprd --zone=us-east1-c`
- Once it has finished bootstrapping, ensure gitlab-ee is installed: `ssh sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal "dpkg -l gitlab-ee || sudo apt install gitlab-ee"`
  - It will not install automatically if a deploy is active at the time the node is bootstrapped, because of Chef override attributes set by the deployer.
- Monitor:
  - sidekiq-cluster logs: `sudo tail -f /var/log/gitlab/sidekiq-cluster/current`. We're looking for it to be picking up several tens of jobs per second; compare primarily to the `besteffort` nodes (roughly equivalent, although not a direct mapping). See the log spot-check sketch below.
  - Host performance: use `top` and https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-environment=gprd&var-node=sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd. The node should be using a noticeable amount of CPU, but should not be running at 100%. It is difficult to predict exactly how busy it will be, although this node should be fairly quiet; the goal is to ensure it is not saturated and causing blockages.
  - General sidekiq stats: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview - watching for queueing or latency changes (little change expected; if anything they should be lower, not higher).
  - https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-sidekiq&var-stage=main&var-component=single_threaded_cpu&fullscreen - there may be a small increase in single-CPU saturation, but it is expected to be within noise levels (<5% absolute). A large increase (>5% absolute) would be cause for re-evaluation and likely rollback.
- Monitoring should be continuous for the first 30 minutes, and periodic for another 2 hours.
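As a sketch of the targeted apply in the second step, assuming `tf` is a wrapper around `terraform` run from the gprd environment directory (the directory path here is hypothetical; the exact invocation may differ in our setup):

```shell
# A minimal sketch of a targeted plan/apply, limiting the blast radius to the
# sidekiq module when the wider gprd environment has unclean changes.
# Assumes `tf` wraps `terraform` with gprd state configured; adjust paths as needed.
cd environments/gprd            # hypothetical path to the gprd environment
tf plan -target module.sidekiq -out sidekiq.tfplan
# Review the plan: only the new sidekiq-urgent-other node should appear.
tf apply sidekiq.tfplan
```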
The change has succeeded if the new node is processing jobs, is not overloaded, and there is no negative effect on sidekiq throughput, apdex, saturation, or queue lengths that could be attributed to this change.
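For the log-based throughput check in the monitoring step, a rough spot-check can be run from the command line. This is a sketch only: it counts new log lines over a 10-second window as a proxy for job pickup, and assumes roughly one log line per job event, which may not exactly match the sidekiq-cluster log format:

```shell
# Rough jobs-per-second proxy: count new sidekiq-cluster log lines for 10s.
# Treat the result as indicative only; it assumes ~one line per job event.
ssh sidekiq-urgent-other-01-sv-gprd.c.gitlab-production.internal \
  "sudo timeout 10 tail -n0 -f /var/log/gitlab/sidekiq-cluster/current | wc -l" \
  | awk '{printf "~%.1f lines/sec\n", $1 / 10}'
```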
Rollback steps

Minimum (should be enough to get it out of the way):

- Stop sidekiq-cluster on the new node: `sudo gitlab-ctl stop sidekiq-cluster`
- Stop chef-client to avoid any other activity that might restart services: `sudo systemctl stop chef-client`

Optional:

- Shut the new VM down entirely if a resolution to any problems is not imminent; this will de-register it from chef and prevent it from starting up again, e.g. from deploys. See the gcloud sketch below.
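A minimal sketch of shutting the VM down via gcloud, reusing the project and zone from the bootstrapping step above (verify the instance name before running):

```shell
# Stop the new node entirely so nothing (deploys, chef) can restart services on it.
# Project and zone taken from the tail-serial-port-output command in the Steps.
gcloud compute instances stop sidekiq-urgent-other-01-sv-gprd \
  --project=gitlab-production --zone=us-east1-c
```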
Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to the change being rolled out