Resize sidekiq-pipeline nodes to custom-8-15
Production Change - Criticality 3

Change Objective | More CPU for the pipeline nodes to reduce contention and increase queue performance. See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7294#note_196018603 for some discussion. |
---|---|
Change Type | ConfigurationChange |
Services Impacted | Pipeline scheduling (sidekiq) |
Change Team Members | @cmiskell |
Change Severity | C3 |
Buddy check or tested in staging | Not done in staging. Checker @ahanselka |
Schedule of the change | 2019-07-26 02:00 UTC for sidekiq-pipeline-0{4,5,6}, 2019-07-29 04:00 UTC |
Duration of the change | 15 minutes |
Detailed steps for the change. Each step must include: | - pre-conditions for execution of the step, - execution commands for the step, - post-execution validation for the step, - rollback of the step |
Stage 1:
- Change `n1-standard-4` to `custom-8-15360` in the `machine_types` map of `environments/gprd/variables.tf` in terraform, for `sidekiq-pipeline`.
- Drain connections on the target node: `knife ssh 'TARGETNODE' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'`
- Shutdown.
- Resize to custom-8-15 by hand; the pipeline nodes in the sidekiq module do not have `allow_stopping_for_update` configured (the rest do). This needs fixing later.
  `gcloud compute instances set-machine-type sidekiq-pipeline-INDEX-sv-gprd --machine-type custom-8-15360 --zone us-east1-ZONEID --project gitlab-production`
  `gcloud compute instances start sidekiq-pipeline-INDEX-sv-gprd --zone us-east1-ZONEID --project gitlab-production`
- Repeat for instances [4] and [5], sequentially.
- Plan: `tf plan -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[3]' -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[4]' -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[5]' -out /tmp/tfplan` and verify no changes are outstanding.
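
The per-node part of Stage 1 could be sketched as a dry-run script along the following lines. This is only an illustration under several assumptions, not the procedure that was actually run: the `run` and `resize_node` helpers are hypothetical, the node indices `04 05 06` are taken from the schedule row, `ZONEID` is left as the placeholder used in the steps, and `gcloud compute instances stop` is assumed for the shutdown step, since no shutdown command is given above. The `describe` call is a suggested post-execution check. Nothing is executed unless `DRY_RUN=0` is set; by default each command is only printed.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the Stage 1 per-node sequence (hypothetical helpers).
set -eu

PROJECT="gitlab-production"
# custom-<vCPUs>-<memory in MB>: 8 vCPUs and 15 GB (15 * 1024 MB) -> custom-8-15360
MACHINE_TYPE="custom-8-$((15 * 1024))"
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# resize_node INDEX ZONE
# Assumes the node has already been drained with the knife ssh TSTP step.
resize_node() {
  local name="sidekiq-pipeline-$1-sv-gprd" zone="$2"
  # Assumption: the shutdown is done via gcloud; the steps do not say how.
  run gcloud compute instances stop "$name" --zone "$zone" --project "$PROJECT"
  run gcloud compute instances set-machine-type "$name" --machine-type "$MACHINE_TYPE" --zone "$zone" --project "$PROJECT"
  run gcloud compute instances start "$name" --zone "$zone" --project "$PROJECT"
  # Suggested validation: the reported machine type should end in $MACHINE_TYPE.
  run gcloud compute instances describe "$name" --zone "$zone" --project "$PROJECT" --format 'value(machineType)'
}

# One node at a time; the real zone of each node has to be filled in.
for index in 04 05 06; do
  resize_node "$index" "us-east1-ZONEID"
done
```

Keeping the script in dry-run mode by default makes it possible to review the exact commands with the checker before touching a production node.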
Stage 2: After #963 (closed) is completed and looks stable, repeat on instances [0], [1], [2]. If we did these now, chef would run when they restart, and sidekiq would be reconfigured and start processing from the old redis queue. Fixing that is possible, but fiddly, and I'd rather avoid the mess.
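
Since Stage 2 repeats the plan check with different instance indices, the `-target` list could be generated rather than typed by hand. The `plan_targets` function below is a hypothetical helper that only prints the command for review; it assumes the same `tf` wrapper and resource addresses as the Stage 1 plan step.

```shell
#!/usr/bin/env bash
# plan_targets INDEX...
# Prints (does not run) a tf plan command targeting the given
# sidekiq_pipeline instance indices.
plan_targets() {
  local args=() i
  for i in "$@"; do
    args+=("-target" "'module.sidekiq.google_compute_instance.sidekiq_pipeline[$i]'")
  done
  echo "tf plan ${args[*]} -out /tmp/tfplan"
}

# Stage 2 instances
plan_targets 0 1 2
```

Calling `plan_targets 3 4 5` reproduces the Stage 1 plan command, which is a quick way to check that the helper matches what was reviewed.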
Edited by Craig Miskell