Resize sidekiq-pipeline nodes to custom-8-15

Production Change - Criticality 3 (C3)

Change Objective Give the pipeline nodes more CPU to reduce contention and improve queue performance. See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7294#note_196018603 for discussion.
Change Type Configuration Change
Services Impacted Pipeline scheduling (sidekiq)
Change Team Members @cmiskell
Change Severity C3
Buddy check or tested in staging Not done in staging. Checker @ahanselka
Schedule of the change 2019-07-26 02:00 UTC for sidekiq-pipeline-0{4,5,6} (Stage 1); 2019-07-29 04:00 UTC (Stage 2)
Duration of the change 15 minutes
Detailed steps for the change. Each step must include:

  • pre-conditions for execution of the step
  • execution commands for the step
  • post-execution validation for the step
  • rollback of the step

Stage 1:

  1. In terraform, change n1-standard-4 to custom-8-15360 in the machine_types map of environments/gprd/variables.tf for sidekiq-pipeline
  2. Drain the target node by sending TSTP to the sidekiq processes (they finish in-flight jobs but stop picking up new ones): knife ssh 'TARGETNODE' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'
  3. Shut down the instance
  4. Resize to custom-8-15360 by hand; the pipeline nodes in the sidekiq module do not have allow_stopping_for_update configured (the rest do), so terraform cannot apply this resize itself. This needs fixing later.
    • gcloud compute instances set-machine-type sidekiq-pipeline-INDEX-sv-gprd --machine-type custom-8-15360 --zone us-east1-ZONEID --project gitlab-production
    • gcloud compute instances start sidekiq-pipeline-INDEX-sv-gprd --zone us-east1-ZONEID --project gitlab-production
  5. Repeat steps 2-4 for instances [4] and [5], sequentially.
  6. Plan: tf plan -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[3]' -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[4]' -target 'module.sidekiq.google_compute_instance.sidekiq_pipeline[5]' -out /tmp/tfplan and verify no changes are outstanding
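The per-node sequence in steps 2-4 can be sketched as a shell function. This is a minimal illustration, not part of the approved plan: the `resize_node` name, the `name:<instance>` knife query, the use of `pgrep` in place of the `ps|awk` pipeline, the explicit `gcloud compute instances stop` for the shutdown step, and the `DRY_RUN` flag are all assumptions for the sketch.

```shell
#!/usr/bin/env bash
# Hedged sketch of steps 2-4 for one node. With DRY_RUN=1 the commands
# are printed rather than executed.
resize_node() {
  local index="$1" zone_id="$2"
  local instance="sidekiq-pipeline-${index}-sv-gprd"
  local project="gitlab-production"

  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "DRY-RUN: $*"; else "$@"; fi; }

  # Step 2: TSTP drains sidekiq (finish in-flight jobs, take no new ones).
  run knife ssh "name:${instance}" \
    "for pid in \$(pgrep -f 'sidekiq.*queues' | sort -u); do sudo kill -TSTP \$pid; done"

  # Step 3: stop the instance (no allow_stopping_for_update on these nodes).
  run gcloud compute instances stop "${instance}" \
    --zone "us-east1-${zone_id}" --project "${project}"

  # Step 4: resize by hand, then start it again.
  run gcloud compute instances set-machine-type "${instance}" \
    --machine-type custom-8-15360 \
    --zone "us-east1-${zone_id}" --project "${project}"
  run gcloud compute instances start "${instance}" \
    --zone "us-east1-${zone_id}" --project "${project}"
}

# Example dry run for node 04 in zone us-east1-b:
#   DRY_RUN=1 resize_node 04 b
```

Sequential use (one node fully resized and started before the next) matches step 5 above.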

Stage 2: After #963 (closed) is completed and looks stable, repeat on instances [0], [1], [2]. If we do these now, then when they restart chef will run, sidekiq will be reconfigured, and it will start processing from the old redis queue. Fixing that is possible but fiddly, and I'd rather avoid the mess.
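Stage 2's terraform verification mirrors step 6 of Stage 1, just with indices [0], [1], [2]. A small sketch that generates the -target flags in a loop (the leading echo makes it a dry run; the tf wrapper name is taken from step 6):

```shell
# Build the terraform -target flags for Stage 2 (tf indices 0..2).
targets=()
for idx in 0 1 2; do
  targets+=(-target "module.sidekiq.google_compute_instance.sidekiq_pipeline[${idx}]")
done

# Print the resulting command; drop the leading "echo" to run the plan
# for real, then verify no changes are outstanding.
echo tf plan "${targets[@]}" -out /tmp/tfplan
```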

Edited by Craig Miskell