Build/cutover to new Redis-Sidekiq instances

Production Change - Criticality 1 C1

Change Objective	Cutover to new redis-sidekiq cluster
Change Type	Architecture change
Services Impacted	~"Service:Redis" ~"Service:Sidekiq"
Change Team Members	@craig @cmiskell
Change Severity	C1
Buddy check	A colleague will review the change
Tested in staging	The change was tested on the staging environment. See infrastructure#7199 for progress/details
Schedule of the change	2019-07-22 @ 0200UTC
Duration of the change	Time to execute the change (including a possible rollback): TBD
Downtime Component	No - jobs will continue to queue, and processing will resume by pulling jobs from the new Redis queue once the chef attributes are updated and applied

Wait for the sidekiq-tmp fleet to finish all its pending jobs (leave it for a few hours)
Update node counts to original values (scale down)
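The wait on the sidekiq-tmp fleet above can be scripted rather than watched by hand. The following is a minimal sketch only, assuming some command exists that prints the number of jobs still pending on that fleet as a plain integer. The helper name wait_for_zero and the count_tmp_fleet_jobs command in the usage line are hypothetical and not part of the original change plan.

```shell
# Hypothetical helper: poll a caller-supplied command until it prints 0.
# Usage: wait_for_zero <interval_seconds> <command> [args...]
wait_for_zero() {
  interval=$1
  shift
  while true; do
    # Run the supplied command; give up if it fails.
    count=$("$@") || return 1
    # Refuse anything that is not a non-negative integer.
    case "${count}" in
      ''|*[!0-9]*) echo "unexpected output: ${count}" >&2; return 1 ;;
    esac
    echo "pending: ${count}"
    [ "${count}" -eq 0 ] && return 0
    sleep "${interval}"
  done
}
```

For example, `wait_for_zero 60 count_tmp_fleet_jobs` would check once a minute, where count_tmp_fleet_jobs stands in for whatever job-count query is available in the environment. Scale down only after it returns successfully.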

(If necessary) Update node counts to double current values (scale up) and (re-)apply temporary role(s)
Ensure sidekiq-tmp nodes are processing jobs
Stop chef-client on all Sidekiq nodes
Drain Sidekiq jobs on sidekiq-keep nodes
NB: need to send TSTP to the 'queues' processes, not the root sidekiq-cluster process
knife ssh 'roles:sidekiq-keep AND environment:gprd' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'
Monitor queues at https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
Update redis_queues_sentinels role attribute in gprd-base.json with prior Redis Sentinel node names
Remove the "redis_queues_instance" key from the gitlab-omnibus-secrets GKMS vault
Remove /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml and restart Sidekiq:
for role in gprd-base-be-sidekiq; do knife ssh "roles:${role}" "sudo rm /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml && sudo gitlab-ctl restart sidekiq-cluster"; done
Remove /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml and restart Unicorn:
for role in gprd-base-fe-{api,web{,-pages}}; do knife ssh "roles:${role}" "sudo rm /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml && sudo gitlab-ctl restart unicorn"; done
Re-enable/run chef-client on all nodes (minimally, gprd-base-be-sidekiq, gprd-base-fe-api, gprd-base-fe-web, & gprd-base-fe-web-pages)
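The drain step above depends on sending TSTP only to the 'queues' worker processes and not to the root sidekiq-cluster process. Before running it against production, the PID selection can be checked offline against captured `ps -ef` output. This is a minimal sketch: the helper name queue_pids is hypothetical, and it reuses the same awk pattern as the knife command but reads from standard input.

```shell
# Hypothetical helper: read `ps -ef` style lines on stdin and print the
# unique PIDs (column 2) of processes whose command line matches
# "sidekiq.*queues", the same pattern used in the drain step.
queue_pids() {
  awk '/sidekiq.*queues/ {print $2}' | sort -u
}

# Dry run on a node: list the PIDs that would receive TSTP, without signalling.
# ps -ef | queue_pids
```

Because the filter reads from a pipe, it can be tried on saved `ps` output first. The root sidekiq-cluster process should not appear in the result, provided its command line does not contain the word "queues" after "sidekiq".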

Edited Aug 02, 2019 by Craig Miskell