Build/cutover to new Redis-Sidekiq instances
C1
Production Change - Criticality 1
Change Objective | Cutover to new redis-sidekiq cluster |
---|---|
Change Type | Architecture change |
Services Impacted | ~"Service:Redis" ~"Service:Sidekiq" |
Change Team Members | @craig @cmiskell |
Change Severity | C1 |
Buddy check | A colleague will review the change |
Tested in staging | The change was tested on the staging environment. See infrastructure#7199 for progress/details |
Schedule of the change | 2019-07-22 @ 0200UTC |
Duration of the change | TBD (time to execute the change, including a possible rollback) |
Downtime Component | No - jobs will continue to queue, and processing will resume by pulling jobs from the new Redis queue once the chef attributes are updated and applied |
Implementation
Pre-conditions
- Stand up a new Redis Primary/Secondary/Sentinel fleet
Execution
- Set up/apply roles for the current and temporary Sidekiq nodes (`sidekiq-keep` & `sidekiq-tmp`) so that each can be targeted independently via `knife ssh`
- Add the `redis_queues_sentinels` attribute to role files (with the current sentinel hosts)
- Validate functionality using the `redis_queues_sentinels` attribute value
- Update node counts to double the current values (scale up)
- Ensure `sidekiq-tmp` nodes are processing jobs
- Stop chef-client on all Sidekiq nodes
- Drain Sidekiq jobs on `sidekiq-keep` nodes
  NB: TSTP must be sent to the 'queues' processes, not the root sidekiq-cluster process
  `knife ssh 'roles:sidekiq-keep AND environment:gprd' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'`
  Monitor queues at https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs. In the 'busy' tab, we expect the keep nodes (lower numeric indexes) to be marked `quiet` and the tmp nodes not to be. The enqueued job count should stay low and fluctuate; if it rises substantially, the tmp cluster may not be processing jobs
- Update the `redis_queues_sentinels` role attribute in `gprd-base.json` with the new Redis Sentinel node names
  - Edit `roles/gprd-base.json` in chef-repo
  - Change prefixes from e.g. `redis-01-db-gprd.c.GCP_PROJECT.internal` to `redis-sidekiq-01-db-gprd.c.GCP_PROJECT.internal`
  - Run `knife role from file roles/gprd-base.json` (and submit an MR to finalize the change)
- Update the `redis_queues_instance` attribute in the `gitlab-omnibus-secrets` GKMS vault with the new Redis Sentinel node names
  - Run `./bin/gkms-vault-edit gitlab-omnibus-secrets gprd` to update the vault for `gprd`
  - Nested under `omnibus-gitlab[:gitlab_rb][:gitlab-rails]`, add a new attribute named `redis_queues_instance` with a value of `redis://:PASSWORD@gprd-redis-sidekiq`, where `PASSWORD` is the same as the value of the `redis_password` attribute, which is also nested under `omnibus-gitlab[:gitlab_rb][:gitlab-rails]`
- Re-enable/run chef-client on `sidekiq-keep` and api, web, & web-pages nodes (transition to the new Redis nodes)
  `knife ssh 'roles:sidekiq-keep AND environment:gprd' 'sudo chef-client'`
  `knife ssh 'roles:gprd-base-fe-api OR roles:gprd-base-fe-web OR roles:gprd-base-fe-web-pages' 'sudo chef-client'`
  - On the background jobs page, the tmp nodes should disappear, leaving only the keep nodes, which should no longer be marked 'quiet'.
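The drain and Sentinel-rename steps above can be rehearsed before touching production. The following is a minimal sketch and not part of the original plan. The function names `sidekiq_queue_pids` and `to_sidekiq_redis_host` are hypothetical. The first reuses the `awk` filter from the drain command but reads `ps -ef`-style text from standard input, and the second assumes every old hostname follows the `redis-NN-db-gprd` pattern shown in the example.

```shell
#!/bin/sh
# Hypothetical dry-run helpers: neither function sends signals or edits files.

# Print the unique PIDs that the drain command would signal.
# Reads `ps -ef`-style lines on stdin (the PID is the second column).
sidekiq_queue_pids() {
  awk '/sidekiq.*queues/ {print $2}' | sort -u
}

# Rewrite old Redis hostnames to the new redis-sidekiq prefix.
# Assumes old names match redis-NN-db-gprd, as in the example above;
# names that already carry the redis-sidekiq prefix are left unchanged.
to_sidekiq_redis_host() {
  sed -E 's/redis-([0-9]+)-db-gprd/redis-sidekiq-\1-db-gprd/g'
}
```

For example, `ps -ef | sidekiq_queue_pids` lists candidate PIDs on a node without signalling them, and `to_sidekiq_redis_host < roles/gprd-base.json | diff roles/gprd-base.json -` previews the hostname changes. On a live host the `awk` process's own command line also matches the pattern, so the list may include its PID.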
Post-execution/validation
- Wait for the `sidekiq-tmp` fleet to finish all its pending jobs (leave it for a few hours)
- Update node counts to original values (scale down)
Rollback
- (If necessary) Update node counts to double the current values (scale up) and (re-)apply the temporary role(s)
- Ensure `sidekiq-tmp` nodes are processing jobs
- Stop chef-client on all Sidekiq nodes
- Drain Sidekiq jobs on `sidekiq-keep` nodes
  NB: TSTP must be sent to the 'queues' processes, not the root sidekiq-cluster process
  `knife ssh 'roles:sidekiq-keep AND environment:gprd' $'for pid in $(ps -ef|awk \'/sidekiq.*queues/ {print $2}\'|sort -u); do echo "Sending TSTP signal to ${pid}..."; sudo kill -TSTP $pid; done'`
  Monitor queues at https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
- Update the `redis_queues_sentinels` role attribute in `gprd-base.json` with the prior Redis Sentinel node names
- Remove the `"redis_queues_instance"` key from the `gitlab-omnibus-secrets` GKMS vault
- Remove `/var/opt/gitlab/gitlab-rails/etc/redis.queues.yml` and restart Sidekiq:
  `for role in gprd-base-be-sidekiq; do knife ssh "roles:${role}" "sudo rm /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml && sudo gitlab-ctl restart sidekiq-cluster"; done`
- Remove `/var/opt/gitlab/gitlab-rails/etc/redis.queues.yml` and restart Unicorn:
  `for role in gprd-base-fe-{api,web{,-pages}}; do knife ssh "roles:${role}" "sudo rm /var/opt/gitlab/gitlab-rails/etc/redis.queues.yml && sudo gitlab-ctl restart unicorn"; done`
- Re-enable/run chef-client on all nodes (minimally, `gprd-base-be-sidekiq`, `gprd-base-fe-api`, `gprd-base-fe-web`, & `gprd-base-fe-web-pages`)
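The two cleanup loops in the rollback differ only in the role targeted and the service restarted, so it may help to print the exact commands before running them. This is a minimal sketch, assuming the four role names listed above. The function name `rollback_command` and the `REDIS_QUEUES_YML` variable are hypothetical, and nothing here invokes `knife`.

```shell
#!/bin/sh
# Dry run only: prints the rollback commands instead of executing them.
REDIS_QUEUES_YML=/var/opt/gitlab/gitlab-rails/etc/redis.queues.yml

# Print the knife command for one role (hypothetical helper).
rollback_command() {
  role="$1"
  case "$role" in
    gprd-base-be-sidekiq) service="sidekiq-cluster" ;;  # Sidekiq fleet
    gprd-base-fe-*)       service="unicorn" ;;          # api, web, web-pages
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  printf 'knife ssh "roles:%s" "sudo rm %s && sudo gitlab-ctl restart %s"\n' \
    "$role" "$REDIS_QUEUES_YML" "$service"
}

# Print one command per role named in the rollback steps.
for role in gprd-base-be-sidekiq gprd-base-fe-api gprd-base-fe-web gprd-base-fe-web-pages; do
  rollback_command "$role"
done
```

Reviewing the printed commands with the buddy-check colleague before pasting them keeps the role-to-service mapping explicit without relying on shell brace expansion.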
Edited by Craig Miskell