Upgrade Redis persistent to 6.0 in gstg
Production Change
Change Summary
We are upgrading Redis on the redis-persistent cluster in gstg from 5.0 to 6.0. The motivation is documented in epic &395 (closed).
This first test on gstg also aims to discover issues with the upgrade process, so that we can refine it before moving to gprd.
Change Details
- Services Impacted - Service::Redis
- Change Technician - @igorwwwwwwwwwwwwwwwwwwww
- Change Criticality - C2
- Change Reviewer - @craigf
- Due Date - 2021-02-12 10:00 UTC
- Time tracking - 2h
- Downtime Component - Reads will remain available, some writes will be lost during failover.
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1-3 minutes
- Create MR on chef-repo that bumps the omnibus version pin on the redis nodes to 13.9.202101260505-6ddf2ab9a1e.4e39551fc9f.
- Setup env
    export gitlab_env=gstg
    export gitlab_project=gitlab-staging-1
    export gitlab_redis_cluster=redis
    export gitlab_release=13.9.202101260505-6ddf2ab9a1e.4e39551fc9f
    export gitlab_release_old=12.8.1-ee.0
    export gitlab_production_change=3417
    export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
    export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
- Dashboard
- Logs
    echo multitail $(echo "$hosts" | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/redis/current\')
    echo multitail $(echo "$hosts" | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/sentinel/current\')
- Silence alert name RedisReplicasFlapping for 90 minutes
- Disable chef
    cd chef-repo
    knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "hostname"
    knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"
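As a quick local sanity check before touching any node, the host list built in the "Setup env" step can be expanded on its own. This is a minimal sketch that only prints names and contacts nothing; it reuses the variable values from the step above and reads `$hosts` line by line so that it behaves the same in bash and zsh.

```shell
# Same values as in the "Setup env" step.
gitlab_env=gstg
gitlab_project=gitlab-staging-1
gitlab_redis_cluster=redis

# seq prints one host name per line: redis-01-db-gstg ... redis-03-db-gstg
hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)

# Print the FQDN that the later ssh commands would target.
# Quoting "$hosts" keeps the newlines regardless of the shell in use.
printf '%s\n' "$hosts" | while read -r h; do
  echo "${h}.c.${gitlab_project}.internal"
done
```

If the printed names do not match the instances you expect, stop here and fix the environment variables first.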
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Merge aforementioned chef-repo MR - DO NOT APPLY WITH CI - that would cause all nodes to be restarted, or to fail if TF does not allow restarts (likely).
- Ensure cluster is in a good state
    # check that hosts are as expected
    echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute instances list --format json --filter 'name=({})' | jq -r '.[].selfLink'
    # check roles ("slave" [sic])
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check sentinel status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
- Perform disk snapshots
    # check the disks we are about to snapshot
    echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink'
    # snapshot all disks
    echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink' | xargs -n1 -I{} gcloud --project $gitlab_project compute disks snapshot '{}'
- Pick host
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Backup config files
    ssh $fqdn sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
    ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
    ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
- Failover if we are a master
    # check role
    ssh $fqdn "$redis_cli role | head -n1"
    # if role is master, perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
- Upgrade sentinel and redis
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    # get version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    # ensure config is written out
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"
    # check versions, sentinel quorum, and roles
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    ssh $fqdn "$redis_cli --no-raw role"
    # fixup gitlab.rb before installing new packages
    # this _will_ restart processes
    ssh $fqdn sudo sed -i '/^gitlab_kas/d' /etc/gitlab/gitlab.rb
    ssh $fqdn sudo gitlab-ctl reconfigure
    # install packages
    # this _might_ restart processes
    ssh $fqdn sudo apt-get update
    ssh $fqdn sudo apt-get install -y "gitlab-ee=12.10.14-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.0.14-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.1.11-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.5.3-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release"
    # reconfigure
    # this _will_ restart processes
    ssh $fqdn sudo gitlab-ctl reconfigure
    ssh $fqdn sudo gitlab-ctl restart sentinel
    ssh $fqdn sudo gitlab-ctl restart redis
    # ensure we are running the new version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    # check sentinel status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
- Failover to upgraded node
    # check roles
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check replica priority
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # take the other nodes out of the pool
    echo "$hosts" | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 0; echo'
    # check replica priority
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # check roles
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw role; echo'
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
    # restore replica-priority
    echo "$hosts" | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 100; echo'
- Repeat process for node 02
    export i=02
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Repeat process for node 03
    export i=03
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
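The "take other node out of the pool" step relies on `grep -v` seeing one host name per line. The sketch below can be run locally (it contacts no hosts) to preview which nodes the filter selects. `other_hosts` is a hypothetical helper name used only for illustration. Note that bash and sh word-split an unquoted `$hosts` and join the names onto one line, while zsh does not split by default, so quoting the variable is the portable form.

```shell
gitlab_env=gstg
hosts=$(seq -f "redis-%02g-db-${gitlab_env}" 1 3)
i=01

# Hypothetical helper: print every host except the one being upgraded.
other_hosts() {
  printf '%s\n' "$hosts" | grep -v "^redis-${i}-db-${gitlab_env}\$"
}

# Quoted form: one name per line, so the anchored pattern removes node $i.
other_hosts

# Unquoted form: in bash/sh the names end up on a single line, the anchored
# pattern matches nothing, and all three names pass through the filter.
echo $hosts | grep -v "^redis-${i}-db-${gitlab_env}\$"
```

Previewing the filtered list this way before running `config set replica-priority 0` helps avoid accidentally lowering the priority of the node you intend to promote.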
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Ensure we have the same version everywhere
    # sentinels
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'
    # redises
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'
- Ensure replica-priority is set everywhere
    # inspect replica-priority
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # set replica-priority if needed
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn "$redis_cli config set replica-priority 100"
- Cleanup backed-up config files
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak
- Re-enable chef
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn sudo chef-client-enable
    ssh $fqdn sudo chef-client
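The version check above is done by eye. As one possible way to make it mechanical, the collected `redis_version:` lines can be piped through a small filter that succeeds only when they all agree. This is a sketch, not part of the original procedure: `all_same_version` is a hypothetical helper, and the version numbers in the sample input are illustrative only.

```shell
# Hypothetical helper: succeed only if every redis_version line on stdin
# carries the same value. Redis INFO output uses CRLF line endings, so
# carriage returns are stripped before comparing.
all_same_version() {
  [ $(grep '^redis_version:' | tr -d '\r' | sort -u | wc -l) -eq 1 ]
}

# Sample input standing in for the output collected from the three nodes.
printf 'redis_version:6.0.10\nredis_version:6.0.10\nredis_version:6.0.10\n' \
  | all_same_version && echo "versions match"

printf 'redis_version:6.0.10\nredis_version:5.0.9\n' \
  | all_same_version || echo "version mismatch"
```

In practice the input would be the combined output of the `info | grep ^redis_version:` commands from the step above.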
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Pick host
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    export disk_self_link=$(gcloud --project $gitlab_project compute disks list --format json --filter "name=(redis-$i-db-${gitlab_env}-data)" | jq -r '.[].selfLink')
    echo $fqdn
    echo $host_self_link
    echo $disk_self_link
- Failover away from recently upgraded node
    # check role (expecting "master")
    ssh $fqdn "$redis_cli role | head -n1"
    # take this node out of the pool
    ssh $fqdn "$redis_cli config set replica-priority 0"
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
- Rollback (downgrade)
- stop chef
    # stop chef
    knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"
- downgrade sentinel
    # get sentinel version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    # ensure config is written out
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"
    # apply downgrade
    ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release_old"
    ssh $fqdn sudo gitlab-ctl reconfigure
    # fixup config, restart process
    ssh $fqdn sudo gitlab-ctl stop sentinel
    ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
    # the whole remote command is quoted so the sed expression (which contains a space) reaches the remote shell intact
    ssh $fqdn "sudo sed -i '/^user /d' /var/opt/gitlab/sentinel/sentinel.conf"
    ssh $fqdn sudo gitlab-ctl start sentinel
    # ensure we are running the old version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    # check sentinel status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
- Downgrade redis
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    # get redis version
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    # (no apt-get install or gitlab-ctl reconfigure needed, as this already happened during sentinel downgrade)
    # fixup config, restart process
    ssh $fqdn sudo gitlab-ctl stop redis
    ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
    # the whole remote command is quoted so the sed expression (which contains a space) reaches the remote shell intact
    ssh $fqdn "sudo sed -i '/^user /d' /var/opt/gitlab/redis/redis.conf"
    ssh $fqdn sudo gitlab-ctl start redis
    # ensure we are running the old version
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
- revert chef-repo MR
- stop chef
- Alternatively: If needed, we can recover from disk snapshot
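Several steps above say "wait for" a role or master change, which in practice means rerunning the check by hand until the output looks right. One possible way to automate that is a small polling helper. This is a sketch under assumptions, not part of the original procedure: `wait_for_output` is a hypothetical name, and the example at the end uses a local `echo` as a stand-in for the remote role check.

```shell
# Hypothetical helper: rerun a command until its output contains a pattern,
# giving up after a fixed number of attempts.
# usage: wait_for_output PATTERN ATTEMPTS DELAY_SECONDS COMMAND...
wait_for_output() {
  pattern=$1
  attempts=$2
  delay=$3
  shift 3
  n=1
  while [ "$n" -le "$attempts" ]; do
    # Succeed as soon as the command output matches the pattern.
    if "$@" 2>/dev/null | grep -q "$pattern"; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Local stand-in for the remote check. During the change the command could be
# something like: wait_for_output slave 30 2 ssh $fqdn "$redis_cli role"
wait_for_output slave 3 0 echo "slave" && echo "node is a replica"
```

A non-zero exit after the last attempt is a signal to stop and investigate rather than continue with the next step.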
Cleanup
- Ensure we have the same version everywhere
    # sentinels
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'
    # redises
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'
- Ensure replica-priority is set everywhere
    # inspect replica-priority
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # set replica-priority if needed
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn "$redis_cli config set replica-priority 100"
- Cleanup backed-up config files
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak
- Re-enable chef
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn sudo chef-client-enable
    ssh $fqdn sudo chef-client
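The cleanup and chef steps above are written out for node 01 only. Assuming they are meant to be repeated for nodes 02 and 03 (the issue does not say so explicitly), a dry-run loop like the following prints the commands for each node so they can be reviewed before being run. It only echoes text and does not connect anywhere.

```shell
gitlab_env=gstg
gitlab_project=gitlab-staging-1

# Print (do not run) the cleanup and chef re-enable commands for each node.
for i in 01 02 03; do
  fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
  echo "ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak"
  echo "ssh $fqdn sudo chef-client-enable"
  echo "ssh $fqdn sudo chef-client"
done
```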
Monitoring
Key metrics to observe
- Metric: Redis SLOs
- Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: Significant increases in latency, error rates, saturation.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12344
Edited by Igor