Upgrade Redis persistent to 6.0 in gstg

Production Change

Change Summary

We are upgrading Redis on the redis-persistent cluster in gstg from 5.0 to 6.0. The motivation is documented in epic &395.

This first test on gstg also aims to discover issues with the upgrade process, so that we can refine it before moving to gprd.

Change Details

Services Impacted - Service::Redis
Change Technician - @igorwwwwwwwwwwwwwwwwwwww
Change Criticality - C2
Change Reviewer - @craigf
Due Date - 2021-02-12 10:00 UTC
Time tracking - 2h
Downtime Component - Reads will remain available; some writes will be lost during failover.

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1-3 minutes

Create MR on chef-repo that bumps omnibus version pin on redis node to 13.9.202101260505-6ddf2ab9a1e.4e39551fc9f.

Setup env

export gitlab_env=gstg
export gitlab_project=gitlab-staging-1
export gitlab_redis_cluster=redis
export gitlab_release=13.9.202101260505-6ddf2ab9a1e.4e39551fc9f
export gitlab_release_old=12.8.1-ee.0
export gitlab_production_change=3417

export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
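
The `redis_cli` one-liner above reads the password from the `requirepass` line of `redis.conf` and hands it to `redis-cli` through the `REDISCLI_AUTH` environment variable. A minimal sketch of just the extraction step, run against a throwaway file (the file contents and the password `s3cret` are made up for illustration):

```shell
# Build a throwaway config that mimics the relevant line of redis.conf.
# The contents here are illustrative, not taken from a real node.
conf=$(mktemp)
printf 'port 6379\nrequirepass "s3cret"\n' > "$conf"

# Same pipeline as in redis_cli: take the second space-separated field
# of the requirepass line and strip the surrounding double quotes.
password=$(grep ^requirepass "$conf" | cut -d" " -f2 | tr -d \")
echo "$password"

rm -f "$conf"
```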

export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)

Dashboard
- https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-sigma=2

Logs

echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/redis/current\')
echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/sentinel/current\')

Silence the RedisReplicasFlapping alert for 90 minutes

Disable chef

cd chef-repo
knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "hostname"
knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

Merge aforementioned chef-repo MR
- DO NOT APPLY WITH CI - that would cause all nodes to be restarted, or to fail if TF does not allow restarts (likely).

Ensure cluster is in a good state

# check that hosts are as expected
echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute instances list --format json --filter 'name=({})' | jq -r '.[].selfLink'

# check roles ("slave" [sic])
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# check sentinel status
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
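
The role check above is read by eye. As a sketch only, the same check could be made mechanical with a small helper that reads one role per line and succeeds when exactly one node reports `master`. The helper name `exactly_one_master` is hypothetical and not part of the original procedure:

```shell
# Hypothetical helper: reads role output on stdin (the first line of
# `redis-cli role` from each node) and succeeds only if exactly one
# line is "master". Other lines (hostnames, blanks) are ignored.
exactly_one_master() {
  [ "$(tr -d '\r' | grep -c '^master$')" -eq 1 ]
}

# Example with canned input; in practice the input would come from the
# per-host `role | head -n1` output.
if printf 'master\nslave\nslave\n' | exactly_one_master; then
  echo "cluster has a single master"
fi
```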

Perform disk snapshots

# check the disks we are about to snapshot
echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink'

# snapshot all disks
echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink' | xargs -n1 -I{} gcloud --project $gitlab_project compute disks snapshot '{}'

Pick host

export i=01

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Backup config files

ssh $fqdn sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak

Failover if we are a master

# check role
ssh $fqdn "$redis_cli role | head -n1"

# if role is master, perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# wait for master to step down and sync (expect "slave" [sic] and "connected")
ssh $fqdn "$redis_cli --no-raw role"

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
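
The two "wait for" steps above are re-run by hand until the output looks right. A minimal polling sketch, assuming a bounded number of attempts is acceptable; `wait_until` is a hypothetical helper, and the attempt count and delay are placeholders to be chosen by the operator:

```shell
# wait_until TRIES DELAY CMD...: re-run CMD until it succeeds.
# Gives up with a non-zero exit after TRIES attempts, sleeping DELAY
# seconds after each failed attempt.
wait_until() {
  local tries=$1 delay=$2
  shift 2
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_until 30 2 ssh $fqdn "$redis_cli role | head -n1 | grep -qx slave"` could poll until the node reports the replica role.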

Upgrade sentinel and redis

# double check that we are dealing with a replica
ssh $fqdn "$redis_cli --no-raw role"

# get version
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
ssh $fqdn "$redis_cli info | grep ^redis_version:"

# ensure config is written out
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"

# check versions, sentinel quorum, and roles
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
ssh $fqdn "$redis_cli info | grep ^redis_version:"
ssh $fqdn "$redis_cli --no-raw role"

# fixup gitlab.rb before installing new packages
# this _will_ restart processes
ssh $fqdn sudo sed -i '/^gitlab_kas/d' /etc/gitlab/gitlab.rb
ssh $fqdn sudo gitlab-ctl reconfigure

# install packages
# this _might_ restart processes
ssh $fqdn sudo apt-get update
ssh $fqdn sudo apt-get install -y "gitlab-ee=12.10.14-ee.0"
ssh $fqdn sudo apt-get install -y "gitlab-ee=13.0.14-ee.0"
ssh $fqdn sudo apt-get install -y "gitlab-ee=13.1.11-ee.0"
ssh $fqdn sudo apt-get install -y "gitlab-ee=13.5.3-ee.0"
ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release"
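
The package is installed in several steps rather than jumping straight to `$gitlab_release`, presumably to follow GitLab's supported upgrade path between versions. As a sketch, the same sequence can be expressed as a loop that stops at the first failed step. The helper name `install_upgrade_path` and the `RUN` dry-run switch are hypothetical conventions, not part of the original procedure:

```shell
# Hypothetical helper: install each version in order on one host,
# stopping at the first failure so later steps are not attempted.
# RUN defaults to "echo" (dry run: commands are printed, not executed);
# set RUN=command to actually run them.
install_upgrade_path() {
  local host=$1 version
  shift
  for version in "$@"; do
    ${RUN:-echo} ssh "$host" sudo apt-get install -y "gitlab-ee=$version" || return 1
  done
}

# Dry run with the intermediate versions from the steps above and a
# placeholder host name.
install_upgrade_path redis-01.example.internal \
  12.10.14-ee.0 13.0.14-ee.0 13.1.11-ee.0 13.5.3-ee.0
```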

# reconfigure
# this _will_ restart processes
ssh $fqdn sudo gitlab-ctl reconfigure
ssh $fqdn sudo gitlab-ctl restart sentinel
ssh $fqdn sudo gitlab-ctl restart redis

# ensure we are running the new version
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
ssh $fqdn "$redis_cli info | grep ^redis_version:"

# check sentinel status
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"

Failover to upgraded node

# check roles
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# check replica priority
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# take the other nodes out of the pool
# ("$hosts" is quoted so that grep sees one host per line)
echo "$hosts" | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 0; echo'

# check replica priority
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# check roles
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw role; echo'

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"

# restore replica-priority
echo "$hosts" | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 100; echo'
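
Selecting "every host except the one being upgraded" depends on the host list reaching `grep -v` one name per line. If the whole list arrives as a single space-separated line, the anchored pattern matches nothing and no host is excluded. A minimal sketch of that filter, with a hypothetical helper name `other_hosts` and a separate variable so the real `$hosts` is not touched:

```shell
# Hypothetical helper: print every host from a newline-separated list
# except the one given as the first argument (exact, whole-line match).
other_hosts() {
  local current=$1 list=$2
  printf '%s\n' "$list" | grep -vxF "$current"
}

example_hosts=$(seq -f "redis-%02g-db-gstg" 1 3)
other_hosts redis-01-db-gstg "$example_hosts"
# prints:
# redis-02-db-gstg
# redis-03-db-gstg
```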

Repeat process for node 02

export i=02

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Repeat process for node 03

export i=03

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

Ensure we have the same version everywhere

# sentinels
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'

# redises
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'
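
Comparing the version lines by eye works for three nodes. As a sketch, the comparison can also be scripted. The helper below is hypothetical and simply checks that every `redis_version:` line it is fed carries the same value:

```shell
# Hypothetical helper: reads `info` output (from one or more nodes) on
# stdin and succeeds only if all redis_version: lines are identical.
# tr strips the carriage returns that INFO output carries.
all_same_version() {
  [ "$(tr -d '\r' | grep '^redis_version:' | sort -u | wc -l)" -eq 1 ]
}

# Canned input for illustration; the version numbers are sample values.
printf 'redis_version:6.0.10\r\nredis_version:6.0.10\r\n' | all_same_version \
  && echo "versions match"
```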

Ensure replica-priority is set everywhere

# inspect replica-priority
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# set replica-priority if needed
export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn "$redis_cli config set replica-priority 100"

Cleanup backed-up config files

export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak

Re-enable chef

export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn sudo chef-client-enable
ssh $fqdn sudo chef-client

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 minutes

Pick host

export i=01

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
export disk_self_link=$(gcloud --project $gitlab_project compute disks list --format json --filter "name=(redis-$i-db-${gitlab_env}-data)" | jq -r '.[].selfLink')

echo $fqdn
echo $host_self_link
echo $disk_self_link

Failover away from recently upgraded node

# check role (expecting "master")
ssh $fqdn "$redis_cli role | head -n1"

# take this node out of the pool
ssh $fqdn "$redis_cli config set replica-priority 0"

# perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# wait for master to step down and sync (expect "slave" [sic] and "connected")
ssh $fqdn "$redis_cli --no-raw role"

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"

Rollback (downgrade)

Stop chef

# stop chef
knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"

Downgrade sentinel

# get sentinel version
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"

# ensure config is written out
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"

# apply downgrade
ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release_old"
ssh $fqdn sudo gitlab-ctl reconfigure

# fixup config, restart process
ssh $fqdn sudo gitlab-ctl stop sentinel
ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
ssh $fqdn sudo sed -i '/^user /d' /var/opt/gitlab/sentinel/sentinel.conf
ssh $fqdn sudo gitlab-ctl start sentinel
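
The `sed` step removes every line starting with `user ` from the config before the old binary starts, presumably because the 6.0 process wrote ACL `user` directives that 5.0 does not understand. A sketch of the effect on a throwaway file (the file contents are made up for illustration):

```shell
# Throwaway file standing in for sentinel.conf; contents are illustrative only.
conf=$(mktemp)
printf 'port 26379\nuser default on nopass ~* +@all\nsentinel deny-scripts-reconfig yes\n' > "$conf"

# Same expression as in the rollback step, but without -i, so the file
# is left untouched and the result is only printed.
cleaned=$(sed '/^user /d' "$conf")
printf '%s\n' "$cleaned"

rm -f "$conf"
```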

# ensure we are running the old version
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"

# check sentinel status
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"

Downgrade redis

# double check that we are dealing with a replica
ssh $fqdn "$redis_cli --no-raw role"

# get redis version
ssh $fqdn "$redis_cli info | grep ^redis_version:"

# (no apt-get install or gitlab-ctl reconfigure needed, as this already happened during the sentinel downgrade)

# fixup config, restart process
ssh $fqdn sudo gitlab-ctl stop redis
ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
ssh $fqdn sudo sed -i '/^user /d' /var/opt/gitlab/redis/redis.conf
ssh $fqdn sudo gitlab-ctl start redis

# ensure we are running the old version
ssh $fqdn "$redis_cli info | grep ^redis_version:"

Revert chef-repo MR

Alternatively, if needed, we can recover from the disk snapshots.

Cleanup

Ensure we have the same version everywhere

# sentinels
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'

# redises
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'

Ensure replica-priority is set everywhere

# inspect replica-priority
echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# set replica-priority if needed
export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn "$redis_cli config set replica-priority 100"

Cleanup backed-up config files

export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak

Re-enable chef

export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn sudo chef-client-enable
ssh $fqdn sudo chef-client

Monitoring

Key metrics to observe

Metric: Redis SLOs
- Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: Significant increases in latency, error rates, saturation.

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12344

Edited Feb 12, 2021 by Igor