Upgrade Redis persistent to 6.0 in gstg

Production Change

Change Summary

We are upgrading Redis on the redis-persistent cluster in gstg from 5.0 to 6.0. The motivation is documented in epic &395 (closed).

This first test on gstg also aims to discover issues with the upgrade process, so that we can refine it before moving to gprd.

Change Details

  1. Services Impacted - Service::Redis
  2. Change Technician - @igorwwwwwwwwwwwwwwwwwwww
  3. Change Criticality - C2
  4. Change Reviewer - @craigf
  5. Due Date - 2021-02-12 10:00 UTC
  6. Time tracking - 2h
  7. Downtime Component - Reads will remain available; some writes will be lost during failover.

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1-3 minutes

  • Create MR on chef-repo that bumps omnibus version pin on redis node to 13.9.202101260505-6ddf2ab9a1e.4e39551fc9f.
  • Set up env
    export gitlab_env=gstg
    export gitlab_project=gitlab-staging-1
    export gitlab_redis_cluster=redis
    export gitlab_release=13.9.202101260505-6ddf2ab9a1e.4e39551fc9f
    export gitlab_release_old=12.8.1-ee.0
    export gitlab_production_change=3417
    
    export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
    
    export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
  • Dashboard
  • Logs
    echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/redis/current\')
    echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/sentinel/current\')
  • Silence the RedisReplicasFlapping alert for 90 minutes
  • Disable chef
    cd chef-repo
    knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "hostname"
    knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"
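
Every later step depends on the variables exported above, so a typo or a fresh terminal can silently produce wrong hostnames. The following is a minimal sketch of a sanity check, not part of the original change plan: `check_change_env` is a hypothetical helper name, and it assumes the same three-node naming scheme used for `$hosts`. It fails if a variable is missing and otherwise prints the expected hosts one per line.

```shell
# Hypothetical helper (not in the original plan): fail early if any variable
# that the later steps rely on is unset or empty, then print the host list.
check_change_env() {
  for cce_name in gitlab_env gitlab_project gitlab_redis_cluster \
                  gitlab_release gitlab_release_old gitlab_production_change; do
    # Indirect lookup of the variable named in $cce_name.
    eval "cce_value=\${$cce_name:-}"
    if [ -z "$cce_value" ]; then
      echo "missing: $cce_name" >&2
      return 1
    fi
  done
  # Same expansion as the `hosts` export above, one host per line.
  seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3
}
```

Running `check_change_env` right after the exports, and again after opening any new terminal, gives a quick confirmation before touching a node.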

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

  • Merge aforementioned chef-repo MR
    • DO NOT APPLY WITH CI - that would restart all nodes, or fail if TF does not allow restarts (likely).
  • Ensure cluster is in a good state
    # check that hosts are as expected
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute instances list --format json --filter 'name=({})' | jq -r '.[].selfLink'
    
    # check roles ("slave" [sic])
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    
    # check sentinel status
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
  • Perform disk snapshots
    # check the disks we are about to snapshot
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink'
    
    # snapshot all disks
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink' | xargs -n1 -I{} gcloud --project $gitlab_project compute disks snapshot '{}'
  • Pick host
    export i=01
    
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
  • Backup config files
    ssh $fqdn sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
    ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
    ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
  • Fail over if this node is the master
    # check role
    ssh $fqdn "$redis_cli role | head -n1"
    
    # if role is master, perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
  • Upgrade sentinel and redis
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    
    # get version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    
    # ensure config is written out
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"
    
    # check versions, sentinel quorum, and roles
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    ssh $fqdn "$redis_cli --no-raw role"
    
    # fixup gitlab.rb before installing new packages
    # this _will_ restart processes
    ssh $fqdn sudo sed -i '/^gitlab_kas/d' /etc/gitlab/gitlab.rb
    ssh $fqdn sudo gitlab-ctl reconfigure
    
    # install packages, stepping through intermediate releases before the target version
    # this _might_ restart processes
    ssh $fqdn sudo apt-get update
    ssh $fqdn sudo apt-get install -y "gitlab-ee=12.10.14-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.0.14-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.1.11-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=13.5.3-ee.0"
    ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release"
    
    # reconfigure
    # this _will_ restart processes
    ssh $fqdn sudo gitlab-ctl reconfigure
    ssh $fqdn sudo gitlab-ctl restart sentinel
    ssh $fqdn sudo gitlab-ctl restart redis
    
    # ensure we are running the new version
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
    ssh $fqdn "$redis_cli info | grep ^redis_version:"
    
    # check sentinel status
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
  • Failover to upgraded node
    # check roles
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    
    # check replica priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    
    # take the other nodes out of the pool
    echo $hosts | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 0; echo'
    
    # check replica priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    
    # check roles
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw role; echo'
    
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
    
    # restore replica-priority
    echo $hosts | grep -v ^redis-$i-db-${gitlab_env}$ | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 100; echo'
  • Repeat process for node 02
    export i=02
    
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
  • Repeat process for node 03
    export i=03
    
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
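
The "wait for master to step down and sync" steps above rely on re-running the `role` command by hand until it shows "slave" [sic] and "connected". A minimal sketch of a polling helper follows. It is an illustration rather than part of the original procedure: `wait_for_replica` is a hypothetical name, and it takes the role-printing command as arguments so it makes no assumptions about how the node is reached.

```shell
# Hypothetical helper: poll a role-printing command until it reports a
# replica ("slave") whose replication link is "connected".
# Usage: wait_for_replica <attempts> <sleep_seconds> <command...>
wait_for_replica() {
  wfr_attempts=$1
  wfr_delay=$2
  shift 2
  wfr_n=0
  while [ "$wfr_n" -lt "$wfr_attempts" ]; do
    # Run the supplied command; treat a failed call as "not ready yet".
    wfr_out=$("$@" 2>/dev/null) || wfr_out=""
    wfr_first=$(printf '%s\n' "$wfr_out" | head -n1)
    case "$wfr_first" in
      *slave*)
        # "connecting" and "sync" do not contain the string "connected".
        if printf '%s\n' "$wfr_out" | grep -q 'connected'; then
          return 0
        fi
        ;;
    esac
    wfr_n=$((wfr_n + 1))
    sleep "$wfr_delay"
  done
  return 1
}
```

For example, `wait_for_replica 30 2 ssh "$fqdn" "$redis_cli role"` would retry for about a minute. A non-zero exit means the node never reported a connected replica and should be inspected manually before continuing.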

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

  • Ensure we have the same version everywhere
    # sentinels
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'
    
    # redises
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'
  • Ensure replica-priority is set everywhere
    # inspect replica-priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    
    # set replica-priority if needed
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn "$redis_cli config set replica-priority 100"
  • Clean up backed-up config files
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak
  • Re-enable chef
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn sudo chef-client-enable
    ssh $fqdn sudo chef-client
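
The version check above prints one `redis_version:` line per host and leaves the comparison to the reader. As a sketch under stated assumptions, the comparison can be made mechanical: `all_same_version` is a hypothetical helper that reads such output on stdin, ignores hostnames and blank lines, and succeeds only if exactly one distinct version appears.

```shell
# Hypothetical helper: read `redis_version:X.Y.Z` lines on stdin and succeed
# only if exactly one distinct version is present.
all_same_version() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  asv_versions=$(grep '^redis_version:' | tr -d '\r' | sort -u)
  asv_count=$(printf '%s\n' "$asv_versions" | grep -c '^redis_version:' || true)
  # Show what was found so a mismatch is visible in the terminal.
  printf '%s\n' "$asv_versions"
  [ "$asv_count" -eq 1 ]
}
```

Piping the output of either the sentinel or the redis version command above into `all_same_version` yields a non-zero exit status when the nodes disagree, or when no version line was returned at all.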

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 minutes

  • Pick host
    export i=01
    
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    export disk_self_link="$(gcloud --project $gitlab_project compute disks list --format json --filter "name=(redis-$i-db-${gitlab_env}-data)" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
    echo $disk_self_link
  • Failover away from recently upgraded node
    # check role (expecting "master")
    ssh $fqdn "$redis_cli role | head -n1"
    
    # take this node out of the pool
    ssh $fqdn "$redis_cli config set replica-priority 0"
    
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
  • Rollback (downgrade)
    • Stop chef
      # disable chef-client on every node in the role
      knife ssh "roles:gstg-base-db-redis-server-single AND chef_environment:gstg" "chef-client-disable 'see production change $gitlab_production_change'"
    • Downgrade sentinel
      # get sentinel version
      ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
      
      # ensure config is written out
      ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel flushconfig"
      
      # apply downgrade
      ssh $fqdn sudo apt-get install -y "gitlab-ee=$gitlab_release_old"
      ssh $fqdn sudo gitlab-ctl reconfigure
      
      # fixup config, restart process
      ssh $fqdn sudo gitlab-ctl stop sentinel
      ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
      ssh $fqdn sudo sed -i '/^user /d' /var/opt/gitlab/sentinel/sentinel.conf
      ssh $fqdn sudo gitlab-ctl start sentinel
      
      # ensure we are running the old version
      ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:"
      
      # check sentinel status
      echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
    • Downgrade redis
      # double check that we are dealing with a replica
      ssh $fqdn "$redis_cli --no-raw role"
      
      # get redis version
      ssh $fqdn "$redis_cli info | grep ^redis_version:"
      
      # (no apt-get install or gitlab-ctl reconfigure needed, as this already happened during sentinel downgrade)
      
      # fixup config, restart process
      ssh $fqdn sudo gitlab-ctl stop redis
      ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
      ssh $fqdn sudo sed -i '/^user /d' /var/opt/gitlab/redis/redis.conf
      ssh $fqdn sudo gitlab-ctl start redis
      
      # ensure we are running the old version
      ssh $fqdn "$redis_cli info | grep ^redis_version:"
    • Revert chef-repo MR
  • Alternatively, if needed, we can recover from the disk snapshots
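
The `sed -i '/^user /d'` fixups in the downgrade steps remove every line that starts with `user ` from the sentinel and redis configs. The likely reason, stated here as an assumption rather than something this change documents, is that Redis 6 writes ACL `user` directives that 5.0 cannot parse. The sketch below shows the same edit on an arbitrary file. `strip_user_lines` and the `.pre-downgrade` suffix are hypothetical; the separate suffix is chosen so the copy does not overwrite the `.bak` files taken before the upgrade.

```shell
# Hypothetical helper: drop `user ...` directives from a config file,
# keeping the unmodified file alongside with a .pre-downgrade suffix.
strip_user_lines() {
  sul_conf=$1
  cp "$sul_conf" "$sul_conf.pre-downgrade" || return 1
  # Write from the copy instead of using `sed -i`, whose syntax differs
  # between GNU and BSD sed.
  sed '/^user /d' "$sul_conf.pre-downgrade" > "$sul_conf"
}
```

Trying this on a scratch copy of a config first makes it easy to diff the result against the `.pre-downgrade` file and confirm that only `user` lines were removed.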

Cleanup

  • Ensure we have the same version everywhere
    # sentinels
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 info | grep ^redis_version:; echo'
    
    # redises
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' info | grep ^redis_version:; echo'
  • Ensure replica-priority is set everywhere
    # inspect replica-priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    
    # set replica-priority if needed
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn "$redis_cli config set replica-priority 100"
  • Clean up backed-up config files
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak
  • Re-enable chef
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $fqdn sudo chef-client-enable
    ssh $fqdn sudo chef-client
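
The cleanup commands above are written out for node 01 only, so they have to be repeated with `i=02` and `i=03`. A minimal dry-run sketch that prints the per-node commands for all three nodes is shown below. `print_cleanup_commands` is a hypothetical helper; it only echoes commands, reusing the file paths and hostname pattern from the steps above, so the output can be reviewed before anything is run.

```shell
# Hypothetical helper: print (do not execute) the cleanup and chef
# re-enable commands for every node, using the naming scheme from above.
print_cleanup_commands() {
  for pcc_node in 01 02 03; do
    pcc_fqdn="redis-$pcc_node-db-${gitlab_env}.c.${gitlab_project}.internal"
    echo "ssh $pcc_fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak"
    echo "ssh $pcc_fqdn sudo chef-client-enable"
    echo "ssh $pcc_fqdn sudo chef-client"
  done
}
```

Printing rather than executing keeps the operator in control: each line can be pasted one at a time, which matches the step-by-step style of the rest of this change.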

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.

refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12344
