Enable threaded I/O on redis persistent in gprd

Production Change

Change Summary

This change enables threaded I/O on redis persistent in gprd.

See https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12625.

Change Details

Services Impacted - redis persistent
Change Technician - @igorwwwwwwwwwwwwwwwwwwww
Change Criticality - C2
Change Type - ConfigurationChange
Change Reviewer - @mwasilewski-gitlab
Due Date - 2021-03-10 12:00:00 UTC
Time tracking - 1h
Downtime Component - Some writes will be lost during failover; reads will remain available.

Detailed steps for the change

Preparation

Set the production change issue number in the env vars below.

Setup env

export gitlab_env=gprd
export gitlab_project=gitlab-production
export gitlab_redis_cluster=redis
export gitlab_production_change=3883

export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'

export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
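The `hosts` variable holds one host name per line. Bash word-splits an unquoted `$hosts` and `echo` then joins the names onto a single line, whereas zsh keeps the newlines, so loops built on it can behave differently between shells. A minimal sketch of a shell-independent way to iterate over the same names; `list_hosts` is a hypothetical helper, and the defaults simply mirror the values exported above:

```shell
# Sketch only: list_hosts is a hypothetical helper, not part of the original runbook.
gitlab_env=${gitlab_env:-gprd}
gitlab_project=${gitlab_project:-gitlab-production}
gitlab_redis_cluster=${gitlab_redis_cluster:-redis}

list_hosts() {
  # prints one host name per line, e.g. redis-01-db-gprd
  seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3
}

# Dry run: show the FQDNs the later steps would connect to.
list_hosts | while read -r host; do
  echo "would connect to ${host}.c.${gitlab_project}.internal"
done
```

Reading the names with `while read -r` avoids depending on how the shell splits `$hosts`.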

Dashboard
- https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2

Logs

# source env vars
echo multitail $(echo "$hosts" | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/redis/current\')
echo multitail $(echo "$hosts" | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/sentinel/current\')

Silence the RedisReplicasFlapping alert for 90 minutes

Disable chef

cd chef-repo
knife ssh "roles:gprd-base-db-redis-server-single AND chef_environment:gprd" "hostname"
knife ssh "roles:gprd-base-db-redis-server-single AND chef_environment:gprd" "chef-client-disable 'see production change $gitlab_production_change'"

Capture performance data BEFORE the change

Capture flamegraphs

# find master
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# select master (replace xx with the master's two-digit node number, e.g. 01)
export i=xx
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
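The manual "find master, then set `i`" step could be scripted. A minimal sketch, assuming the role loop above prints each host name followed by its role on the next line and that host names follow the `redis-NN-db-<env>` pattern; `find_master` and `node_number` are hypothetical helpers, not part of the original runbook:

```shell
# find_master: stdin is the "hostname / role / blank line" output of the role loop.
# Prints the host name whose next non-empty line is exactly "master".
find_master() {
  awk 'NF { if (prev != "" && $0 == "master") print prev; prev = $0 }'
}

# node_number: extracts the node number from a host name such as redis-02-db-gprd.
node_number() {
  sed -E 's/^.*-([0-9]+)-db-.*$/\1/'
}

# Demo with canned output; real input would come from the ssh loop above.
sample='redis-01-db-gprd
slave

redis-02-db-gprd
master

redis-03-db-gprd
slave
'
master=$(printf '%s\n' "$sample" | find_master)
echo "master is $master, node number $(printf '%s\n' "$master" | node_number)"
```

The result should still be checked by eye before exporting `i`, since a failover between the check and the capture would change the master.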

# capture perf profiles on master
ssh $fqdn "sudo perf record -a -g --freq 99 -o perf.data -- sleep 120"

# turn captured profiles into flamegraphs
ssh $fqdn 'sudo perf script -F comm,pid,tid,cpu,time,event,ip,sym,dso,trace --header -i perf.data | stackcollapse-perf.pl --kernel --tid | grep "$(pgrep -of bin/redis-server)" | flamegraph.pl --hash --colors=perl > flamegraph.$(hostname).$(date +%Y%m%d_%H%M_%Z).svg'

# capture pidstat
ssh $fqdn 'sudo pidstat -t -p $(pgrep -of bin/redis-server) 1 120 > pidstat.cpu.$(hostname).$(date +%Y%m%d_%H%M_%Z).out'
ssh $fqdn 'sudo pidstat -tw -p $(pgrep -of bin/redis-server) 1 120 > pidstat.task.$(hostname).$(date +%Y%m%d_%H%M_%Z).out'

Apply change

Merge aforementioned chef-repo MR

Ensure cluster is in a good state

# check that hosts are as expected
echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute instances list --format json --filter 'name=({})' | jq -r '.[].selfLink'

# check roles ("slave" [sic])
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# check sentinel status
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
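The sentinel check above is read by eye. A minimal sketch of turning it into a pass/fail gate, assuming a healthy `sentinel ckquorum` reply begins with `OK ` (verify this against the real output before relying on it); `all_quorum_ok` is a hypothetical helper:

```shell
# all_quorum_ok EXPECTED: stdin is the combined hostname/ckquorum output of the loop above.
# Succeeds only if exactly EXPECTED lines start with "OK " (assumed healthy reply prefix).
all_quorum_ok() {
  expected=$1
  ok=$(grep -c '^OK ' || true)   # grep -c still prints 0 when nothing matches
  [ "$ok" -eq "$expected" ]
}

# Hypothetical usage: pipe the output of the sentinel status loop into the helper.
# <sentinel status loop> | all_quorum_ok 3 && echo "quorum reachable on all nodes"
```

Gating on an exact count, rather than on the absence of errors, also catches a node that returned nothing at all.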

Perform disk snapshots

# check the disks we are about to snapshot
echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink'

# snapshot all disks
echo "$hosts" | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink' | xargs -n1 -I{} gcloud --project $gitlab_project compute disks snapshot '{}'

Pick host

export i=01

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Backup config files

ssh $fqdn sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak

Failover if we are a master

# check role
ssh $fqdn "$redis_cli role | head -n1"

# if role is master, perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# wait for master to step down and sync (expect "slave" [sic] and "connected")
ssh $fqdn "$redis_cli --no-raw role"

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
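The "wait for master to step down" step means re-running the role check until it reports the expected role. A minimal polling sketch, assuming the first line of the raw `role` output is the role name; `wait_until` is a hypothetical helper and the retry counts are illustrative:

```shell
# wait_until EXPECTED TRIES DELAY CMD...
# Re-runs CMD until the first line of its output equals EXPECTED.
# Returns 0 on success, 1 if TRIES attempts pass without a match.
wait_until() {
  expected=$1; tries=$2; delay=$3; shift 3
  n=0
  while [ "$n" -lt "$tries" ]; do
    actual=$("$@" 2>/dev/null | head -n1)
    [ "$actual" = "$expected" ] && return 0
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage against the node picked above (30 attempts, 2 seconds apart):
# wait_until slave 30 2 ssh "$fqdn" "$redis_cli role" || echo "node did not step down in time"
```

A bounded retry count keeps the change from hanging if the failover never completes; the sentinel acknowledgement should still be inspected manually.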

Reconfigure Redis

# double check that we are dealing with a replica
ssh $fqdn "$redis_cli --no-raw role"

# check sentinel quorum, and roles
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
ssh $fqdn "$redis_cli --no-raw role"

# ensure current config is as expected
ssh $fqdn "$redis_cli config get io-threads"

# temporarily disable rdb saving to allow for fast restart
ssh $fqdn "$redis_cli config get save"
ssh $fqdn "$redis_cli config set save ''"

# run chef-client
# this _may_ restart processes
ssh $fqdn "sudo chef-client-enable"
ssh $fqdn "sudo chef-client"

# disable rdb saving again (config set is runtime-only, so a restart triggered by chef-client resets it)
ssh $fqdn "$redis_cli config get save"
ssh $fqdn "$redis_cli config set save ''"

# reconfigure
# this _will_ restart processes
ssh $fqdn "sudo gitlab-ctl reconfigure"

# ensure config change took effect
ssh $fqdn "$redis_cli config get io-threads"

# check sync status
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# check sentinel status
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
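The "ensure config change took effect" check above can be made explicit. In Redis 6 the default `io-threads 1` means only the main thread is used, so a value above 1 indicates threaded I/O is on. A minimal sketch, assuming the raw two-line reply of `config get io-threads` (name, then value); `threaded_io_enabled` is a hypothetical helper, and the value the chef-repo MR sets is not stated here:

```shell
# threaded_io_enabled: stdin is the raw output of `redis-cli config get io-threads`.
# Succeeds when the value line (second line) is a number greater than 1.
threaded_io_enabled() {
  value=$(sed -n '2p')
  [ -n "$value" ] && [ "$value" -gt 1 ] 2>/dev/null
}

# Hypothetical usage on the node being reconfigured:
# ssh "$fqdn" "$redis_cli config get io-threads" | threaded_io_enabled && echo "threaded I/O active"
```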

Wait ~1h

Failover to upgraded node

# check roles
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'

# check replica priority
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# take other nodes out of the pool
echo "$hosts" | grep -v "^redis-$i-db-${gitlab_env}\$" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 0; echo'

# check replica priority
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# check roles
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw role; echo'

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"

# restore replica-priority
echo "$hosts" | grep -v "^redis-$i-db-${gitlab_env}\$" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 100; echo'

# check replica priority
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
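The "take other nodes out of the pool" step depends on excluding exactly one host from the list. With an anchored `grep -v`, that only works when the names arrive one per line; if they arrive on a single line, nothing is excluded and the upgraded node would also get priority 0. A minimal sketch of an exclusion that does not depend on the shell; `other_hosts` is a hypothetical helper and the defaults mirror the env vars from the setup step:

```shell
# other_hosts NODE: prints every cluster host except redis-NODE-db-<env>, one per line.
other_hosts() {
  prefix=${gitlab_redis_cluster:-redis}
  env=${gitlab_env:-gprd}
  seq -f "${prefix}-%02g-db-${env}" 1 3 | grep -vxF "${prefix}-$1-db-${env}"
}

# Dry run: list the nodes whose replica-priority would be set to 0 when upgrading node 01.
other_hosts 01
```

`grep -vxF` matches the whole line as a fixed string, so no regex anchors or escaping are needed.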

Wait ~1h

Repeat process for node 02

export i=02

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Repeat process for node 03

export i=03

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Recovery

Pick host

export i=01

export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
export disk_self_link=$(gcloud --project $gitlab_project compute disks list --format json --filter "name=(redis-$i-db-${gitlab_env}-data)" | jq -r '.[].selfLink')

echo $fqdn
echo $host_self_link
echo $disk_self_link

Failover away from recently upgraded node

# check role (expecting "master")
ssh $fqdn "$redis_cli role | head -n1"

# take this node out of the pool
ssh $fqdn "$redis_cli config set replica-priority 0"

# perform failover
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"

# wait for master to step down and sync (expect "slave" [sic] and "connected")
ssh $fqdn "$redis_cli --no-raw role"

# wait for sentinel to ack the master change
ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master ${gitlab_env}-redis"

Revert the chef-repo MR
Run chef-client
Alternatively: If needed, we can recover from disk snapshot

Cleanup

Ensure replica-priority is set everywhere

# inspect replica-priority
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'

# set replica-priority if needed
export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn "$redis_cli config set replica-priority 100"

Cleanup backed-up config files

export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak

Re-enable chef

# check if chef is running on all nodes
echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; sudo systemctl is-active chef-client'

# if needed enable on a given node
export i=01
export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $fqdn 'sudo chef-client-enable'
ssh $fqdn 'sudo chef-client'

Capture performance data AFTER the change

Use the same commands as in "Capture performance data BEFORE the change" (not repeated here).

Monitoring

Key metrics to observe

Metric: SLOs
- Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: SLO violations, or a large change in saturation.

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

Edited Mar 12, 2021 by Igor