Enable threaded I/O on redis persistent in gprd
Production Change
Change Summary
This change enables threaded I/O on redis persistent in gprd.
See https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12625.
Change Details
- Services Impacted - redis persistent
- Change Technician - @igorwwwwwwwwwwwwwwwwwwww
- Change Criticality - C2
- Change Type - ConfigurationChange
- Change Reviewer - @mwasilewski-gitlab
- Due Date - 2021-03-10 12:00:00 UTC
- Time tracking - 1h
- Downtime Component - Some writes will be lost during failover; reads will remain available.
Detailed steps for the change
Preparation
- Set the production change issue in the env vars below
- Setup env
    export gitlab_env=gprd
    export gitlab_project=gitlab-production
    export gitlab_redis_cluster=redis
    export gitlab_production_change=3883
    export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
    export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
- Dashboard
- Logs
    # source env vars
    echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/redis/current\')
    echo multitail $(echo $hosts | xargs -n1 -I{} echo -l \'ssh {}.c.${gitlab_project}.internal sudo tail -F /var/log/gitlab/sentinel/current\')
- Silence alert name RedisReplicasFlapping for 90 minutes
- Disable chef
    cd chef-repo
    knife ssh "roles:gprd-base-db-redis-server-single AND chef_environment:gprd" "hostname"
    knife ssh "roles:gprd-base-db-redis-server-single AND chef_environment:gprd" "chef-client-disable 'see production change $gitlab_production_change'"
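Since every later step iterates over `hosts`, it may be worth confirming that the list expands to the three expected node names before going further. The following is a minimal sketch, assuming the `<cluster>-NN-db-<env>` naming scheme from the setup step; `redis_hosts` is a hypothetical helper and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one Redis host name per line, following the
# "<cluster>-NN-db-<env>" naming scheme used in the setup step.
redis_hosts() {
  local cluster="$1" env="$2" count="${3:-3}"
  seq -f "${cluster}-%02g-db-${env}" 1 "$count"
}

# Example: list the hosts for the persistent Redis cluster in gprd.
redis_hosts redis gprd
```

Comparing this output with the value of `$hosts` is a cheap check that the environment variables were exported in the current shell.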
Capture performance data BEFORE the change
- Capture flamegraphs
    # find master
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # select master
    export i=xx
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    # capture perf profiles on master
    ssh $fqdn "sudo perf record -a -g --freq 99 -o perf.data -- sleep 120"
    # turn captured profiles into flamegraphs
    ssh $fqdn 'sudo perf script -F comm,pid,tid,cpu,time,event,ip,sym,dso,trace --header -i perf.data | stackcollapse-perf.pl --kernel --tid | grep $(pgrep -of bin/redis-server) | flamegraph.pl --hash --colors=perl > flamegraph.$(hostname).$(date +%Y%m%d_%H%M_%Z).svg'
    # capture pidstat
    ssh $fqdn 'sudo pidstat -t -p $(pgrep -of bin/redis-server) 1 120 > pidstat.cpu.$(hostname).$(date +%Y%m%d_%H%M_%Z).out'
    ssh $fqdn 'sudo pidstat -tw -p $(pgrep -of bin/redis-server) 1 120 > pidstat.task.$(hostname).$(date +%Y%m%d_%H%M_%Z).out'
Apply change
- Merge aforementioned chef-repo MR
- Ensure cluster is in a good state
    # check that hosts are as expected
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute instances list --format json --filter 'name=({})' | jq -r '.[].selfLink'
    # check roles ("slave" [sic])
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check sentinel status
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
- Perform disk snapshots
    # check the disks we are about to snapshot
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink'
    # snapshot all disks
    echo $hosts | xargs -n1 -I{} gcloud --project $gitlab_project compute disks list --format json --filter 'name=({}-data)' | jq -r '.[].selfLink' | xargs -n1 -I{} gcloud --project $gitlab_project compute disks snapshot '{}'
- Pick host
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Backup config files
    ssh $fqdn sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
    ssh $fqdn sudo cp /var/opt/gitlab/sentinel/sentinel.conf /var/opt/gitlab/sentinel/sentinel.conf.bak
    ssh $fqdn sudo cp /var/opt/gitlab/redis/redis.conf /var/opt/gitlab/redis/redis.conf.bak
- Failover if we are a master
    # check role
    ssh $fqdn "$redis_cli role | head -n1"
    # if role is master, perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
- Reconfigure Redis
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    # check sentinel quorum, and roles
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
    ssh $fqdn "$redis_cli --no-raw role"
    # ensure current config is as expected
    ssh $fqdn "$redis_cli config get io-threads"
    # temporarily disable rdb saving to allow for fast restart
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config set save ''"
    # run chef-client
    # this _may_ restart processes
    ssh $fqdn "sudo chef-client-enable"
    ssh $fqdn "sudo chef-client"
    # temporarily disable rdb saving again (in case chef-client restarted Redis) to allow for fast restart
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config set save ''"
    # reconfigure
    # this _will_ restart processes
    ssh $fqdn "sudo gitlab-ctl reconfigure"
    # ensure config change took effect
    ssh $fqdn "$redis_cli config get io-threads"
    # check sync status
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check sentinel status
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-redis"
- Wait ~1h
- Failover to upgraded node
    # check roles
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check replica priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # take other nodes out of the pool
    # (the host list is split onto separate lines first; otherwise, in shells that word-split $hosts, grep sees a single line and filters nothing)
    echo $hosts | tr ' ' '\n' | grep -v "^redis-$i-db-${gitlab_env}\$" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 0; echo'
    # check replica priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # check roles
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw role; echo'
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --no-raw sentinel master ${gitlab_env}-redis"
    # restore replica-priority
    echo $hosts | tr ' ' '\n' | grep -v "^redis-$i-db-${gitlab_env}\$" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config set replica-priority 100; echo'
    # check replica priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
- Wait ~1h
- Repeat process for node 02
    export i=02
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Repeat process for node 03
    export i=03
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
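The "take other nodes out of the pool" step relies on excluding exactly one host from the list, and that filter only works when each host name sits on its own line. A minimal sketch of one way to make this explicit follows; `other_hosts` is a hypothetical helper, and the host names are simply the ones the setup step would generate for gprd.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every host except the one being upgraded,
# one per line. The first argument is the host to exclude; the remaining
# arguments are the full host list.
other_hosts() {
  local target="$1"; shift
  # printf emits each argument on its own line, so grep can match whole
  # host names exactly (-x: whole line, -F: literal string, -v: invert).
  printf '%s\n' "$@" | grep -vxF "$target"
}

hosts="redis-01-db-gprd redis-02-db-gprd redis-03-db-gprd"
# $hosts is intentionally unquoted so that bash splits it into words.
other_hosts redis-01-db-gprd $hosts
```

The output can be piped into the same `xargs -n1 -I{} ssh ...` invocation used in the steps above.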
Recovery
- Pick host
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(redis-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    export disk_self_link=$(gcloud --project $gitlab_project compute disks list --format json --filter "name=(redis-$i-db-${gitlab_env}-data)" | jq -r '.[].selfLink')
    echo $fqdn
    echo $host_self_link
    echo $disk_self_link
- Failover away from recently upgraded node
    # check role (expecting "master")
    ssh $fqdn "$redis_cli role | head -n1"
    # take this node out of the pool
    ssh $fqdn "$redis_cli config set replica-priority 0"
    # perform failover
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-redis"
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    ssh $fqdn "$redis_cli --no-raw role"
    # wait for sentinel to ack the master change
    ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master ${gitlab_env}-redis"
- Revert the chef-repo MR
- Run chef-client
- Alternatively: if needed, we can recover from a disk snapshot
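Both the rollout and the recovery include a "wait for master to step down" step that is performed by re-running the role check by hand. A polling loop could take over that waiting. The sketch below assumes that the first line of raw `role` output is the role name, as the `role | head -n1` checks above rely on; `wait_for_role` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a command until the first line of its output
# equals the expected role, or give up after a number of attempts.
# Usage: wait_for_role <expected> <attempts> <delay_seconds> <command...>
wait_for_role() {
  local expected="$1" attempts="$2" delay="$3"; shift 3
  local n role
  for ((n = 1; n <= attempts; n++)); do
    # Run the supplied command and keep only the first output line.
    role="$("$@" 2>/dev/null | head -n1)"
    if [ "$role" = "$expected" ]; then
      echo "role is ${role} after ${n} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out waiting for role ${expected} (last seen: ${role:-none})" >&2
  return 1
}

# Example with the real command (not executed here):
#   wait_for_role slave 30 2 ssh "$fqdn" "$redis_cli role"
```

Passing the command as arguments keeps the loop independent of how the role is obtained, so it can be exercised with a stub before being pointed at a production node.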
Cleanup
- Ensure replica-priority is set everywhere
    # inspect replica-priority
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' --no-raw config get replica-priority; echo'
    # set replica-priority if needed
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn "$redis_cli config set replica-priority 100"
- Cleanup backed-up config files
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn sudo rm /etc/gitlab/gitlab.rb.bak /var/opt/gitlab/sentinel/sentinel.conf.bak /var/opt/gitlab/redis/redis.conf.bak
- Re-enable chef
    # check if chef is running on all nodes
    echo $hosts | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; sudo systemctl is-active chef-client'
    # if needed enable on a given node
    export i=01
    export fqdn="redis-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $fqdn 'sudo chef-client-enable'
    ssh $fqdn 'sudo chef-client'
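The replica-priority and io-threads checks above are read by eye. If a scripted check is preferred, something along these lines might work. It assumes raw (non `--no-raw`) `config get` output, which redis-cli prints as two lines when its output is piped: the parameter name, then the value. `config_value` and `expect_config` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the value from raw `config get <name>` output
# read on stdin (line 1 is the parameter name, line 2 is its value).
config_value() {
  sed -n '2p'
}

# Hypothetical helper: succeed only if the value read from stdin equals
# the expected one; otherwise report the mismatch on stderr.
expect_config() {
  local expected="$1" actual
  actual="$(config_value)"
  if [ "$actual" != "$expected" ]; then
    echo "expected ${expected}, got ${actual:-nothing}" >&2
    return 1
  fi
}

# Example with the real command (not executed here):
#   ssh "$fqdn" "$redis_cli config get replica-priority" | expect_config 100
```

A non-zero exit status makes it easy to stop a loop over `$hosts` at the first node that still needs its priority restored.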
Capture performance data AFTER the change
- Run the same commands as in "Capture performance data BEFORE the change" (not repeated here)
Monitoring
Key metrics to observe
- Metric: SLOs
- Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- What changes to this metric should prompt a rollback: SLO violations, or a large change in saturation.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.