Enable threaded I/O on redis-sidekiq gprd
Production Change
Change Summary
We are enabling threaded I/O on redis-sidekiq in production. It has been enabled on the redis-persistent cluster for several months and on redis-sidekiq gstg for several weeks.
We expect a slight reduction in CPU utilization on the main thread, especially under load. This can be measured as a reduction in system time consumed, as some of the write syscalls are fanned out to a dedicated I/O thread pool.
Change Details
- Services Impacted - ServiceRedis (redis-sidekiq)
- Change Technician - @igorwwwwwwwwwwwwwwwwwwww
- Change Criticality - C3
- Change Type - changeunscheduled
- Change Reviewer - @rehab
- Due Date - 2021-05-06 13:00 UTC
- Time tracking - 1h
- Downtime Component - Reads will remain available. There will be a minimal window of data loss during controlled failover.
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45 mins
- Merge and apply chef change
- Setup env
    export gitlab_env=gprd
    export gitlab_project=gitlab-production
    export gitlab_redis_cluster=redis-sidekiq
    export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
    export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
- Start with host 01
    export i=01
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Failover if we are a master
    # if role is master, perform failover
    if [[ "$(ssh $fqdn "$redis_cli role | head -n1")" = "master" ]]; then
      echo failing over
      ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-${gitlab_redis_cluster}"
    fi
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    while ! [[ "$(ssh $fqdn "$redis_cli role" | head -n1)" = "slave" ]]; do echo waiting for stepdown; sleep 1; done
    while ! [[ "$(ssh $fqdn "$redis_cli --raw role" | tail -n +4 | head -n1)" = "connected" ]]; do echo waiting for sync; sleep 1; done
    # wait for sentinel to ack the master change
    while [[ "$(ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --raw sentinel master ${gitlab_env}-${gitlab_redis_cluster}" | grep -A1 ^ip$ | tail -n +2 | awk '{ print substr($0, length($0)-1) }')" = "$i" ]]; do echo waiting for sentinel; sleep 1; done
- Reconfigure
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    # check sentinel quorum, and roles
    # ($hosts is quoted so that each host stays on its own line for xargs -I)
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
    ssh $fqdn "$redis_cli --no-raw role"
    # ensure current config is as expected
    ssh $fqdn "$redis_cli config get io-threads"
    # run chef-client
    ssh $fqdn "sudo chef-client"
    # temporarily disable rdb saving to allow for fast restart
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config set save ''"
    # reconfigure
    # this _will_ restart processes
    ssh $fqdn "sudo gitlab-ctl reconfigure"
    # ensure config change took effect
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config get io-threads"
    # check sync status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '$redis_cli' role | head -n1; echo'
    # check sentinel status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
- Repeat process for node 02
    export i=02
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Complete reconfigure on node 02
- Repeat process for node 03
    export i=03
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    echo $fqdn
    echo $host_self_link
- Complete reconfigure on node 03
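The wait loops in the failover step poll without an upper bound, so a failover that never completes would leave the session hanging. As a rough sketch only (not part of the reviewed steps), a bounded poll could look like the following. `wait_for` and `is_replica` are hypothetical helper names, and the 120-second timeout in the usage comment is an arbitrary assumption; `$fqdn` and `$redis_cli` are the variables exported in the change steps.

```shell
# wait_for TIMEOUT_SECONDS DESCRIPTION COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds.
# Returns 1 if it has still not succeeded after TIMEOUT_SECONDS retries.
wait_for() {
  local timeout="$1" desc="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for ${desc}" >&2
      return 1
    fi
    echo "waiting for ${desc}"
    sleep 1
    waited=$((waited + 1))
  done
}

# Hypothetical predicate: succeeds once the node reports the replica role.
# Relies on $fqdn and $redis_cli being set as in the change steps.
is_replica() {
  [ "$(ssh "$fqdn" "$redis_cli role | head -n1")" = "slave" ]
}

# Example (assumed timeout):
#   wait_for 120 stepdown is_replica || echo "stepdown did not complete; investigate before continuing"
```

On timeout the function returns non-zero instead of looping forever, which gives the change technician a clear point at which to stop and investigate.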
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 15 mins
- Check CPU utilization
  - Dashboard
  - sudo pidstat -t -p $(pidof bin/redis-server) 1
- (Optional) Record a profile with perf
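Before reading per-thread CPU numbers, it can help to confirm that the I/O threads exist at all after the restart. The following is a minimal sketch for Linux hosts that lists the thread names of a process from /proc. `thread_names` is a hypothetical helper, and the expectation that Redis I/O threads appear with names such as `io_thd_1` is an assumption that should be checked on the host.

```shell
# thread_names PID
# Prints the name of every thread in the process, one per line.
# Linux only: reads /proc/PID/task/TID/comm.
thread_names() {
  local pid="$1" task
  for task in /proc/"$pid"/task/*; do
    if [ -r "$task/comm" ]; then
      cat "$task/comm"
    fi
  done
  return 0
}

# Example on a Redis host (thread naming is an assumption):
#   thread_names "$(pidof bin/redis-server)" | sort | uniq -c
```

If only the main thread and background job threads show up, the io-threads setting most likely did not take effect, and the `config get io-threads` check from the change steps is worth repeating.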
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 mins
- Revert and apply chef change
- Failover if currently master, then reconfigure node 01
- Failover if currently master, then reconfigure node 02
- Failover if currently master, then reconfigure node 03
Monitoring
Key metrics to observe
- Metric: Saturation redis_primary_cpu
- Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: Sustained increase above baseline.
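"Sustained increase above baseline" is not quantified here. One way to make the rollback criterion concrete is to require that the most recent N samples all exceed the pre-change baseline. The sketch below illustrates that idea only; `sustained_above` is a hypothetical helper, and the baseline, sample count, and sample values in the usage comment are made-up numbers, not measurements from this cluster.

```shell
# sustained_above BASELINE COUNT SAMPLE...
# Succeeds (exit 0) if the last COUNT samples are all strictly above BASELINE.
# Fails if fewer than COUNT samples are given.
sustained_above() {
  local baseline="$1" count="$2"
  shift 2
  [ "$#" -ge "$count" ] || return 1
  printf '%s\n' "$@" | tail -n "$count" | awk -v b="$baseline" '
    $1 + 0 <= b + 0 { bad = 1 }   # any sample at or below baseline breaks the streak
    END { exit (bad ? 1 : 0) }
  '
}

# Example with made-up saturation percentages:
#   sustained_above 70 3 60 75 80 90 && echo "consider rollback"
```

Requiring a run of consecutive samples avoids reacting to a single spike, which is expected during the failovers themselves.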
Summary of infrastructure changes
This is a configuration change on redis-server.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12626
Edited by Igor