Enable threaded I/O on redis-sidekiq gprd

Production Change

Change Summary

We are enabling threaded I/O on redis-sidekiq in production. It has been enabled on the redis-persistent cluster for several months and on redis-sidekiq gstg for several weeks.

We expect a slight reduction in CPU utilization on the main thread, especially under load. This can be measured as a reduction in system time consumed, as some of the write syscalls are fanned out to a dedicated I/O thread pool.

Change Details

Services Impacted - ServiceRedis (redis-sidekiq)
Change Technician - @igorwwwwwwwwwwwwwwwwwwww
Change Criticality - C3
Change Type - changeunscheduled
Change Reviewer - @rehab
Due Date - 2021-05-06 13:00 UTC
Time tracking - 1h
Downtime Component - Reads will remain available. There will be a minimal window of data loss during the controlled failover.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 45 mins

Merge and apply chef change

Setup env

export gitlab_env=gprd
export gitlab_project=gitlab-production
export gitlab_redis_cluster=redis-sidekiq

export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'

export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
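For reference, the `hosts` expression above should expand to one short hostname per line. A quick local check, assuming a `seq` whose `-f` option accepts printf-style formats such as `%02g` (GNU coreutils does):

```shell
gitlab_env=gprd
gitlab_redis_cluster=redis-sidekiq

hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)

# Quote the variable so the newlines between hostnames are preserved.
echo "$hosts"
# redis-sidekiq-01-db-gprd
# redis-sidekiq-02-db-gprd
# redis-sidekiq-03-db-gprd
```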

Start with host 01

export i=01

export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Failover if we are a master

# if role is master, perform failover
if [[ "$(ssh $fqdn "$redis_cli role | head -n1")" = "master" ]]; then echo failing over; ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-${gitlab_redis_cluster}"; fi

# wait for master to step down and sync (expect "slave" [sic] and "connected")
while ! [[ "$(ssh $fqdn "$redis_cli role" | head -n1)" = "slave" ]]; do echo waiting for stepdown; sleep 1; done
while ! [[ "$(ssh $fqdn "$redis_cli --raw role" | tail -n +4 | head -n1)" = "connected" ]]; do echo waiting for sync; sleep 1; done

# wait for sentinel to ack the master change
while [[ "$(ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --raw sentinel master ${gitlab_env}-${gitlab_redis_cluster}" | grep -A1 ^ip$ | tail -n +2 | awk '{ print substr($0, length($0)-1) }')" = "$i" ]]; do echo waiting for sentinel; sleep 1; done
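The parsing pipelines used in the wait loops above can be tried locally against canned output before running them against production. The sample replies below are made up for illustration: the IP address, port, and offset are hypothetical, and the sentinel reply is abbreviated. Note that the last-two-characters comparison only works if the final digits of each node's IP match its node number, which the sentinel loop appears to assume.

```shell
# Hypothetical `redis-cli --raw role` reply from a replica:
# role, master host, master port, replication state, replication offset.
role_reply='slave
10.0.0.102
6379
connected
123456'

# Line 1 is the role; line 4 is the replication state.
echo "$role_reply" | head -n1
echo "$role_reply" | tail -n +4 | head -n1

# Hypothetical, abbreviated `redis-cli --raw sentinel master <name>` reply:
# alternating field names and values, one per line.
sentinel_reply='name
gprd-redis-sidekiq
ip
10.0.0.102
port
6379'

# Take the line after "ip", then keep only its last two characters.
echo "$sentinel_reply" | grep -A1 '^ip$' | tail -n +2 | awk '{ print substr($0, length($0)-1) }'
```

With these samples the three commands print `slave`, `connected`, and `02`, so a loop running with `i=02` would keep waiting until sentinel reports a different master.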

Reconfigure

# double check that we are dealing with a replica
ssh $fqdn "$redis_cli --no-raw role"

# check sentinel quorum, and roles
echo "$hosts" | xargs -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
ssh $fqdn "$redis_cli --no-raw role"

# ensure current config is as expected
ssh $fqdn "$redis_cli config get io-threads"

# run chef-client
ssh $fqdn "sudo chef-client"

# temporarily disable rdb saving to allow for fast restart
ssh $fqdn "$redis_cli config get save"
ssh $fqdn "$redis_cli config set save ''"

# reconfigure
# this _will_ restart processes
ssh $fqdn "sudo gitlab-ctl reconfigure"

# ensure config change took effect
ssh $fqdn "$redis_cli config get save"
ssh $fqdn "$redis_cli config get io-threads"

# check sync status
echo "$hosts" | xargs -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; $redis_cli role | head -n1; echo"

# check sentinel status
echo "$hosts" | xargs -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
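The "ensure config change took effect" check can be made less error-prone by comparing values instead of reading them by eye. A minimal sketch, assuming `config get` prints the parameter name and its value on separate lines when its output is piped. `expect_config` is a hypothetical helper, and the value `4` is a placeholder, not necessarily what the chef change sets:

```shell
# expect_config KEY EXPECTED
# Reads a `redis-cli config get KEY` reply on stdin and fails unless the
# line following KEY equals EXPECTED.
expect_config() {
  key="$1"
  expected="$2"
  # Print the line that follows the line equal to KEY; exit non-zero if
  # KEY never appears in the reply.
  actual=$(awk -v k="$key" '
    prev == k { print; found = 1; exit }
    { prev = $0 }
    END { exit !found }
  ') || {
    echo "FAIL: $key not found in reply" >&2
    return 1
  }
  if [ "$actual" = "$expected" ]; then
    echo "OK: $key = '$actual'"
  else
    echo "FAIL: $key = '$actual' (expected '$expected')" >&2
    return 1
  fi
}

# Local dry run against a canned reply (placeholder value):
printf 'io-threads\n4\n' | expect_config io-threads 4
```

Against a node, it could be used as `ssh $fqdn "$redis_cli config get io-threads" | expect_config io-threads 4`, again with the real expected value substituted.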

Repeat process for node 02

export i=02

export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Run the "Failover if we are a master" and "Reconfigure" steps above against node 02

Repeat process for node 03

export i=03

export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"

echo $fqdn
echo $host_self_link

Run the "Failover if we are a master" and "Reconfigure" steps above against node 03
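Because the per-node setup differs only in `i`, the FQDNs for all three nodes can be previewed before starting. A small sketch using the same naming scheme as the setup steps; `node_fqdn` is a hypothetical helper, and the loop uses its own variable `n` so it does not overwrite the `i` exported above:

```shell
gitlab_env=gprd
gitlab_project=gitlab-production
gitlab_redis_cluster=redis-sidekiq

# node_fqdn NN: prints the internal FQDN for node NN (e.g. 01).
node_fqdn() {
  echo "${gitlab_redis_cluster}-$1-db-${gitlab_env}.c.${gitlab_project}.internal"
}

for n in 01 02 03; do
  node_fqdn "$n"
done
```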

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 mins

Check CPU utilization
- Dashboard
- sudo pidstat -t -p $(pidof bin/redis-server) 1
- (Optional) Record a profile with perf
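If `pidstat` is not available, per-thread user and system time can also be read directly from `/proc` on Linux. The following is a rough sketch rather than part of the original procedure: `thread_cpu_ticks` is a hypothetical helper, and it relies on the `utime` and `stime` fields (14 and 15) of `/proc/<pid>/task/<tid>/stat`, which are reported in clock ticks.

```shell
# thread_cpu_ticks PID
# Prints one line per thread of PID: "TID UTIME STIME" (clock ticks).
thread_cpu_ticks() {
  for stat in /proc/"$1"/task/*/stat; do
    [ -r "$stat" ] || continue
    awk '{
      tid = $1
      # Drop "tid (comm) " so that a thread name containing spaces cannot
      # shift the fields; utime and stime are then fields 12 and 13.
      sub(/^.*\) /, "")
      print tid, $12, $13
    }' "$stat"
  done
}

# Local dry run against the current shell:
thread_cpu_ticks "$$"
```

Sampling the redis-server PID twice and comparing the `STIME` column of the main thread before and after the change gives a crude view of the system-time reduction described in the summary; `getconf CLK_TCK` reports the number of ticks per second.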

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 mins

Revert and apply chef change
Failover if currently master, then reconfigure node 01
Failover if currently master, then reconfigure node 02
Failover if currently master, then reconfigure node 03

Monitoring

Key metrics to observe

Metric: Saturation redis_primary_cpu
- Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: Sustained increase above baseline.

Summary of infrastructure changes

This is a configuration change on redis-server.

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12626

Edited May 06, 2021 by Igor