Enable threaded I/O on redis-sidekiq gprd

Production Change

Change Summary

We are enabling threaded I/O on redis-sidekiq in production. It has been enabled on the redis-persistent cluster for several months, and on redis-sidekiq in gstg for several weeks.

We expect a slight reduction in CPU utilization on the main thread, especially under load. This should show up as lower system time on the main thread, because some of the write syscalls are fanned out to a dedicated I/O thread pool.

Change Details

  1. Services Impacted - ServiceRedis (redis-sidekiq)
  2. Change Technician - @igorwwwwwwwwwwwwwwwwwwww
  3. Change Criticality - C3
  4. Change Type - changeunscheduled
  5. Change Reviewer - @rehab
  6. Due Date - 2021-05-06 13:00 UTC
  7. Time tracking - 1h
  8. Downtime Component - Reads will remain available. There will be a minimal window of data loss during the controlled failover.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 45 mins

  • Merge and apply chef change
  • Setup env
    export gitlab_env=gprd
    export gitlab_project=gitlab-production
    export gitlab_redis_cluster=redis-sidekiq
    
    export redis_cli='REDISCLI_AUTH="$(sudo grep ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
    
    export hosts=$(seq -f "${gitlab_redis_cluster}-%02g-db-${gitlab_env}" 1 3)
  • Start with host 01
    export i=01
    
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
  • Failover if we are a master
    # if role is master, perform failover
    if [[ "$(ssh $fqdn "$redis_cli role | head -n1")" = "master" ]]; then echo failing over; ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel failover ${gitlab_env}-${gitlab_redis_cluster}"; fi
    
    # wait for master to step down and sync (expect "slave" [sic] and "connected")
    while ! [[ "$(ssh $fqdn "$redis_cli role" | head -n1)" = "slave" ]]; do echo waiting for stepdown; sleep 1; done
    while ! [[ "$(ssh $fqdn "$redis_cli --raw role" | tail -n +4 | head -n1)" = "connected" ]]; do echo waiting for sync; sleep 1; done
    
    # wait for sentinel to ack the master change
    while [[ "$(ssh $fqdn "/opt/gitlab/embedded/bin/redis-cli -p 26379 --raw sentinel master ${gitlab_env}-${gitlab_redis_cluster}" | grep -A1 ^ip$ | tail -n +2 | awk '{ print substr($0, length($0)-1) }')" = "$i" ]]; do echo waiting for sentinel; sleep 1; done
  • Reconfigure
    # double check that we are dealing with a replica
    ssh $fqdn "$redis_cli --no-raw role"
    
    # check sentinel quorum, and roles
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
    ssh $fqdn "$redis_cli --no-raw role"
    
    # ensure current config is as expected
    ssh $fqdn "$redis_cli config get io-threads"
    
    # run chef-client
    ssh $fqdn "sudo chef-client"
    
    # temporarily disable rdb saving to allow for fast restart
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config set save ''"
    
    # reconfigure
    # this _will_ restart processes
    ssh $fqdn "sudo gitlab-ctl reconfigure"
    
    # ensure config change took effect
    ssh $fqdn "$redis_cli config get save"
    ssh $fqdn "$redis_cli config get io-threads"
    
    # check sync status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" 'hostname; '"$redis_cli"' role | head -n1; echo'
    
    # check sentinel status
    echo "$hosts" | xargs -n1 -I{} ssh "{}.c.${gitlab_project}.internal" "hostname; /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${gitlab_env}-${gitlab_redis_cluster}"
  • Repeat process for node 02
    export i=02
    
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
  • Complete reconfigure on node 02
  • Repeat process for node 03
    export i=03
    
    export fqdn="${gitlab_redis_cluster}-$i-db-${gitlab_env}.c.${gitlab_project}.internal"
    export host_self_link="$(gcloud --project $gitlab_project compute instances list --format json --filter "name=(${gitlab_redis_cluster}-$i-db-${gitlab_env})" | jq -r '.[].selfLink')"
    
    echo $fqdn
    echo $host_self_link
  • Complete reconfigure on node 03
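
The per-node setup and the wait loops above are repeated by hand for each node, and the loops have no upper bound. A minimal sketch of two helpers that could reduce the copy-paste and add a timeout follows. The names `node_fqdn` and `wait_until` are hypothetical and not part of any existing tooling; the defaults simply mirror the "Setup env" step.

```shell
# Sketch only: node_fqdn and wait_until are hypothetical helper names.

# Defaults mirror the "Setup env" step; export different values beforehand to override.
gitlab_env="${gitlab_env:-gprd}"
gitlab_project="${gitlab_project:-gitlab-production}"
gitlab_redis_cluster="${gitlab_redis_cluster:-redis-sidekiq}"

# Print the internal FQDN for a two-digit node index such as 01, 02 or 03.
node_fqdn() {
  local i="$1"
  echo "${gitlab_redis_cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
}

# Retry a command once per second until it succeeds,
# or give up with a non-zero status after the timeout (in seconds).
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    echo "waiting for: $*"
    sleep 1
  done
}

# Example usage against a real node (commented out because it needs ssh access):
#   fqdn="$(node_fqdn 02)"
#   is_replica() { [[ "$(ssh "$fqdn" "$redis_cli role" | head -n1)" = "slave" ]]; }
#   wait_until 120 is_replica
```

A bounded wait makes a stuck failover visible as an explicit error instead of an endless "waiting for stepdown" loop. The 120 second timeout in the usage comment is an arbitrary placeholder.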

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 mins

  • Check CPU utilization
    • Dashboard
    • sudo pidstat -t -p $(pidof bin/redis-server) 1
    • (Optional) Record a profile with perf

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 mins

  • Revert and apply chef change
  • Failover if currently master, then reconfigure node 01
  • Failover if currently master, then reconfigure node 02
  • Failover if currently master, then reconfigure node 03

Monitoring

Key metrics to observe

Summary of infrastructure changes

This is a configuration change on redis-server.

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.

refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12626

Edited by Igor