Increase TCP idle timeout on redis-cache nodes

This is a follow-up to #1831 (closed) to make the experimental change permanent.

Production Change - Criticality 4 (C4)

Change Objective	Let idle client connections live for multiple minutes, so the every-minute workload burst does not have to create as many new connections, saving redis CPU time and memory churn.
Change Type	ConfigurationChange
Services Impacted	redis-cache
Change Team Members	@msmiley
Change Criticality	C4
Change Reviewer or tested in staging	Tested on staging environment: #1874 (comment 314341173)
Dry-run output	N/A
Due Date	2020-03-31 01:45 UTC (2020-03-30 18:45 PDT)
Time tracking	10 minutes (same for rollback)

Detailed steps for the change

Pre-condition

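The runtime setting is already 1200 seconds on all three nodes, but gitlab.rb still says 60 seconds on every node, and redis.conf still says 60 seconds on redis-cache-02 and redis-cache-03 (redis-cache-01 already shows 1200). Note that redis.conf is a separate file from gitlab.rb, so the two can disagree.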

WARNING: DO NOT run "gitlab-ctl reconfigure", as it would cause redis to restart needlessly and cause downtime. The chef-client run will not run it for that very reason.

$ knife ssh 'roles:gprd-base-db-redis-server-cache' '~/gitlab-redis-cli.sh config get timeout'
redis-cache-03-db-gprd.c.gitlab-production.internal 1) "timeout"
redis-cache-03-db-gprd.c.gitlab-production.internal 2) "1200"
redis-cache-02-db-gprd.c.gitlab-production.internal 1) "timeout"
redis-cache-02-db-gprd.c.gitlab-production.internal 2) "1200"
redis-cache-01-db-gprd.c.gitlab-production.internal 1) "timeout"
redis-cache-01-db-gprd.c.gitlab-production.internal 2) "1200"

$ knife ssh 'roles:gprd-base-db-redis-server-cache' 'sudo grep "redis.*tcp_timeout" /etc/gitlab/gitlab.rb'
redis-cache-01-db-gprd.c.gitlab-production.internal redis['tcp_timeout'] = "60"
redis-cache-02-db-gprd.c.gitlab-production.internal redis['tcp_timeout'] = "60"
redis-cache-03-db-gprd.c.gitlab-production.internal redis['tcp_timeout'] = "60"

$ knife ssh 'roles:gprd-base-db-redis-server-cache' 'sudo grep "^timeout" /var/opt/gitlab/redis/redis.conf'
redis-cache-01-db-gprd.c.gitlab-production.internal timeout 1200
redis-cache-02-db-gprd.c.gitlab-production.internal timeout 60
redis-cache-03-db-gprd.c.gitlab-production.internal timeout 60
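
The runtime check above can also be done mechanically instead of by eye. This is a minimal sketch, assuming the knife ssh output keeps the "host 2) "value"" shape shown above; the function name runtime_timeout_mismatches is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: reads `knife ssh ... config get timeout` output on stdin
# and prints "host value" for every host whose runtime timeout is not the
# expected number of seconds. Prints nothing when all hosts match.
runtime_timeout_mismatches() {
  rtm_expected="$1"
  awk -v want="$rtm_expected" '
    # The value line of the redis-cli reply is the one numbered "2)".
    $2 == "2)" {
      gsub(/"/, "", $3)              # strip the surrounding quotes
      if ($3 != want) print $1 " " $3
    }
  '
}
```

It could be fed from the first pre-condition command, for example by piping that knife ssh output into `runtime_timeout_mismatches 1200`; empty output would mean every node reports 1200.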

Change procedure

Back up the redis.conf file for later comparison.

$ knife ssh 'roles:gprd-base-db-redis-server-cache' 'sudo cp -p /var/opt/gitlab/redis/redis.conf{,.backup}'

Run the apply_to_prod pipeline job for the merge request, which only updates gitlab.rb, not redis.conf: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3020 -- relevant pipeline: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/pipelines/128181
Run chef-client to update gitlab.rb. This does not have to complete before proceeding to the next step.

$ knife ssh -C 1 'roles:gprd-base-db-redis-server-cache' 'sudo chef-client'

Run CONFIG REWRITE via redis-cli to update redis.conf to match the runtime state of the redis-server process.

$ knife ssh -C 1 'roles:gprd-base-db-redis-server-cache' '~/gitlab-redis-cli.sh config rewrite'

Validation

Verify the only change to redis.conf was the expected change to the timeout setting.

$ knife ssh -C 1 'roles:gprd-base-db-redis-server-cache' 'sudo diff -U0 /var/opt/gitlab/redis/redis.conf{.backup,}'
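
For a longer diff, the "only the timeout line changed" check could be filtered rather than read manually. The sketch below assumes plain `diff -U0` output as produced on a single node (without the hostname prefix that knife ssh adds); unexpected_diff_lines is a hypothetical name.

```shell
# Hypothetical helper: reads unified diff output on stdin and prints every
# added or removed line that is NOT a "timeout <seconds>" line.
# Empty output means the timeout setting was the only change.
unexpected_diff_lines() {
  awk '
    /^(---|\+\+\+) / { next }          # file headers
    /^@@/ { next }                      # hunk headers
    /^[+-]timeout [0-9]+$/ { next }     # the expected change
    /^[+-]/ { print }                   # anything else is unexpected
  '
}
```

One caveat of this simple filter: a removed config line that itself begins with "-- " would be mistaken for a file header, which is not expected in redis.conf but is worth knowing.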

Verify the Redis runtime setting for timeout is still 1200 seconds, and verify the config files now agree:

gitlab.rb (chef-managed)
redis.conf (gitlab-ctl-managed)

$ knife ssh 'roles:gprd-base-db-redis-server-cache' '~/gitlab-redis-cli.sh config get timeout'

$ knife ssh 'roles:gprd-base-db-redis-server-cache' 'sudo grep "redis.*tcp_timeout" /etc/gitlab/gitlab.rb'

$ knife ssh 'roles:gprd-base-db-redis-server-cache' 'sudo grep "^timeout" /var/opt/gitlab/redis/redis.conf'
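
The two file checks can be combined into a single pass/fail test per node. This is a sketch only: the function names are hypothetical, the gitlab.rb match deliberately mirrors the loose grep pattern used above (so a commented-out tcp_timeout line would also match), and on the real nodes the files would need to be read with sudo.

```shell
# Hypothetical helpers: extract the timeout value from each config file.
gitlab_rb_timeout() {
  # Same pattern as the grep above; keep only the digits of the matched line.
  grep "redis.*tcp_timeout" "$1" | tr -cd '0-9\n'
}

redis_conf_timeout() {
  # redis.conf directives are "name value"; print the value for "timeout".
  awk '$1 == "timeout" { print $2 }' "$1"
}

# Succeeds only if both files carry the expected timeout.
# Usage: timeouts_agree <seconds> <gitlab.rb path> <redis.conf path>
timeouts_agree() {
  ta_want="$1"
  [ "$(gitlab_rb_timeout "$2")" = "$ta_want" ] &&
    [ "$(redis_conf_timeout "$3")" = "$ta_want" ]
}
```

For this change, a call such as `timeouts_agree 1200 /etc/gitlab/gitlab.rb /var/opt/gitlab/redis/redis.conf` would be expected to succeed on every node after the procedure.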

Rollback steps

Since this change only makes the redis.conf file match the existing runtime state, no rollback should be needed. If one is needed anyway, the old redis.conf file can be restored from the redis.conf.backup copy made at the start of the change procedure.
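
If the file does need to be restored, something along these lines would do it. This is a sketch with a hypothetical function name; it assumes the backup sits next to the config file with a .backup suffix, as created in the change procedure, and it would need sudo on the real nodes. Restoring the file does not change the running redis-server, which reads redis.conf only at startup.

```shell
# Hypothetical helper: copy <conf>.backup back over <conf>, preserving
# mode and timestamps. Fails without touching anything if no backup exists.
restore_redis_conf() {
  rrc_conf="$1"
  rrc_backup="$rrc_conf.backup"
  if [ ! -f "$rrc_backup" ]; then
    echo "no backup found at $rrc_backup" >&2
    return 1
  fi
  cp -p "$rrc_backup" "$rrc_conf"
}
```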

Changes checklist

Detailed steps and rollback steps have been filled prior to commencing work
Person on-call has been informed prior to change being rolled out

Edited Mar 31, 2020 by Matt Smiley