Convert gstg redis-cache to C2 machine type (#2)
Context: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9636
Production Change - Criticality 2
Change Objective | Change the redis-cache nodes to C2 machine types with better CPUs |
---|---|
Change Type | ConfigurationChange |
Services Impacted | ServiceRedis |
Change Team Members | @igorwwwwwwwwwwwwwwwwwwww @craigf |
Change Criticality | C2 |
Change Reviewer | @hphilipps |
Tested in staging | This is the staging control (to be copied for production) |
Dry-run output | N/A |
Due Date | 2020-03-27 10:45UTC (engineer @ 11:45) |
Time tracking | 1hr |
Detailed steps for the change
- Merge the terraform MR. - DO NOT APPLY WITH CI - that would cause all nodes to be restarted, or to fail if TF does not allow restarts (likely).
- Identify the current primary redis-cache node. SSH to each, find the one that has PRIMARY-REDIS in the prompt.
- Select one of the replicas, and shut it down.
- Change the machine type: gcloud --project gitlab-staging-1 compute instances set-machine-type redis-cache-N-db-gstg --machine-type c2-standard-30 --zone us-east1-X where N selects the node, and X the zone that node is in (01 = c, 02 = d, 03 = b)
- Change the disk size: gcloud --project gitlab-staging-1 compute disks resize redis-cache-N-db-gstg-data --size 500 --zone us-east1-X
- Start the node: gcloud --project gitlab-staging-1 compute instances start redis-cache-N-db-gstg --zone us-east1-X
- Once the node has started up:
  - Verify that redis and sentinel started with sudo gitlab-ctl status
    - If necessary, start them manually with sudo gitlab-ctl start
  - Resize data filesystem: sudo resize2fs /dev/sdb
  - On the sentinel master (ssh redis-cache-sentinel-N-db-gstg.c.gitlab-staging-1.internal), verify:
    - With sudo tail -f /var/log/gitlab/redis/current /var/log/gitlab/sentinel/current, that the slave reconnected and sentinel reported a +sdown on slave and sentinel before the restart, and a reboot and -sdown on slave and sentinel when the services started up again. Also that Synchronization with slave IP:6379 succeeded
    - With /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel replicas gstg-redis-cache, that two replicas are known.
- Monitor for 5 minutes, ensure that nothing unexpected happens on the redis cluster (e.g. further failovers)
- Force a failover to the modified node:
  - On the replica that is still type n1, ensure it cannot be failed over to (temporarily)
    - REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
    - /opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 0
  - On any one of the redis-cache-sentinel nodes: /opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gstg-redis-cache.
  - Verify it failed over to the c2 node: /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gstg-redis-cache, looking for the last octet of the IP address in item (4) to be 10X (X being the index of the node)
  - On the replica that is still of type n1, enable failover again:
    - REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)
    - /opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 100
- Monitor for 15 minutes:
  - Tail logs: sudo tail -F /var/log/gitlab/redis/current and sudo tail -F /var/log/gitlab/sentinel/current
  - Ensure that nothing unexpected happens on the redis cluster (e.g. further failovers)
  - Monitor performance using https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gstg&var-type=redis-cache&var-stage=main&var-component=single_threaded_cpu&fullscreen which shows the highest single-CPU usage on the cluster; that will be the master (both before and after). We are expecting this number to drop. If it rises, roll back immediately.
- Apply process for second instance
- Apply process for third instance
- Locally: tf plan -target module.redis-cache. The plan should show no changes.
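The machine-type, disk-size, and start commands in the steps above differ only in the node number and its zone. As a rough sketch (not part of the original plan), the helper below prints those three commands for a given node so they can be reviewed before being pasted. The function names `zone_for_node` and `print_conversion_commands` are made up for illustration; the project, instance naming, machine type, disk size, and zone mapping are copied from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: prints the gcloud commands for one node; it does not run them.

# Map a node number (01, 02, 03) to its zone, per the mapping in the steps.
zone_for_node() {
  case "$1" in
    01) echo "us-east1-c" ;;
    02) echo "us-east1-d" ;;
    03) echo "us-east1-b" ;;
    *)  echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Print the machine-type change, disk resize, and start commands for a node.
print_conversion_commands() {
  local node zone project instance
  node="$1"
  zone="$(zone_for_node "$node")" || return 1
  project="gitlab-staging-1"
  instance="redis-cache-${node}-db-gstg"
  echo "gcloud --project ${project} compute instances set-machine-type ${instance} --machine-type c2-standard-30 --zone ${zone}"
  echo "gcloud --project ${project} compute disks resize ${instance}-data --size 500 --zone ${zone}"
  echo "gcloud --project ${project} compute instances start ${instance} --zone ${zone}"
}
```

For example, `print_conversion_commands 02` prints the three commands for redis-cache-02-db-gstg in us-east1-d. The node still has to be shut down first, as in the steps.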
The change of node type of the other 2 nodes (now replicas) will be completed in the following days, when we have confidence there are no unexpected effects (unlikely, but worth being careful).
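The "verify it failed over to the c2 node" check can be scripted instead of read by eye. This is a minimal sketch, not part of the original plan. It assumes that redis-cli, when its output is piped rather than shown in a terminal, prints one reply item per line, so that the IP (item 4 in the interactive listing) is the fourth line. The function name `master_is_node` is hypothetical.

```shell
#!/usr/bin/env bash
# Read raw `sentinel master` output on stdin and check the master's IP.
# Assumption: piped redis-cli output has one item per line, so line 4 is the IP.
master_is_node() {
  local index ip
  index="$1"              # node index X; the expected last octet is 10X
  ip="$(sed -n '4p')"     # fourth line of the reply: the master's IP
  case "$ip" in
    *."10${index}") echo "master is node ${index} (${ip})" ;;
    *) echo "master is NOT node ${index} (${ip})" >&2; return 1 ;;
  esac
}
```

On a sentinel node this would be used as `/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gstg-redis-cache | master_is_node 1`, with the exit status showing whether the failover landed on the expected node.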
Rollback steps
- Force a failover away from the C2 node: /opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gstg-redis-cache.
- Revert the change in MR
- Shut down the C2 node
- Change back to instance type n1-standard-2: gcloud --project gitlab-staging-1 compute instances set-machine-type redis-cache-N-db-gstg --machine-type n1-highmem-16 --zone us-east1-X
- Start up the node: gcloud --project gitlab-staging-1 compute instances start redis-cache-N-db-gstg --zone us-east1-X
- Once the node has started up, verify that redis and sentinel started with sudo gitlab-ctl status
  - If necessary, start them manually with sudo gitlab-ctl start
- Do not roll back disk resize (it's hard to make this safe)
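Both the change and the rollback end with confirming that redis and sentinel came back up. The sketch below is one possible way to turn that check into an exit status. It is an illustration only, and it assumes `gitlab-ctl status` prints runit-style lines such as `run: redis: (pid 123) 45s; run: log: (pid 122) 45s`. The function name `services_running` is hypothetical.

```shell
#!/usr/bin/env bash
# Read `sudo gitlab-ctl status` output on stdin; succeed only if both
# redis and sentinel are reported as running.
# Assumption: each service line starts with "run: <name>:" when it is up.
services_running() {
  local status svc rc
  rc=0
  status="$(cat)"
  for svc in redis sentinel; do
    if printf '%s\n' "$status" | grep -q "^run: ${svc}:"; then
      echo "${svc}: running"
    else
      echo "${svc}: not running, try: sudo gitlab-ctl start" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Usage on the node would be `sudo gitlab-ctl status | services_running`.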
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- Person on-call has been informed prior to change being rolled out
Based on the work by @cmiskell in #1829 (closed).
Edited by Igor