Convert gprd redis-cache to C2 machine type
Context: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9636
Production Change - Criticality 2

| Change Objective | Change the redis-cache nodes to C2 machine types with better CPUs |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | Service::Redis |
| Change Team Members | @igorwwwwwwwwwwwwwwwwwwww @craigf |
| Change Criticality | C2 |
| Change Reviewer | @T4cC0re |
| Tested in staging | #1866 (closed) |
| Dry-run output | N/A |
| Due Date | 2020-03-31 10:15 UTC (engineer @ 12:15) |
| Time tracking | 2hr |
## Detailed steps for the change
1. Merge the terraform MR. **DO NOT APPLY WITH CI**: that would cause all nodes to be restarted, or the apply to fail if Terraform does not allow restarts (likely).
2. Identify the current primary redis-cache node. SSH to each node and find the one that has `PRIMARY-REDIS` in the prompt.
3. Select one of the replicas, and shut it down.
4. Change the machine type: `gcloud --project gitlab-production compute instances set-machine-type redis-cache-N-db-gprd --machine-type c2-standard-30 --zone us-east1-X`, where `N` selects the node and `X` the zone that node is in (01 = c, 02 = d, 03 = b).
5. Change the disk size: `gcloud --project gitlab-production compute disks resize redis-cache-N-db-gprd-data --size 500 --zone us-east1-X`
6. Start the node: `gcloud --project gitlab-production compute instances start redis-cache-N-db-gprd --zone us-east1-X`. (A helper-script sketch combining steps 4-6 appears after this list.)
7. Once the node has started up:
    - Resize the data filesystem: `sudo resize2fs /dev/sdb`
    - Verify that redis and sentinel started with `sudo gitlab-ctl status`. If necessary, start them manually with `sudo gitlab-ctl start`.
8. On the sentinel master (`ssh redis-cache-sentinel-N-db-gprd.c.gitlab-production.internal`), verify:
    - With `sudo tail -f /var/log/gitlab/redis/current /var/log/gitlab/sentinel/current`: that the slave reconnected; that sentinel reported `+sdown` on slave and sentinel before the restart, and a reboot and `-sdown` on slave and sentinel when the services started up again; and that `Synchronization with slave IP:6379 succeeded` appears.
    - With `/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel replicas gprd-redis-cache`: that two replicas are known.
9. Monitor for 5 minutes, and ensure that nothing unexpected happens on the redis cluster (e.g. further failovers).
10. Force a failover to the modified node:
    - On the replica that is still type n1, temporarily ensure it cannot be failed over to:
        - `REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2)`
        - `/opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 0`
    - On any one of the redis-cache-sentinel nodes: `/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gprd-redis-cache`.
    - Verify it failed over to the c2 node with `/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gprd-redis-cache`, looking for the last octet of the IP address in item (4) to be `10X` (X being the index of the node).
    - On the replica that is still of type n1, enable failover again:
        - `REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2)`
        - `/opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 100`
11. Monitor for 15 minutes. (A polling sketch appears after this list.)
    - Tail the logs: `sudo tail -F /var/log/gitlab/redis/current` and `sudo tail -F /var/log/gitlab/sentinel/current`.
    - Ensure that nothing unexpected happens on the redis cluster (e.g. further failovers).
    - Monitor performance using [the single_threaded_cpu saturation panel](https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-cache&var-stage=main&var-component=single_threaded_cpu&fullscreen), which shows the highest single-CPU usage on the cluster, i.e. the master (both before and after). We expect this number to drop; if it rises, roll back immediately.
12. Apply the process for the second instance.
13. Apply the process for the third instance.
14. Locally: `tf plan -target module.redis-cache`. There should be no plan.
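The gcloud commands in steps 4-6 take the same node/zone parameters each time. Below is a minimal sketch of a wrapper that applies them for one node; it is not part of the official runbook, and it assumes zero-padded node names (e.g. `redis-cache-01-db-gprd`, matching the staging names used later in this plan) and the zone mapping from step 4.

```bash
#!/usr/bin/env bash
# Sketch: convert one redis-cache node (steps 4-6 above).
# Assumes zero-padded node names and the zone mapping 01=c, 02=d, 03=b.
set -euo pipefail

N="${1:-}"                   # node index: 1, 2 or 3
case "$N" in
  1) X=c ;;
  2) X=d ;;
  3) X=b ;;
  *) echo "usage: $0 <node index 1-3>" >&2; exit 1 ;;
esac

NODE="redis-cache-0${N}-db-gprd"
ZONE="us-east1-${X}"

# The node must already be shut down (step 3) before the type change.
gcloud --project gitlab-production compute instances set-machine-type "$NODE" \
  --machine-type c2-standard-30 --zone "$ZONE"
gcloud --project gitlab-production compute disks resize "${NODE}-data" \
  --size 500 --zone "$ZONE"
gcloud --project gitlab-production compute instances start "$NODE" --zone "$ZONE"
```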
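For the monitoring in steps 9 and 11, the sentinel's view of the cluster can also be polled rather than eyeballed. A sketch, run on any redis-cache-sentinel node; it assumes redis-cli's raw output mode when piped, where each field of `SENTINEL master` prints on its own line:

```bash
# Sketch: poll sentinel every 30s and print the master's address, flags and
# replica count; a changing ip or any flag other than "master" is suspect.
for i in $(seq 1 10); do
  date
  /opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gprd-redis-cache \
    | grep -A1 -E '^(ip|flags|num-slaves)$'
  sleep 30
done
```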
The change of node type for the other 2 nodes (now replicas) will be completed in the following days, once we have confidence there are no unexpected effects (unlikely, but worth being careful).
## Rebuilding nodes whose root disks+filesystems were accidentally grown
You can't shrink ext4 filesystems online, and you can't unmount the root filesystem (well, not without a lot of pivot_root trickery). Since we accidentally grew the root disk+filesystem, we need to rebuild these nodes.
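Whether a given node is affected can be checked before rebuilding by comparing device and filesystem sizes. A minimal sketch, assuming the root filesystem lives on `/dev/sda1` (an assumption; verify with `lsblk` first):

```bash
# Sketch: compare disk/partition sizes against the mounted root filesystem.
lsblk /dev/sda                          # device and partition sizes
df -h /                                 # size of the mounted root fs
sudo dumpe2fs -h /dev/sda1 2>/dev/null | grep -i 'block count'  # ext4 block count
```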
Run these steps on staging first.
1. If the node is a master, initiate a failover (see the steps above).
2. `tf apply -target module.redis-cache`
3. Await chef convergence: `gcloud --project=gitlab-staging-1 compute instances tail-serial-port-output redis-cache-0N-db-gstg --zone=us-east1-X | grep startup-script`. (A post-rebuild verification sketch follows this list.)
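After convergence, the same health checks as in the main procedure apply. A short sketch reusing those commands; note that the master name `gprd-redis-cache` is for production, and the staging equivalent (presumably `gstg-redis-cache`) is an assumption:

```bash
# Sketch: post-rebuild checks, reusing commands from the main steps.
sudo gitlab-ctl status                  # redis and sentinel should show "run:"
/opt/gitlab/embedded/bin/redis-cli -p 26379 \
  sentinel replicas gprd-redis-cache    # expect two replicas (run on a sentinel node)
```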
## Rollback steps
1. Force a failover away from the C2 node: `/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gprd-redis-cache`. (A verification sketch follows this list.)
2. Revert the change in the MR.
3. Shut down the C2 node.
4. Change back to machine type n1-highmem-16: `gcloud --project gitlab-production compute instances set-machine-type redis-cache-N-db-gprd --machine-type n1-highmem-16 --zone us-east1-X`
5. Start up the node: `gcloud --project gitlab-production compute instances start redis-cache-N-db-gprd --zone us-east1-X`
6. Once the node has started up, verify that redis and sentinel started with `sudo gitlab-ctl status`. If necessary, start them manually with `sudo gitlab-ctl start`.
7. Do not roll back the disk resize (it's hard to make this safe).
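After the rollback failover, it is worth confirming that the master actually moved off the C2 node. A sketch using the same sentinel query as step 10 of the main procedure, run on a sentinel node (assumes redis-cli's raw, one-field-per-line output when piped):

```bash
# Sketch: the "ip" field shows the current master; its last octet should no
# longer be the 10X address of the converted node (see step 10 above).
/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gprd-redis-cache \
  | grep -A1 '^ip$'
```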
## Changes checklist
- [ ] Detailed steps and rollback steps have been filled in prior to commencing work
- [ ] Person on-call has been informed prior to the change being rolled out
Based on the work by @cmiskell in #1829 (closed).