Convert gstg redis-sidekiq to C2 machine type
Context: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/230
C2
Production Change - Criticality 2

| Change Objective | Change the redis-sidekiq nodes to C2 machine types with better CPUs |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | ServiceSidekiq ServiceRedis |
| Change Team Members | @cmiskell |
| Change Criticality | C2 |
| Change Reviewer | Self-reviewed for staging |
| Tested in staging | This is the staging control (to be copied for production) |
| Dry-run output | N/A |
| Due Date | 2020-03-26 01:00UTC (engineer @ 14:00) |
| Time tracking | 1hr |

Detailed steps for the change
- Merge the terraform MR. DO NOT APPLY WITH CI: that would cause all nodes to be restarted, or to fail if TF does not allow restarts (likely).
- Identify the current primary redis-sidekiq node. SSH to each, find the one that has `PRIMARY-REDIS` in the prompt.
- Select one of the replicas, and shut it down.
- Change the machine type: `gcloud --project gitlab-staging-1 compute instances set-machine-type redis-sidekiq-N-db-gstg --machine-type c2-standard-4 --zone us-east1-X` where N selects the node, and X the zone that node is in (01 = c, 02 = d, 03 = b)
- Start the node: `gcloud --project gitlab-staging-1 compute instances start redis-sidekiq-N-db-gstg --zone us-east1-X`
- Once the node has started up:
  - Verify that redis and sentinel started with `sudo gitlab-ctl status`
    - If necessary, start them manually with `sudo gitlab-ctl start`
  - On the master, verify:
    - With `sudo tail -f /var/log/gitlab/redis/current /var/log/gitlab/sentinel/current`, that the slave reconnected and sentinel reported a `+sdown` on slave and sentinel before the restart, and a reboot and `-sdown` on slave and sentinel when the services started up again. Also that `Synchronization with slave IP:6379 succeeded`
    - With `/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel slaves gstg-redis-sidekiq`, that two slaves are known.
- Monitor for 5 minutes, ensure that nothing unexpected happens on the redis cluster (e.g. further failovers)
- Force a failover to the modified node:
  - On the replica that is still type n1, ensure it cannot be failed over to (temporarily)
    - `REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)`
    - `/opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 0`
  - On any one of the redis-sidekiq nodes: `/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gstg-redis-sidekiq`.
  - Verify it failed over to the c2 node with `/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gstg-redis-sidekiq`, looking for the last octet of the IP address in item (4) to be `10X` (X being the index of the node)
  - On the replica that is still of type n1, enable failover again:
    - `REDIS_MASTER_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf|cut -d\" -f2)`
    - `/opt/gitlab/embedded/bin/redis-cli -a $REDIS_MASTER_AUTH CONFIG SET replica-priority 100`
- Monitor for 15 minutes:
  - Ensure that nothing unexpected happens on the redis cluster (e.g. further failovers)
  - Monitor performance using https://dashboards.gitlab.net/d/alerts-saturation_component/alerts-saturation-component-alert?orgId=1&from=now-3h&to=now&panelId=2&tz=UTC&var-environment=gprd&var-type=redis-sidekiq&var-stage=main&var-component=single_threaded_cpu&fullscreen, which shows the highest single-CPU usage on the cluster; that will be the master (both before and after). We are expecting this number to drop. If it rises, roll back immediately.

The machine type of the other 2 nodes (now replicas) will be changed in the following days, once we have confidence there are no unexpected effects (unlikely, but worth being careful).
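
The failover verification in the steps above could be scripted roughly as follows. This is only a sketch: the function names `sentinel_master_ip` and `is_expected_master` are hypothetical, and it assumes that `redis-cli`, when its output is piped, prints the `sentinel master` reply one field per line with the address on the line after `ip`.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the original change plan.

# Read `redis-cli -p 26379 sentinel master <name>` output on stdin
# (assumed: one field per line) and print the line that follows "ip".
sentinel_master_ip() {
  awk 'prev == "ip" { print; exit } { prev = $0 }'
}

# Succeed if the last octet of IP $1 is 10X, where X is the node index $2 (1, 2 or 3).
is_expected_master() {
  [ "${1##*.}" = "10${2}" ]
}

# Intended use on a redis-sidekiq node (not executed here):
#   ip=$(/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gstg-redis-sidekiq | sentinel_master_ip)
#   is_expected_master "$ip" 2 && echo "master is node 02"
```

If the check fails, the raw `sentinel master` output should still be inspected by hand before deciding whether to retry the failover or roll back.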
Rollback steps
- Force a failover away from the C2 node: `/opt/gitlab/embedded/bin/redis-cli -p 26379 SENTINEL failover gstg-redis-sidekiq`.
- Revert the change in MR
- Shut down the C2 node
- Change back to instance type n1-standard-2: `gcloud --project gitlab-staging-1 compute instances set-machine-type redis-sidekiq-N-db-gstg --machine-type n1-standard-2 --zone us-east1-X`
- Start up the node: `gcloud --project gitlab-staging-1 compute instances start redis-sidekiq-N-db-gstg --zone us-east1-X`
- Once the node has started up, verify that redis and sentinel started with `sudo gitlab-ctl status`
  - If necessary, start them manually with `sudo gitlab-ctl start`
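
The rollback resize uses the same sequence as the forward change, only with a different machine type, so the N and X placeholders can be filled in mechanically. The sketch below only prints the commands for review and runs nothing. The function names are hypothetical, and using `gcloud compute instances stop` for the shutdown step is an assumption, since the plan does not say how the node is shut down.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they print commands and do not run gcloud.

# Map a node index to its zone, per the plan: 01 = c, 02 = d, 03 = b.
zone_for_node() {
  case "$1" in
    01) echo "us-east1-c" ;;
    02) echo "us-east1-d" ;;
    03) echo "us-east1-b" ;;
    *)  echo "unknown node index: $1" >&2; return 1 ;;
  esac
}

# Print the gcloud commands that move node $1 to machine type $2.
print_resize_commands() {
  zone=$(zone_for_node "$1") || return 1
  instance="redis-sidekiq-$1-db-gstg"
  project="gitlab-staging-1"
  # Assumption: the shutdown is done with `gcloud compute instances stop`.
  echo "gcloud --project $project compute instances stop $instance --zone $zone"
  echo "gcloud --project $project compute instances set-machine-type $instance --machine-type $2 --zone $zone"
  echo "gcloud --project $project compute instances start $instance --zone $zone"
}

# Example: commands to roll node 02 back to n1-standard-2.
print_resize_commands 02 n1-standard-2
```

Calling `print_resize_commands 02 c2-standard-4` instead would produce the forward-change commands for the same node.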
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- Person on-call has been informed prior to change being rolled out
Edited by Craig Miskell