[gstg] Replace `redis-cache-sentinel` instances after changing `machine_type` from `n1-standard-1` to `n2d-standard-4`

`Staging` Change

Change Summary

[gstg] Replace redis-cache-sentinel instances after changing machine_type from n1-standard-1 to n2d-standard-4.

Nodes affected by this change:

redis-cache-sentinel-01-db-gstg.c.gitlab-staging-1.internal
redis-cache-sentinel-02-db-gstg.c.gitlab-staging-1.internal
redis-cache-sentinel-03-db-gstg.c.gitlab-staging-1.internal

Fulfills: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15402

Change Details

Services Impacted - ServiceRedis
Change Technician - @nnelson
Change Reviewer - @ahanselka
Time tracking - 70 minutes
Downtime Component - No downtime

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete - 10 minutes

Install any available gcloud components updates.
```
gcloud components update
```

Set some environment variables.

GITLAB_ENVIRONMENT='gstg'
GITLAB_PROJECT='gitlab-staging-1'
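
Every later command interpolates these two variables, so an unset value would silently produce a wrong host name or project. A minimal sketch of a guard follows; the function name `require_vars` is hypothetical and not part of any existing tooling.

```shell
# Hypothetical guard: fail early if a variable used by later steps is unset or empty.
require_vars() {
  for rv_name in "$@"; do
    # Indirect lookup via eval; the names come from the operator, not untrusted input.
    eval "rv_value=\${$rv_name:-}"
    if [ -z "$rv_value" ]; then
      echo "missing required variable: $rv_name" >&2
      return 1
    fi
  done
}
```

It could be called as `require_vars GITLAB_ENVIRONMENT GITLAB_PROJECT` before running any of the steps below.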

Collect a list of the hostnames of all redis-cache-sentinel service instances.

bundle exec knife search node "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '[.rows[].automatic.hostname]|sort|.[]' > /tmp/sentinel_service_instances.txt
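
Since this change expects exactly three sentinel nodes, the collected list can be sanity-checked before continuing. The sketch below assumes one host per line in the file; the function name `check_host_count` is hypothetical.

```shell
# Hypothetical check: succeed only if the file lists exactly the expected number of hosts.
check_host_count() {
  chc_file=$1
  chc_expected=$2
  if [ ! -r "$chc_file" ]; then
    echo "cannot read $chc_file" >&2
    return 1
  fi
  # grep -c . counts non-empty lines; "|| true" keeps a zero count from aborting under set -e.
  chc_actual=$(grep -c . "$chc_file" || true)
  if [ "$chc_actual" -ne "$chc_expected" ]; then
    echo "expected $chc_expected hosts, found $chc_actual" >&2
    return 1
  fi
}
```

For this change the call would be `check_host_count /tmp/sentinel_service_instances.txt 3`.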

Merge the MR which changes the redis-cache-sentinel.machine_type from n1-standard-1 to n2d-standard-4: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3574
Wait for pipelines to finish without errors.

Change Steps - steps to take to execute the change

Estimated Time to Complete - 20 minutes

Add the change::in-progress label by making a comment with the following content:

/label ~change::in-progress

Wait until the merge pipeline has completed.
Confirm that the pipeline stages have completed without error.
Verify that the plan stage is once again clean.
Press the play button on the apply gstg stage.
Wait until the apply gstg pipeline stage has completed.

Fetch and pull the latest changes.

cd ~/Documents/workspace/config-mgmt
git fetch
git pull

Execute:

cd ~/Documents/workspace/config-mgmt/environments/$GITLAB_ENVIRONMENT
tf init -upgrade

`redis-cache-sentinel-01-db-gstg`

Create the terraform plan:

target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[0]'
plan_file_sentinel_0="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_0}"; ls "${plan_file_sentinel_0}"
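
Once the plan file exists, it may be worth confirming that it replaces exactly one resource before applying it. The sketch below assumes the plan text contains Terraform's usual summary line and that only the targeted instance is affected; the function name `plan_replaces_one` is hypothetical.

```shell
# Hypothetical helper: read plan text on stdin and succeed only if the summary
# line reports exactly one resource to add and one to destroy (a replacement).
plan_replaces_one() {
  grep -q 'Plan: 1 to add, 0 to change, 1 to destroy'
}
```

Assuming `tf` wraps `terraform`, it could be used as `tf show -no-color "${plan_file_sentinel_0}" | plan_replaces_one`. If the plan legitimately touches more resources, the expected counts would need adjusting.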

Execute:

tf show "${plan_file_sentinel_0}"

tf apply "${plan_file_sentinel_0}"

Wait until the status of the host is RUNNING:

gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status
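
Rather than re-running the status command by hand, the wait can be expressed as a bounded polling loop. This is a minimal sketch; the function name `wait_for_output` and the attempt and delay values in the usage note are assumptions, not measured figures.

```shell
# Hypothetical polling helper: run a command repeatedly until its output equals
# the expected string, or give up after a fixed number of attempts.
wait_for_output() {
  wfo_expected=$1
  wfo_attempts=$2
  wfo_delay=$3
  shift 3
  wfo_i=0
  while [ "$wfo_i" -lt "$wfo_attempts" ]; do
    # "$@" is the command to poll; its stdout is compared verbatim.
    if [ "$("$@" 2>/dev/null)" = "$wfo_expected" ]; then
      return 0
    fi
    wfo_i=$((wfo_i + 1))
    sleep "$wfo_delay"
  done
  echo "timed out waiting for: $wfo_expected" >&2
  return 1
}
```

For example: `wait_for_output RUNNING 30 10 gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}" --format='value(status)'`. Here `--format='value(status)'` replaces the `jq` filter so that the whole check is a single command.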

Monitor the serial port output of the system until the startup scripts report success:

gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'

Verify that the number of processor cores on the new system matches expectations (an n2d-standard-4 instance has 4 vCPUs, so nproc should print 4):

ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'
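
For predefined standard machine types such as n1-standard-1 and n2d-standard-4, the trailing number is the vCPU count, so the expected `nproc` value can be derived from the machine type name. A minimal sketch, with the hypothetical function name `expected_vcpus`:

```shell
# Hypothetical helper: derive the expected vCPU count from a standard machine type name.
expected_vcpus() {
  # Strip everything up to and including the last hyphen: n2d-standard-4 -> 4
  printf '%s\n' "${1##*-}"
}
```

The comparison could then be written as `[ "$(ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- nproc)" -eq "$(expected_vcpus n2d-standard-4)" ]`. This shortcut does not apply to custom or shared-core machine types.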

Reconfigure GitLab to start the redis-sentinel service:

ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'

Verify that the redis-sentinel service is back up and running on the host system:

ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'pgrep -fl redis-sentinel'

Verify that the quorum looks correct:

ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"

Verify that the output resembles:

OK 3 usable Sentinels. Quorum and failover authorization can be reached
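
The comparison against the expected output can be made mechanical. The sketch below reads the `ckquorum` output on stdin and checks only the leading status and Sentinel count; the function name `quorum_ok` is hypothetical.

```shell
# Hypothetical check: read ckquorum output on stdin and succeed only if it
# reports OK with the expected number of usable Sentinels.
quorum_ok() {
  grep -q "^OK $1 usable Sentinels"
}
```

Piping the `ckquorum` command above into `quorum_ok 3` returns a non-zero exit status for any other result, which makes the step usable inside a script.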

`redis-cache-sentinel-02-db-gstg`

Create the terraform plan:

target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[1]'
plan_file_sentinel_1="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_1}"; ls "${plan_file_sentinel_1}"

Execute:

tf show "${plan_file_sentinel_1}"

tf apply "${plan_file_sentinel_1}"

Wait until the status of the host is RUNNING:

gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status

Monitor the serial port output of the system until the startup scripts report success:

gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'

Verify that the number of processor cores on the new system matches expectations (an n2d-standard-4 instance has 4 vCPUs, so nproc should print 4):

ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'

Reconfigure GitLab to start the redis-sentinel service:

ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'

Verify that the redis-sentinel process is back up and running on every sentinel host:

bundle exec knife ssh "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" 'pgrep -fl redis-sentinel'

Verify that the quorum looks correct:

ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"

Verify that the output resembles:

OK 3 usable Sentinels. Quorum and failover authorization can be reached

`redis-cache-sentinel-03-db-gstg`

Create the terraform plan:

target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[2]'
plan_file_sentinel_2="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_2}"; ls "${plan_file_sentinel_2}"

Execute:

tf show "${plan_file_sentinel_2}"

tf apply "${plan_file_sentinel_2}"

Wait until the status of the host is RUNNING:

gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status

Monitor the serial port output of the system until the startup scripts report success:

gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'

Verify that the number of processor cores on the new system matches expectations (an n2d-standard-4 instance has 4 vCPUs, so nproc should print 4):

ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'

Reconfigure GitLab to start the redis-sentinel service:

ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'

Verify that the redis-sentinel process is back up and running on every sentinel host:

bundle exec knife ssh "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" 'pgrep -fl redis-sentinel'

Verify that the quorum looks correct:

ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"

Verify that the output resembles:

OK 3 usable Sentinels. Quorum and failover authorization can be reached
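
The three per-host sections above differ only in the zero-based terraform index and the one-based host number. The sketch below makes that mapping explicit; the function names `sentinel_target` and `sentinel_host` are hypothetical.

```shell
# Hypothetical helpers mapping a zero-based terraform index to the names used above.
sentinel_target() {
  printf 'module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[%d]\n' "$1"
}

sentinel_host() {
  # Index 0 -> host 01, index 1 -> host 02, and so on.
  printf 'redis-cache-sentinel-%02d-db-%s\n' "$(($1 + 1))" "$2"
}
```

For example, `sentinel_target 1` and `sentinel_host 1 "${GITLAB_ENVIRONMENT}"` yield the target address and host name used in the second section. The hosts should still be replaced one at a time, with the quorum check passing before moving on to the next.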

All done

Remove the change::in-progress label by making a comment with the following content:

/unlabel ~change::in-progress

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete - 1 minute

TODO

Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete - 35 minutes

Create a revert merge request from the terraform merge request for this issue: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3574
Paste a link to the revert merge request here: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/0000
Have the revert merge request peer reviewed.
Merge the revert merge request and repeat the change steps above.

Monitoring

Key metrics to observe

Metric: node_schedstat_waiting_seconds_total
- Locations:
- What changes to this metric should prompt a rollback: Any sustained elevation above 1 for longer than 10 minutes.
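
The rollback criterion can be read as "more than 10 consecutive one-minute samples above 1". The sketch below encodes that reading. It assumes the input is one numeric sample per line, taken once per minute, and that each sample is already a rate derived from the counter. The function name `sustained_above` is hypothetical.

```shell
# Hypothetical check: succeed if more than "run" consecutive samples on stdin
# exceed "limit". Arguments: $1 = limit, $2 = run length.
sustained_above() {
  awk -v limit="$1" -v run="$2" '
    ($1 + 0) > (limit + 0) { streak++; if (streak > run) hit = 1; next }
    { streak = 0 }
    END { exit (hit ? 0 : 1) }
  '
}
```

With per-minute samples in a file, `sustained_above 1 10 < samples.txt` succeeds only when the rollback condition is met. The file name here is illustrative.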

Summary of infrastructure changes

Does this change introduce new compute instances? No
Does this change re-size any existing compute instances? Yes
Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No

This is a change plan to change the machine_type of the redis-cache-sentinel service instances from n1-standard-1 to n2d-standard-4.

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

Edited Mar 15, 2022 by Nels Nelson
