[gstg] Replace `redis-cache-sentinel` instances after changing `machine_type` from `n1-standard-1` to `n2d-standard-4`

Staging Change

Change Summary

[gstg] Replace redis-cache-sentinel instances after changing machine_type from n1-standard-1 to n2d-standard-4.

Nodes affected by this change:

  • redis-cache-sentinel-01-db-gstg.c.gitlab-staging-1.internal
  • redis-cache-sentinel-02-db-gstg.c.gitlab-staging-1.internal
  • redis-cache-sentinel-03-db-gstg.c.gitlab-staging-1.internal

Fulfills: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15402

Change Details

  1. Services Impacted - ServiceRedis
  2. Change Technician - @nnelson
  3. Change Reviewer - @ahanselka
  4. Time tracking - 70 minutes
  5. Downtime Component - No downtime

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete - 10 minutes

  • Install any available updates for gcloud components.
    gcloud components update
  • Set some environment variables.
    GITLAB_ENVIRONMENT='gstg'
    GITLAB_PROJECT='gitlab-staging-1'
  • Collect a list of the FQDN values for all redis-cache-sentinel service instances.
    bundle exec knife search node "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '[.rows[].automatic.hostname]|sort|.[]' > /tmp/sentinel_service_instances.txt
  • Merge the MR which changes the redis-cache-sentinel.machine_type from n1-standard-1 to n2d-standard-4: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3574
  • Wait for pipelines to finish without errors.
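
Before merging, it can help to sanity-check the collected list. A minimal sketch, assuming the list was written to /tmp/sentinel_service_instances.txt by the knife search above (the `check_sentinel_count` helper is illustrative, not existing tooling):

```shell
# Confirm the collected sentinel list contains exactly three hosts.
check_sentinel_count() {
  file="$1"
  expected=3
  # Arithmetic expansion trims any whitespace wc may emit.
  count=$(( $(wc -l < "$file") ))
  if [ "$count" -ne "$expected" ]; then
    echo "Expected ${expected} sentinel instances, found ${count}" >&2
    return 1
  fi
  echo "OK: ${count} sentinel instances"
}

# Example: check_sentinel_count /tmp/sentinel_service_instances.txt
```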

Change Steps - steps to take to execute the change

Estimated Time to Complete - 20 minutes

/label changein-progress

  • Wait until the merge pipeline has completed.
  • Confirm that the pipeline stages have completed without error.
  • Verify that the plan stage is once again clean.
  • Press the play button on the apply gstg stage.
  • Wait until the apply gstg pipeline stage has completed.
  • Fetch and pull the latest changes.
    cd ~/Documents/workspace/config-mgmt
    git fetch
    git pull
  • Execute:
    cd ~/Documents/workspace/config-mgmt/environments/$GITLAB_ENVIRONMENT
    tf init -upgrade

redis-cache-sentinel-01-db-gstg

  • Create the terraform plan:
    target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[0]'
    plan_file_sentinel_0="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"
    tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_0}"
    ls "${plan_file_sentinel_0}"
  • Execute:
    tf show "${plan_file_sentinel_0}"
    tf apply "${plan_file_sentinel_0}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status
  • Monitor the serial port output of the system until the startup scripts have succeeded:
    gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'
  • Verify that the number of processor cores for the new system matches expectations:
    ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'
  • Reconfigure GitLab to start the redis-sentinel service:
    ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'
  • Verify that the redis-sentinel service is back up and running on the host system:
    ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'pgrep -fl redis-sentinel'
  • Verify that the quorum looks correct:
    ssh "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"
    • Verify that the output resembles:
      OK 3 usable Sentinels. Quorum and failover authorization can be reached
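
The two wait steps in each host section can be scripted as a polling loop. A generic sketch; the `wait_for` helper is hypothetical, and the usage comment reuses the gcloud/jq pipeline shown above:

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# wait_for <seconds-between-tries> <max-tries> <command...>
wait_for() {
  interval="$1"
  tries="$2"
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for: $*" >&2
  return 1
}

# Example (same status check as above; assumes gcloud and jq are installed):
# wait_for 10 60 sh -c \
#   'gcloud --project="${GITLAB_PROJECT}" compute instances describe \
#      "redis-cache-sentinel-01-db-${GITLAB_ENVIRONMENT}" --format=json \
#      | jq --raw-output .status | grep -qx RUNNING'
```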

redis-cache-sentinel-02-db-gstg

  • Create the terraform plan:
    target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[1]'
    plan_file_sentinel_1="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"
    tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_1}"
    ls "${plan_file_sentinel_1}"
  • Execute:
    tf show "${plan_file_sentinel_1}"
    tf apply "${plan_file_sentinel_1}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status
  • Monitor the serial port output of the system until the startup scripts have succeeded:
    gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'
  • Verify that the number of processor cores for the new system matches expectations:
    ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'
  • Reconfigure GitLab to start the redis-sentinel service:
    ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'
  • Verify that the redis-sentinel process is running on all sentinel hosts:
    bundle exec knife ssh "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" 'pgrep -fl redis-sentinel'
  • Verify that the quorum looks correct:
    ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"
    • Verify that the output resembles:
      OK 3 usable Sentinels. Quorum and failover authorization can be reached
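
The quorum verification can be reduced to a predicate on the ckquorum reply, which avoids eyeballing the output. A sketch; the `quorum_ok` helper name is illustrative:

```shell
# Return success iff a `sentinel ckquorum` reply reports three usable
# Sentinels, matching the expected output quoted above.
quorum_ok() {
  case "$1" in
    "OK 3 usable Sentinels."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# reply=$(ssh "redis-cache-sentinel-02-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" \
#   -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache")
# quorum_ok "$reply" || echo "quorum check failed" >&2
```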

redis-cache-sentinel-03-db-gstg

  • Create the terraform plan:
    target='module.redis-cache.google_compute_instance.sentinel_instance_with_attached_disk[2]'
    plan_file_sentinel_2="gl-infra-6534_sentinel_$(date -u '+%Y%m%d_%H%M%S').plan"
    tf plan -replace="${target}" -target="${target}" -out="${plan_file_sentinel_2}"
    ls "${plan_file_sentinel_2}"
  • Execute:
    tf show "${plan_file_sentinel_2}"
    tf apply "${plan_file_sentinel_2}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}" --format=json | jq --raw-output .status
  • Monitor the serial port output of the system until the startup scripts have succeeded:
    gcloud compute --project="${GITLAB_PROJECT}" instances get-serial-port-output redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT} --port=1 2>&1 | grep 'google-startup-scripts.service: Succeeded'
  • Verify that the number of processor cores for the new system matches expectations:
    ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'nproc'
  • Reconfigure GitLab to start the redis-sentinel service:
    ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- 'sudo gitlab-ctl reconfigure'
  • Verify that the redis-sentinel process is running on all sentinel hosts:
    bundle exec knife ssh "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" 'pgrep -fl redis-sentinel'
  • Verify that the quorum looks correct:
    ssh "redis-cache-sentinel-03-db-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal" -- "/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel ckquorum ${GITLAB_ENVIRONMENT}-redis-cache"
    • Verify that the output resembles:
      OK 3 usable Sentinels. Quorum and failover authorization can be reached

All done

/unlabel changein-progress

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete - 1 minute

  • Verify that the redis-sentinel process is running on all sentinel hosts:
    bundle exec knife ssh "fqdn:redis-cache-sentinel-*-db-${GITLAB_ENVIRONMENT}*" 'pgrep -fl redis-sentinel'
  • Verify that the quorum is intact on each sentinel, as in the ckquorum checks above.

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete - 35 minutes

  • Revert the machine_type MR in config-mgmt and wait for its pipelines to finish without errors.
  • Repeat the per-instance plan/apply replacement cycle above, one instance at a time; this recreates each sentinel instance with the original n1-standard-1 machine_type.
  • After each replacement, re-run the reconfigure and quorum verification steps before moving on to the next instance.

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? Yes
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.? No

This is a change plan to resize the machine_type of the redis-cache-sentinel service instances from n1-standard-1 to n2d-standard-4.

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the [Change Management Criticalities](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#change-criticalities).

  • This issue has the change technician as the assignee.

  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.

  • This Change Issue is linked to the appropriate Issue and/or Epic.

  • Necessary approvals have been completed based on the Change Management Workflow.

  • Change has been tested in staging and results noted in a comment on this issue.

  • A dry-run has been conducted and results noted in a comment on this issue.

  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)

  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)

  • There are currently no active incidents.
