[gprd] Replace `camoproxy` instances after changing `machine_type` from `n1-standard-1` to `n2d-standard-4`

Production Change

Change Summary

[gprd] Replace camoproxy instances after changing machine_type from n1-standard-1 to n2d-standard-4.

Nodes affected by this change:

  • camoproxy-01-sv-gprd.c.gitlab-production.internal
  • camoproxy-02-sv-gprd.c.gitlab-production.internal
  • camoproxy-03-sv-gprd.c.gitlab-production.internal

Fulfills: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15110

Change Details

  1. Services Impacted - Service::Camoproxy
  2. Change Technician - @nnelson
  3. Change Reviewer - @f_santos
  4. Time tracking - 70 minutes
  5. Downtime Component - No downtime

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete - 10 minutes

  • Set some environment variables.
    GITLAB_ENVIRONMENT='gprd'
    GITLAB_PROJECT='gitlab-production'
  • Collect a list of the fqdn values for all camoproxy service instances.
    bundle exec knife search node "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '[.rows[].automatic.hostname]|sort|.[]' > /tmp/camoproxy_service_instances.txt
  • Merge the MR which changes the camoproxy.machine_type from n1-standard-1 to n2d-standard-4: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3509
  • Wait for pipelines to finish without errors.
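The collected instance list can be sanity-checked before proceeding. The snippet below is a sketch, not part of the official runbook; the list is seeded inline here for illustration, whereas on the bastion it comes from the knife search above.

```shell
# Sanity-check the collected instance list: we expect exactly the three
# camoproxy nodes named in this plan. (Seeded data for illustration; on
# the bastion the file is produced by the knife search above.)
cat > /tmp/camoproxy_service_instances.txt <<'EOF'
camoproxy-01-sv-gprd
camoproxy-02-sv-gprd
camoproxy-03-sv-gprd
EOF
count=$(grep -c '^camoproxy-0[1-3]-sv-gprd$' /tmp/camoproxy_service_instances.txt)
if [ "${count}" -eq 3 ]; then
  echo "instance list ok (${count} nodes)"
else
  echo "unexpected instance list (${count} nodes)" >&2
fi
```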

Change Steps - steps to take to execute the change

Estimated Time to Complete - 20 minutes

/label ~"change::in-progress"

  • Wait until the merge pipeline has completed, then click the run button on the apply stage.
  • Fetch and pull the latest changes.
    cd ~/Documents/workspace/config-mgmt
    git fetch
    git pull
  • Execute:
    cd ~/Documents/workspace/config-mgmt/environments/gprd
    tf init -upgrade
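The three per-node sections below repeat the same plan/show/apply sequence. The helper sketched here (not part of the official runbook; `tf` is the terraform wrapper used throughout this plan) captures that pattern. Defining it applies nothing; it only runs when called with an index.

```shell
# Sketch: one function for the repeated per-node replace sequence.
# Nothing is planned or applied until the function is actually called.
replace_camoproxy_node() {
  local idx="$1"   # zero-based Terraform index: node 01 is index 0
  local target="module.camoproxy.google_compute_instance.default[${idx}]"
  local plan_file="gl-infra-6410_camoproxy-gprd_$(date -u '+%Y%m%d_%H%M%S').plan"
  tf plan -replace="${target}" -target="${target}" -out="${plan_file}" || return 1
  tf show "${plan_file}"
  tf apply "${plan_file}"
}
type replace_camoproxy_node >/dev/null && echo "helper defined"
```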

camoproxy-01

  • To replace the first node, execute:
    target='module.camoproxy.google_compute_instance.default[0]'
    plan_file_camoproxy_0="gl-infra-6410_camoproxy-gprd_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_0}"; ls "${plan_file_camoproxy_0}"
    tf show "${plan_file_camoproxy_0}"
    tf apply "${plan_file_camoproxy_0}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-01-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  • Verify that the go-camo process is back up and running on the host:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  • Check for request traffic logged in the camoproxy logs:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'
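Rather than re-running the status check by hand, the "wait until RUNNING" step can be wrapped in a small polling loop. This is a sketch: in real use the loop would wrap the gcloud describe command shown above; a stub stands in for it here so the logic is demonstrable.

```shell
# Sketch of a polling helper: re-run a status command until its output
# matches the wanted state, with a bounded number of attempts.
wait_for_status() {
  local want="$1"; shift
  local tries=0
  until "$@" | grep -q "${want}"; do
    tries=$((tries + 1))
    if [ "${tries}" -ge 60 ]; then
      echo "timed out waiting for ${want}" >&2
      return 1
    fi
    sleep 5
  done
  echo "status is ${want}"
}

# Stub standing in for the gcloud describe call (illustration only):
fake_describe() { echo "RUNNING,us-east1-c"; }
wait_for_status RUNNING fake_describe
```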

camoproxy-02

  • To replace the second node, execute:
    target='module.camoproxy.google_compute_instance.default[1]'
    plan_file_camoproxy_1="gl-infra-6410_camoproxy-gprd_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_1}"; ls "${plan_file_camoproxy_1}"
    tf show "${plan_file_camoproxy_1}"
    tf apply "${plan_file_camoproxy_1}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-02-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  • Verify that the go-camo process is back up and running on the host:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  • Check for request traffic logged in the camoproxy logs:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'

camoproxy-03

  • To replace the third node, execute:
    target='module.camoproxy.google_compute_instance.default[2]'
    plan_file_camoproxy_2="gl-infra-6410_camoproxy-gprd_$(date -u '+%Y%m%d_%H%M%S').plan"; tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_2}"; ls "${plan_file_camoproxy_2}"
    tf show "${plan_file_camoproxy_2}"
    tf apply "${plan_file_camoproxy_2}"
  • Wait until the status of the host is RUNNING:
    gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-03-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  • Verify that the go-camo process is back up and running on the host:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  • Check for request traffic logged in the camoproxy logs:
    bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'
  • Remove the change::in-progress label by making a comment with the following content:

/unlabel ~"change::in-progress"

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete - 5 minutes

  • Verify that all systems now have the expected resource allocations (4 vCPUs).
    ssh camoproxy-01-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
    ssh camoproxy-02-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
    ssh camoproxy-03-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
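The remote check above reduces to "does nproc report 4?", matching the 4 vCPUs of the n2d-standard-4 shape. A local illustration of the same comparison (the ssh plumbing is omitted; this helper is not part of the runbook):

```shell
# Mirror of the remote test: confirm a reported CPU count matches the
# n2d-standard-4 shape (4 vCPUs).
check_vcpus() {
  local got="$1"
  if [ "${got}" -eq 4 ]; then echo confirmed; else echo failed; fi
}
check_vcpus 4   # a resized node: prints "confirmed"
check_vcpus 1   # an un-replaced n1-standard-1 node would print "failed"
```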

Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete - 35 minutes

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? Yes
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No

This is a change plan to resize the camoproxy service instances' machine_type from n1-standard-1 to n2d-standard-4.

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  • The change plan includes success measures for all steps/milestones during the execution.
  • The change adequately minimizes risk within the environment/service.
  • The performance implications of executing the change are well-understood and documented.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  • The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Nels Nelson