[gprd] Replace `camoproxy` instances after changing `machine_type` from `n1-standard-1` to `n2-standard-4`
Production Change
Change Summary
[gprd] Replace `camoproxy` instances after changing `machine_type` from `n1-standard-1` to `n2-standard-4`.

Nodes affected by this change:

- camoproxy-01-sv-gprd.c.gitlab-production.internal
- camoproxy-02-sv-gprd.c.gitlab-production.internal
- camoproxy-03-sv-gprd.c.gitlab-production.internal
Fulfills: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15110
Change Details
- Services Impacted - Camoproxy
- Change Technician - @nnelson
- Change Reviewer - @f_santos
- Time tracking - 70 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete - 10 minutes

- Set some environment variables:

  ```shell
  GITLAB_ENVIRONMENT='gprd'
  GITLAB_PROJECT='gitlab-production'
  ```

- Collect a list of the `fqdn` values for all camoproxy service instances:

  ```shell
  bundle exec knife search node "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" --format=json | jq --raw-output '[.rows[].automatic.hostname]|sort|.[]' > /tmp/camoproxy_service_instances.txt
  ```

- Merge the MR which changes the camoproxy `machine_type` from `n1-standard-1` to `n2-standard-4`: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3509

- Wait for pipelines to finish without errors.
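As an optional sanity check (not part of the original plan), the collected list can be verified to contain the expected number of nodes before proceeding. The helper name `check_instance_count` and the expected count of 3 are assumptions based on the node list above:

```shell
# Hypothetical sanity check: confirm the collected fqdn list has the
# expected number of entries (3, per the node list for this change).
check_instance_count() {
  local file="${1:-/tmp/camoproxy_service_instances.txt}" expected="${2:-3}"
  local count
  count="$(wc -l < "${file}")"
  if [[ "${count}" -eq "${expected}" ]]; then
    echo "found ${count} camoproxy instances"
  else
    echo "unexpected instance count: ${count} (expected ${expected})" >&2
    return 1
  fi
}
```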
Change Steps - steps to take to execute the change

Estimated Time to Complete - 20 minutes

- Add the ~"change::in-progress" label by making a comment with the following content:

  ```
  /label ~"change::in-progress"
  ```

- Wait until the merge pipeline has completed, and click the run button on the `apply` stage.

- Fetch and pull the latest changes:

  ```shell
  cd ~/Documents/workspace/config-mgmt
  git fetch
  git pull
  ```

- Initialize terraform for the `gprd` environment:

  ```shell
  cd ~/Documents/workspace/config-mgmt/environments/gprd
  tf init -upgrade
  ```
camoproxy-01

- To replace the first node, execute:

  ```shell
  target='module.camoproxy.google_compute_instance.default[0]'
  plan_file_camoproxy_0="gl-infra-6410_camoproxy-ci_$(date -u '+%Y%m%d_%H%M%S').plan"
  tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_0}"
  ls "${plan_file_camoproxy_0}"
  tf show "${plan_file_camoproxy_0}"
  tf apply "${plan_file_camoproxy_0}"
  ```

- Wait until the status of the host is eventually `RUNNING`:

  ```shell
  gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-01-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  ```

- Verify that the `go-camo` process is eventually back up and running on the host system:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  ```

- Check for request traffic logged in the camoproxy logs:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'
  ```
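The "wait until the status is `RUNNING`" step could be automated with a small polling loop rather than re-running `gcloud` by hand. This is a sketch, not part of the original plan: `wait_for_running` and `instance_status` are hypothetical helper names, and the 10-second poll interval and 300-second default timeout are assumptions; it relies on the `GITLAB_PROJECT` variable set in the pre-change steps.

```shell
# instance_status wraps the gcloud call from the step above, so it can be
# stubbed for testing; assumes GITLAB_PROJECT is set per the pre-change steps.
instance_status() {
  gcloud --project="${GITLAB_PROJECT}" compute instances describe "$1" \
    --format='value(status)'
}

# Hypothetical helper: poll an instance until it reports RUNNING, or time out.
wait_for_running() {
  local host="$1" timeout="${2:-300}" elapsed=0 status
  while (( elapsed < timeout )); do
    status="$(instance_status "${host}")"
    if [[ "${status}" == "RUNNING" ]]; then
      echo "${host} is RUNNING after ${elapsed}s"
      return 0
    fi
    sleep 10
    (( elapsed += 10 ))
  done
  echo "timed out waiting for ${host} after ${timeout}s" >&2
  return 1
}

# Example: wait_for_running "camoproxy-01-sv-${GITLAB_ENVIRONMENT}"
```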
camoproxy-02

- To replace the second node, execute:

  ```shell
  target='module.camoproxy.google_compute_instance.default[1]'
  plan_file_camoproxy_1="gl-infra-6410_camoproxy-ci_$(date -u '+%Y%m%d_%H%M%S').plan"
  tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_1}"
  ls "${plan_file_camoproxy_1}"
  tf show "${plan_file_camoproxy_1}"
  tf apply "${plan_file_camoproxy_1}"
  ```

- Wait until the status of the host is eventually `RUNNING`:

  ```shell
  gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-02-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  ```

- Verify that the `go-camo` process is eventually back up and running on the host system:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  ```

- Check for request traffic logged in the camoproxy logs:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'
  ```
camoproxy-03

- To replace the third node, execute:

  ```shell
  target='module.camoproxy.google_compute_instance.default[2]'
  plan_file_camoproxy_2="gl-infra-6410_camoproxy-ci_$(date -u '+%Y%m%d_%H%M%S').plan"
  tf plan -replace="${target}" -target="${target}" -out="${plan_file_camoproxy_2}"
  ls "${plan_file_camoproxy_2}"
  tf show "${plan_file_camoproxy_2}"
  tf apply "${plan_file_camoproxy_2}"
  ```

- Wait until the status of the host is eventually `RUNNING`:

  ```shell
  gcloud --project="${GITLAB_PROJECT}" compute instances describe "camoproxy-03-sv-${GITLAB_ENVIRONMENT}" --format='csv[no-heading](status,zone)'
  ```

- Verify that the `go-camo` process is eventually back up and running on the host system:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-*-sv-${GITLAB_ENVIRONMENT}*" 'pgrep -fl go-camo'
  ```

- Check for request traffic logged in the camoproxy logs:

  ```shell
  bundle exec knife ssh "fqdn:camoproxy-03-sv-${GITLAB_ENVIRONMENT}*" 'sudo tail -n10 /var/log/camoproxy/current'
  ```

- Remove the ~"change::in-progress" label by making a comment with the following content:

  ```
  /unlabel ~"change::in-progress"
  ```
Post-Change Steps - steps to take to verify the change

Estimated Time to Complete - 5 minutes

- Verify that all systems now have the expected resource allocations (4 vCPUs):

  ```shell
  ssh camoproxy-01-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
  ssh camoproxy-02-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
  ssh camoproxy-03-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal -- '[[ $(nproc) -eq 4 ]] && echo confirmed || echo failed'
  ```
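The three per-host checks above could also be expressed as a loop. A minimal sketch, assuming working SSH access to the nodes; `check_vcpus` and `node_nproc` are hypothetical helper names, not part of the original plan:

```shell
# node_nproc wraps the remote nproc call, so it can be stubbed for testing.
node_nproc() {
  ssh "$1" -- nproc
}

# Hypothetical helper: print "confirmed" or "failed" per host, comparing its
# reported CPU count against the expected value; non-zero exit on any failure.
check_vcpus() {
  local expected="$1"; shift
  local host rc=0
  for host in "$@"; do
    if [[ "$(node_nproc "${host}")" == "${expected}" ]]; then
      echo "${host}: confirmed"
    else
      echo "${host}: failed"
      rc=1
    fi
  done
  return "${rc}"
}

# Example:
# check_vcpus 4 camoproxy-0{1,2,3}-sv-${GITLAB_ENVIRONMENT}.c.${GITLAB_PROJECT}.internal
```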
Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete - 35 minutes

- Create a revert merge request from the terraform merge request for this issue: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3506
- Paste a link to the revert merge request here: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/0000
- Have the revert merge request peer reviewed.
- Merge the revert merge request and repeat the change steps above.
Monitoring

Key metrics to observe

- Metric: gprd camoproxy Service Apdex
- Location: https://dashboards.gitlab.net/d/camoproxy-main/camoproxy-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- What changes to this metric should prompt a rollback: Any sustained degradation below 96.5% for longer than 5 minutes.
Summary of infrastructure changes

- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? Yes
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No

This is a change plan to resize the camoproxy service instances' `machine_type` from `n1-standard-1` to `n2-standard-4`.
Change Reviewer checklist

- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.