[GSTG] Reprovision HAProxy with a single NIC

Production Change

Change Summary

Change Details

Services Impacted - Service::HAProxy
Change Technician - @ahmadsherif
Change Reviewer - @jarv
Time tracking - 1h 20 minutes
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

Set label change::in-progress on this issue
Make sure no deployment to staging is taking place (check the #announcements Slack channel)

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes

Merge https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3573
Run ssh fe-01-lb-gstg.c.gitlab-staging-1.internal 'sudo /usr/local/sbin/drain_haproxy.sh -w 60'
Locally, in config-mgmt/environments/gstg, run tf apply -target module.fe-lb.google_compute_instance.default[0]
After the node has been fully provisioned, make sure it is receiving traffic by checking the logs at /var/log/haproxy.log, and confirm that its status in the GCP dashboard is marked as healthy.

Run the following snippet once for each remaining server, and check off the node name below after each run completes:

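# Start from node 1 (already done above) and advance to the next node on every run
i=${i:-1}
i=$((i+1))
ssh fe-0$i-lb-gstg.c.gitlab-staging-1.internal 'sudo /usr/local/sbin/drain_haproxy.sh -w 60'
# Host names are 1-based, Terraform instance indexes are 0-based
tf apply -target "module.fe-lb.google_compute_instance.default[$((i-1))]"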

fe-02-lb-gstg.c.gitlab-staging-1.internal
fe-03-lb-gstg.c.gitlab-staging-1.internal
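
The mapping between host numbers and Terraform indexes in the snippet above is easy to get wrong by one, so it may help to print the commands for the remaining nodes before running anything. The following is a minimal dry-run sketch, not part of the approved plan: lb_host, tf_target and plan_node are hypothetical helper names, and the script only prints commands, it does not execute them.

```shell
#!/bin/sh
# Dry-run sketch: prints the drain and apply commands for the remaining nodes.
# lb_host, tf_target and plan_node are hypothetical helpers, not existing tooling.
set -eu

# Host name for a 1-based node number, e.g. 2 -> fe-02-lb-gstg...
lb_host() {
  printf 'fe-%02d-lb-gstg.c.gitlab-staging-1.internal\n' "$1"
}

# Terraform resource address; the instance index is zero-based.
tf_target() {
  printf 'module.fe-lb.google_compute_instance.default[%d]\n' "$(( $1 - 1 ))"
}

# Print (do not run) the two commands for one node, in execution order.
plan_node() {
  printf "ssh %s 'sudo /usr/local/sbin/drain_haproxy.sh -w 60'\n" "$(lb_host "$1")"
  printf 'tf apply -target "%s"\n' "$(tf_target "$1")"
}

# Node 1 is handled by the explicit steps above; 2 and 3 remain.
for n in 2 3; do
  plan_node "$n"
done
```

Reviewing the printed lines against the node checklist is a cheap way to confirm that fe-02 maps to default[1] and fe-03 to default[2] before any node is drained.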

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 3 minutes

Check /var/log/haproxy.log on each node for traffic logs
Check in the GCP dashboard that all nodes are marked as healthy
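
The log check can be repeated across all three nodes with one command per host. The sketch below only prints those commands; the use of sudo and of tail -n 20 is an assumption about how the log is read on these nodes, and lb_host and verify_cmd are hypothetical helper names.

```shell
#!/bin/sh
# Dry-run sketch: prints one log-inspection command per HAProxy node.
# Assumptions: the log needs sudo to read, and the last 20 lines are enough
# to see whether traffic is arriving. Nothing is executed remotely.
set -eu

# Host name for a 1-based node number (hypothetical helper).
lb_host() {
  printf 'fe-%02d-lb-gstg.c.gitlab-staging-1.internal\n' "$1"
}

# Command that shows the tail of the HAProxy log on one node.
verify_cmd() {
  printf "ssh %s 'sudo tail -n 20 /var/log/haproxy.log'\n" "$(lb_host "$1")"
}

for n in 1 2 3; do
  verify_cmd "$n"
done
```

Fresh timestamps in the output of each command suggest the node is serving requests again; the GCP health status still has to be confirmed separately in the dashboard.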

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

Revert and apply https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3573 (this will reprovision all nodes at once)

Monitoring

Key metrics to observe

Metric: Frontend metrics
- Location: https://dashboards.gitlab.net/d/frontend-main/frontend-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- What changes to this metric should prompt a rollback: A drop in the rate of requests

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

Edited Mar 15, 2022 by Ahmad Sherif