Clean up terraform around altssh.gitlab.com

Production Change

Change Summary

Currently we are running altssh.gitlab.com via a port remapping in Cloudflare. While terraform does represent this accurately, it is currently convoluted with indirections and module-level overrides that have been left in since the migration.
This change will remove those indirections to make the terraform code easy to comprehend, and will remove the unused altssh HAProxy nodes. Both are done in one change because they are interwoven and safer to apply together.

Change Details

Services Impacted - ServiceGit ServiceCloudflare
Change Technician - @T4cC0re
Change Criticality - C1
Change Type - changescheduled
Change Reviewer - @cmcfarland
Due Date - 2021-06-02 1700 UTC
Time tracking - 80 minutes (+ 60 minutes rollback)
Downtime Component - No downtime expected. Decommissioned nodes are not in traffic and reconfiguration only affects terraform state. Potential to cause long-lasting, full outage of GitLab.com if steps are not carefully followed.

There is an emergency comms channel in Slack: #production-2169

Detailed steps for the change

Expectations and assumptions of the change (sourced from the MR)

We should have a PCL in effect while rolling this out
- Prevent accidental deploys and tf applies
- !! A tf apply of this without state manipulations WILL BRING DOWN GITLAB.COM !!
- With the state manipulation and assumptions below in effect, this is safe to apply and does not re-configure active traffic paths.
Back up the terraform state locally. Delete the backup after the change and do not upload it, as it contains credentials.
When the resources are moved and the plan output matches the plan in the description, this implicitly proves that the resources removed by this MR are not in use.
Before the changes are applied, everything can be undone by reversing the order of the arguments to tf state mv.
If the plan shows more deletions or additional creations than the plan in the description, this should be investigated.
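The local state backup described above could be taken along these lines. This is a minimal sketch, not the procedure that was actually run: it assumes the `tf` wrapper used elsewhere in this issue passes `state pull` through to terraform, and it reuses the /tmp/backup.tfstate path from the post-change steps. The function name `backup_state` and the `TF` and `BACKUP` variables are made up for illustration.

```shell
# Sketch only. TF defaults to the `tf` wrapper; override with TF=terraform if needed.
TF="${TF:-tf}"
BACKUP="${BACKUP:-/tmp/backup.tfstate}"

backup_state() {
  # The state contains credentials, so write it with a restrictive umask.
  ( umask 077 && "$TF" state pull > "$BACKUP" ) || return 1
  # An existing file keeps its old mode, so tighten it explicitly as well.
  chmod 600 "$BACKUP" || return 1
  # Refuse to continue with an empty backup.
  [ -s "$BACKUP" ] || { echo "backup at $BACKUP is empty" >&2; return 1; }
  echo "state backed up to $BACKUP"
}
```

Calling `backup_state` before the first `tf state mv` leaves a copy that can be inspected if the plan later looks wrong. The file stays on the local machine only.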

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 15 minutes

Merge request: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2286

Make sure the MR is on commit affad96057ea1c9e5eeedc52113441b3851ec5eb. This commit was reviewed at length and matched the expectations set in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2286#note_99244
Have the MR approved by the reviewer. (This is intentionally not done prior to the change, to prevent accidental merges.)
Ensure a hard production change lock is in place before starting. This is a delicate change and must not run concurrently with any other change.
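The commit check above can be made mechanical. The following is a small sketch under the assumption that it is run from a local checkout of the MR branch; `assert_sha` is a hypothetical helper, not part of any existing tooling.

```shell
# Commit that was reviewed, taken from the pre-change steps of this issue.
EXPECTED_SHA="affad96057ea1c9e5eeedc52113441b3851ec5eb"

assert_sha() {
  # $1 = expected commit, $2 = commit that is actually checked out
  if [ "$1" = "$2" ]; then
    echo "commit matches: $2"
  else
    echo "commit mismatch: expected $1, got $2" >&2
    return 1
  fi
}
```

Usage from the checkout: `assert_sha "$EXPECTED_SHA" "$(git rev-parse HEAD)"`. A non-zero exit means the branch has moved since the review and the change should not proceed.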

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes

        Pipes: [                                                    0/5000]
  Connections: [                                                   0/20000]

At this point we have moved the state. This can still be easily rolled back.

Run a targeted tf plan: tf plan -target=module.gcp-tcp-lb-altssh-spectrum -target=module.gprd-dns-record -target=module.gcp-tcp-lb-altssh -target=module.gcp-tcp-lb-spectrum -target=module.gcp-tcp-lb -target=module.fe-lb-altssh -out prod-2169.tfplan
- The output should match this comment
- output matches
- In particular, there are no cloudflare_spectrum_application resources destroyed.
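The "no cloudflare_spectrum_application resources destroyed" condition can also be checked against the saved plan rather than by eye. The sketch below assumes the `tf` wrapper passes `show -no-color` through to terraform, and it relies on the "will be destroyed" and "must be replaced" header lines of terraform's human-readable plan output. The function names are illustrative, and this does not replace comparing the plan with the reviewed one.

```shell
# Sketch only. TF defaults to the `tf` wrapper used in this issue.
TF="${TF:-tf}"

# Reads plan text on stdin and prints every resource header line that
# destroys or replaces a cloudflare_spectrum_application resource.
spectrum_removals() {
  grep -E '^[[:space:]]*# .*cloudflare_spectrum_application\..* (will be destroyed|must be replaced)' || true
}

check_plan() {
  # $1 = saved plan file, e.g. prod-2169.tfplan
  text="$("$TF" show -no-color "$1")" || return 1
  hits="$(printf '%s\n' "$text" | spectrum_removals)"
  if [ -n "$hits" ]; then
    echo "ABORT: plan destroys or replaces Spectrum applications:" >&2
    echo "$hits" >&2
    return 1
  fi
  echo "no cloudflare_spectrum_application resources destroyed"
}
```

Running `check_plan prod-2169.tfplan` after the targeted plan gives a hard stop before the apply step if a Spectrum application would be removed.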

The validated plan is applied via tf apply prod-2169.tfplan.
- Once this has been executed, a rollback requires rebuilding nodes.

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

Validate ssh still works: ssh -T git@gitlab.com
Validate altssh still works: ssh -T -p443 git@altssh.gitlab.com
Validate https://gitlab.com still opens in a browser.
Delete the state backup: rm /tmp/backup.tfstate
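The three connectivity checks above could be run together so that a single failure is hard to miss. This is a sketch under assumptions: the SSH key of whoever runs it is registered on GitLab.com, and `curl` stands in for the manual browser check, which it does not fully replace. The helper names `check` and `post_change_checks` are hypothetical.

```shell
check() {
  # $1 = label, remaining arguments = command to run; its output is discarded.
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

post_change_checks() {
  rc=0
  check "ssh gitlab.com"        ssh -T git@gitlab.com                     || rc=1
  check "ssh altssh.gitlab.com" ssh -T -p443 git@altssh.gitlab.com        || rc=1
  check "https gitlab.com"      curl -fsS -o /dev/null https://gitlab.com || rc=1
  return "$rc"
}
```

`post_change_checks` prints one line per check and returns non-zero if any of them failed, which makes it usable as a gate before deleting the state backup.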

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60 minutes

Revert the merge request.
tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]'
tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]'
tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]'
tf state mv 'module.gprd-dns-record.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]' 'module.gcp-tcp-lb-altssh.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]'
tf plan -target=module.gcp-tcp-lb-altssh-spectrum -target=module.gprd-dns-record -target=module.gcp-tcp-lb-altssh -target=module.gcp-tcp-lb-spectrum -target=module.gcp-tcp-lb -target=module.fe-lb-altssh -out prod-2169-revert.tfplan
tf apply prod-2169-revert.tfplan
Observe recreation of nodes.
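Since the expectations state that the state moves can be undone by swapping the arguments to tf state mv, one list of address pairs can drive both directions. The sketch below assumes a tab-separated list of "from" and "to" addresses on stdin; that format, the `move_pairs` name, and the `DRY_RUN` switch are made up for illustration.

```shell
# Sketch only. TF defaults to the `tf` wrapper used in this issue.
TF="${TF:-tf}"

# Reads "from<TAB>to" resource address pairs on stdin and moves each one.
# $1 = forward | reverse. Set DRY_RUN=1 to print the commands instead of running them.
move_pairs() {
  direction="$1"
  while IFS="$(printf '\t')" read -r from to; do
    [ -n "$from" ] || continue
    if [ "$direction" = "reverse" ]; then
      tmp="$from"; from="$to"; to="$tmp"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf "%s state mv '%s' '%s'\n" "$TF" "$from" "$to"
    else
      # Detach stdin so the command cannot swallow the remaining pairs.
      "$TF" state mv "$from" "$to" </dev/null || return 1
    fi
  done
}
```

With the four pairs from the rollback steps saved in a hypothetical pairs.tsv, `DRY_RUN=1 move_pairs forward < pairs.tsv` prints the commands for review before anything touches the state. The loop stops at the first failed move, so a partial run can be inspected before continuing.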

Monitoring

Key metrics to observe

Metric: HAProxy Frontend Responses
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?viewPanel=1247&orgId=1&refresh=30s
- What changes to this metric should prompt a rollback: A sudden, sustained drop should be investigated.
  - There are no dedicated altssh metrics anymore, as altssh already feeds into the regular ssh traffic.
  - If there are alerts, those most likely relate to metrics collected by the HAProxy nodes.

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Changes:

delete module.fe-lb-altssh, module.gcp-tcp-lb-altssh and module.gcp-tcp-lb-altssh-spectrum
move definition of the altssh.gitlab.com spectrum app to module.gprd-dns-record
move definition of the gitlab.com spectrum app to module.gcp-tcp-lb-spectrum
remove variables relating to altssh (except DNS related)

Impact:

Remove fe-lb-altssh nodes and all related infrastructure (disk, firewall, health-check, instance groups, subnet, GCP TCP loadbalancers gcp-tcp-lb-altssh and gcp-tcp-lb-altssh-spectrum)
Remove the already defunct gcp-tcp-lb and all related infrastructure except nodes (firewall, health-check, target pool). The nodes are already assigned to gcp-tcp-lb-spectrum.
Keep the altssh.gitlab.com spectrum app defined (pointing to gcp-tcp-lb-spectrum), and keep having traffic re-routed from port 443 to port 22 in Cloudflare Spectrum.
Keep the gitlab.com spectrum app defined (managed by gcp-tcp-lb-spectrum), and keep having traffic routed through Cloudflare Spectrum.

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue. => https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2281
A dry-run has been conducted and results noted in a comment on this issue. => Not applicable
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

Edited Jun 02, 2021 by Hendrik Meyer (xLabber)