Cleanup terraform around altssh.gitlab.com
Production Change
Change Summary
Currently we are running altssh.gitlab.com via a port remapping in Cloudflare. While terraform does represent this accurately, it is currently convoluted, with indirections and module-level overrides that have been left in since the migration.
This change will remove those indirections to make the terraform code easy to comprehend, and will also remove the unused altssh HAProxy nodes. This is one change, as the two are interwoven and safer to apply together.
Change Details
- Services Impacted - ServiceGit ServiceCloudflare
- Change Technician - @T4cC0re
- Change Criticality - C1
- Change Type - changescheduled
- Change Reviewer - @cmcfarland
- Due Date - 2021-06-02 1700 UTC
- Time tracking - 80 minutes (+ 60 minutes rollback)
- Downtime Component - No downtime expected. Decommissioned nodes are not in traffic and reconfiguration only affects terraform state. Potential to cause long-lasting, full outage of GitLab.com if steps are not carefully followed.
There is an emergency comms channel in Slack: #production-2169
Detailed steps for the change
Expectations and assumptions of the change (sourced from the MR)
- We should have a PCL in effect while rolling this out
- Prevent accidental deploys and tf applies
- !! A tf apply of this without state manipulations WILL BRING DOWN GITLAB.COM !!
- With the state manipulation and assumptions below in effect, this is safe to apply and does not re-configure active traffic paths.
- Back up the terraform state locally. Delete it after the change and don't upload it, as it contains credentials.
- Moving the resources implicitly proves that the resources of this MR are not in use, provided the output matches the plan in the description.
- Before the changes are applied, everything can be undone by reversing the order of arguments to tf state mv.
- If there are more deletions or additional creations on the plan from 3, this should be investigated.
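The expectation that no live Spectrum app is deleted can also be checked mechanically. The following is a minimal sketch, not part of the reviewed procedure. It assumes the human-readable plan text marks affected resources with header lines of the form "# <address> will be destroyed" or "# <address> must be replaced", and that tf wraps terraform. The function name spectrum_apps_at_risk is hypothetical.

```shell
# Hypothetical helper: read plan text on stdin and print every
# cloudflare_spectrum_application that the plan would destroy or replace.
# Assumes Terraform's usual "# <address> will be destroyed" and
# "# <address> must be replaced" header lines in the plan output.
spectrum_apps_at_risk() {
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -E '^[[:space:]]*# .*cloudflare_spectrum_application\..* (will be destroyed|must be replaced)' || true
}

# Possible usage (assumes `tf` wraps terraform and the plan file exists):
#   tf show -no-color prod-2169.tfplan | spectrum_apps_at_risk
```

Any output from the helper means the plan touches a Spectrum app and should be investigated before applying.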
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 15 minutes
Merge request: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2286
- Make sure the MR is on commit affad96057ea1c9e5eeedc52113441b3851ec5eb. This commit was reviewed at length and matched the expectations set in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2286#note_99244
- Have the MR approved by the reviewer. (Intentionally not done prior to the change, to prevent accidental merges.)
- Ensure a hard production change lock is in place before starting. This is a delicate change and must not run concurrently with any other change.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
Pipes: [ 0/5000]
Connections: [ 0/20000]
- Confirm there is no ongoing deployment or other infrastructure related work happening
- Validate altssh works: ssh -T -p443 git@altssh.gitlab.com
- Validate that the HAProxy processes have no client connections on each machine, as shown above:
  - ssh -tt fe-altssh-01-lb-gprd.c.gitlab-production.internal sudo hatop -s /var/run/haproxy/admin.sock
  - ssh -tt fe-altssh-02-lb-gprd.c.gitlab-production.internal sudo hatop -s /var/run/haproxy/admin.sock
  - ssh -tt fe-altssh-03-lb-gprd.c.gitlab-production.internal sudo hatop -s /var/run/haproxy/admin.sock
  - This will prove that the nodes in question are not receiving any traffic and that the currently configured port-mapping in Cloudflare is in effect.
  - If this shows a connection count > 0, abort.
- Merge the MR into master.
  - !From this point on an untargeted tf apply WILL BRING DOWN GITLAB.COM!
- Update the local checkout of the repository and enter the environments/gprd directory.
- Update all terraform modules via tf init -upgrade.
- Run tf refresh to ensure all terraform resources are up to date.
- Back up the terraform state locally (tf state pull > /tmp/backup.tfstate). !Don't upload, as it contains credentials!
- Move the altssh.gitlab.com spectrum app state to the new module:
  - tf state mv 'module.gcp-tcp-lb-altssh.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]' 'module.gprd-dns-record.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]'
- Move the gitlab.com spectrum app states to the proper loadbalancer module:
  - tf state mv 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]' 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]'
  - tf state mv 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]' 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]'
  - tf state mv 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]' 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]'
  - At this point we have moved the state. This can still be easily rolled back.
- Run a targeted tf plan: tf plan -target=module.gcp-tcp-lb-altssh-spectrum -target=module.gprd-dns-record -target=module.gcp-tcp-lb-altssh -target=module.gcp-tcp-lb-spectrum -target=module.gcp-tcp-lb -target=module.fe-lb-altssh -out prod-2169.tfplan
  - The output should match this comment
  - output matches
  - In particular, there are no cloudflare_spectrum_application resources destroyed.
- The validated plan is applied via tf apply prod-2169.tfplan.
  - If this was executed, a rollback requires rebuilding nodes.
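The abort condition of the HAProxy connection check in the steps above can be expressed as a small helper. This is a sketch only. Because hatop is interactive, it assumes the operator copies the Connections line from the hatop header, in the format shown at the top of this section, and pipes it in. The function names conn_count and is_idle are hypothetical.

```shell
# Hypothetical helper: extract the current connection count from a hatop
# header line such as "Connections: [ 0/20000]" read on stdin.
conn_count() {
  sed -n 's|^Connections: *\[ *\([0-9][0-9]*\)/[0-9][0-9]*].*|\1|p'
}

# Succeeds only if the line was parsed and the count is exactly zero.
# An unparseable line is treated as "not idle", which means: abort.
is_idle() {
  count=$(conn_count)
  [ -n "$count" ] && [ "$count" -eq 0 ]
}

# Possible usage with a line copied from hatop:
#   printf 'Connections: [ 0/20000]\n' | is_idle || echo "abort: node has client connections"
```

Treating unparseable input as a failure is a deliberate choice in this sketch, so that a copy-paste mistake cannot be read as an idle node.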
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Validate ssh still works: ssh -T git@gitlab.com
- Validate altssh still works: ssh -T -p443 git@altssh.gitlab.com
- Validate https://gitlab.com still opens in a browser.
- Delete the state backup: rm /tmp/backup.tfstate
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 60 minutes
- Revert the merge request.
- tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/22"]'
- tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/80"]'
- tf state mv 'module.gcp-tcp-lb-spectrum.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]' 'module.gcp-tcp-lb.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["gitlab.com_tcp/443"]'
- tf state mv 'module.gprd-dns-record.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]' 'module.gcp-tcp-lb-altssh.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]'
- tf plan -target=module.gcp-tcp-lb-altssh-spectrum -target=module.gprd-dns-record -target=module.gcp-tcp-lb-altssh -target=module.gcp-tcp-lb-spectrum -target=module.gcp-tcp-lb -target=module.fe-lb-altssh -out prod-2169-revert.tfplan
- tf apply prod-2169-revert.tfplan
- Observe recreation of nodes.
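Each rollback move above is the corresponding forward move with its two arguments swapped. As a minimal sketch of that relationship, the hypothetical helper below only prints the reversing command and never touches state itself. tf is the same wrapper used throughout this issue.

```shell
# Hypothetical helper: given the SOURCE and DESTINATION addresses of a
# forward `tf state mv`, print the command that undoes it.
# It only prints; it does not run tf or modify any state.
reverse_state_mv() {
  src=$1
  dst=$2
  printf "tf state mv '%s' '%s'\n" "$dst" "$src"
}

# Possible usage, printing the rollback for the altssh.gitlab.com move:
#   reverse_state_mv \
#     'module.gcp-tcp-lb-altssh.module.dns_record_external.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]' \
#     'module.gprd-dns-record.module.a.cloudflare_spectrum_application.a_aaaa["altssh.gitlab.com_tcp/443"]'
```

Printing rather than executing keeps the operator in control of every state manipulation, which matches the careful, step-by-step nature of this change.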
Monitoring
Key metrics to observe
- Metric: HAProxy Frontend Responses
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?viewPanel=1247&orgId=1&refresh=30s
- What changes to this metric should prompt a rollback: A sudden - sustained - drop should be investigated.
- There are no dedicated altssh metrics anymore, as altssh already feeds into the regular ssh traffic.
- If there are alerts, those most likely relate to metrics collected by the HAProxy nodes.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes:
- delete module.fe-lb-altssh, module.gcp-tcp-lb-altssh and module.gcp-tcp-lb-altssh-spectrum
- move definition of the altssh.gitlab.com spectrum app to module.gprd-dns-record
- move definition of the gitlab.com spectrum app to module.gcp-tcp-lb-spectrum
- remove variables relating to altssh (except DNS related)
Impact:
- Remove fe-lb-altssh nodes and all related infrastructure (disk, firewall, health-check, instance groups, subnet, GCP TCP loadbalancers gcp-tcp-lb-altssh and gcp-tcp-lb-altssh-spectrum)
- Remove already defunct gcp-tcp-lb and all related infrastructure **except nodes** (firewall, health-check, target pool). The nodes are already assigned to gcp-tcp-lb-spectrum.
- Keep the altssh.gitlab.com spectrum app defined (pointing to gcp-tcp-lb-spectrum), and keep having traffic re-routed from port 443 to port 22 in Cloudflare Spectrum.
- Keep the gitlab.com spectrum app defined (managed by gcp-tcp-lb-spectrum), and keep having traffic routed through Cloudflare Spectrum.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue. => https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2281
- A dry-run has been conducted and results noted in a comment on this issue. => Not applicable
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.