feat: Use create_before_destroy=true on EKS nodegroups

Craig Miskell requested to merge create_before_destroy_nodegroups into main

What does this MR do?

Sets the create_before_destroy lifecycle rule to true on EKS node groups

Why

When certain changes, such as increasing disk size, are made to a node group that isn't using a launch template (in GET, selected by providing a custom AMI), Terraform must destroy and re-create the node group. This causes a complete system outage because the pods that ran on that node group have nowhere to go; in practice, I've seen outages of 10+ minutes (8+ minutes to destroy the node group, 2+ minutes to create the new one).

If create_before_destroy is set to true, a new node group is created first, with all the labels it needs to take on the required workloads. When the existing node group is then destroyed, EKS does a proper cordon/drain, and workloads are rescheduled onto the new node group in as clean a fashion as can ever be hoped for. It's also slightly faster (the destroy is quicker, perhaps because EKS can offload pods to the new nodes rather than having to wait for a while before tearing them down hard).

create_before_destroy does need care in some scenarios, but node groups are a classic example of where it's safe: each node group has a unique name (timestamp-based suffix), so two or more of "the same" can exist side by side, which is exactly what helps us here.
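As a rough illustration of the change described above, here is a minimal sketch of a node group with the lifecycle rule set. It is not GET's actual module code: the resource name, variable names, and sizes are hypothetical, and node_group_name_prefix is used only as a stand-in for however the unique, suffixed name is really generated.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }
variable "disk_size" {
  type    = number
  default = 100
}

resource "aws_eks_node_group" "example" {
  cluster_name  = var.cluster_name
  node_role_arn = var.node_role_arn
  subnet_ids    = var.subnet_ids

  # Assumption: a prefix lets the provider append a unique suffix, so the
  # old and new node groups can exist at the same time without a name clash.
  node_group_name_prefix = "example-"

  # Changing disk_size forces replacement of the node group.
  disk_size = var.disk_size

  scaling_config {
    desired_size = 2
    max_size     = 2
    min_size     = 2
  }

  lifecycle {
    # Create the replacement node group before destroying the old one,
    # so pods have somewhere to go during the cordon/drain.
    create_before_destroy = true
  }
}
```

With this rule in place, a plan that changes disk_size should still report the node group as needing replacement, but the apply creates the new one first and only then destroys the old one.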

Related issues

Closes #674 (closed)

Author's checklist

When ready for review, the Author applies the workflow::ready for review label and mentions @gl-quality/get-maintainers:

  • Merge request:
    • Corresponding Issue raised and reviewed by the GET maintainers team.
    • Merge Request Title and Description are up-to-date, accurate, and descriptive
    • MR targeting the appropriate branch
    • MR has a green pipeline
    • MR has no new security alerts in the widget from the Secret Detection and IaC Scan (SAST) jobs.
  • Code:
    • Check the area changed works as expected. Consider testing it in different environment sizes (1k, 3k, 10k, etc.).
    • Documentation created/updated in the same MR.
    • If this MR adds an optional configuration - check that all permutations continue to work.
    • For Terraform changes: set up a previous version environment, then run a terraform plan with your new changes and ensure nothing will be destroyed. If anything will be destroyed and this can't be avoided please add a comment to the current MR.
  • Create any follow-up issue(s) to support the new feature across other supported cloud providers or advanced configurations. Create 1 issue for each provider/configuration. Contact the Quality Enablement team if unsure.
Edited by Craig Miskell