
Refactor AWS EKS Node Pools with Launch Templates and IRSA enabled

Grant Young requested to merge gy-eks-rebuild-irsa-launch-templates into main

What does this MR do?

This MR implements a top-to-bottom refactor of the AWS EKS Node Pool design as follows:

  • Application credentials for AWS services are switched from Node Instance Metadata to IAM Roles for Service Accounts (IRSA), following best practice
  • The Node Instance Metadata Service has been switched to enforce IMDSv2, as recommended
  • EKS Node Pools are now always based on a launch template to bring in much-needed flexibility, including the above IMDSv2 switch (see the sketch after this list)
  • IAM policies are refreshed to follow the latest recommendations
  • Cluster Autoscaler permissions are updated to allow it to scale node pools to 0 (on EKS 1.24+)
  • Several other pieces of long-standing housekeeping
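
For illustration, the sketch below shows the general shape of an EKS node pool backed by a launch template in Terraform. The resource names, instance type, and referenced role/cluster/subnet resources are assumptions for the example, not the actual module code in this MR; the IMDSv2 settings themselves are shown in the IMDSv2 section further down.

```hcl
# Illustrative sketch only - resource names, sizes, and referenced resources
# (aws_eks_cluster.gitlab, aws_iam_role.eks_node, var.subnet_ids) are assumptions.
resource "aws_launch_template" "gitlab_node_pool" {
  name_prefix   = "gitlab-node-pool-"
  instance_type = "c5.2xlarge"

  # metadata_options (IMDSv2 enforcement) omitted here - see the IMDSv2 section below

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "gitlab-node-pool"
    }
  }
}

resource "aws_eks_node_group" "gitlab_node_pool" {
  cluster_name    = aws_eks_cluster.gitlab.name
  node_group_name = "gitlab-node-pool"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = var.subnet_ids

  # A min_size of 0 lets Cluster Autoscaler scale the pool down completely (EKS 1.24+)
  scaling_config {
    min_size     = 0
    desired_size = 2
    max_size     = 4
  }

  # The node group is always driven by a launch template, which is what allows
  # settings such as IMDSv2 enforcement to be applied consistently
  launch_template {
    id      = aws_launch_template.gitlab_node_pool.id
    version = aws_launch_template.gitlab_node_pool.latest_version
  }
}
```

Scaling to 0 additionally relies on Cluster Autoscaler having the relevant EKS describe permissions (typically eks:DescribeNodegroup on 1.24+), which is part of the permissions refresh referred to above.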

Notes about IRSA

As noted elsewhere, Terraform is heavily limited when it comes to dealing with Kubernetes-level config, so we handle that in Ansible. As a result, several "assumptions" must be shared between Terraform and Ansible, as follows:

  • IAM Role names and ARNs (created in Terraform; the ARN is then assumed in Ansible by generating it from the role name)
  • Kubernetes namespaces and service account names

To stick with YAGNI, this area has been locked down somewhat: users aren't given options to change the application IAM role names or Kubernetes service account names. Since hooks were already added in Ansible for the Kubernetes namespace, the same hook has been added in Terraform as a reasonable compromise. This matters because opening up all of the potential options here would likely lead to death by a thousand cuts, as everything needs to be aligned exactly (and this is hard to debug in AWS EKS). At this time it's assumed this shouldn't affect most, if any, users, but it's nonetheless not ideal. However, IRSA is very much the best practice here in terms of security, so on balance this is seen as the best way forward.
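
As a rough sketch of the alignment described above, an IRSA role's trust policy has to reference the exact Kubernetes namespace and service account name. The names below (the gitlab namespace, gitlab-registry service account, and gitlab-registry-irsa role) and the OIDC provider resource are assumptions for illustration, not the actual names used by this MR:

```hcl
# Illustrative sketch only - names and the OIDC provider resource are assumptions.
data "aws_iam_policy_document" "registry_irsa_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }

    # This must match the namespace and service account exactly - the "assumption"
    # that Terraform and Ansible have to agree on.
    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:gitlab:gitlab-registry"]
    }
  }
}

resource "aws_iam_role" "gitlab_registry_irsa" {
  name               = "gitlab-registry-irsa"
  assume_role_policy = data.aws_iam_policy_document.registry_irsa_assume.json
}
```

Ansible (or the chart values it manages) then needs to annotate the matching service account with eks.amazonaws.com/role-arn pointing at this role's ARN, which is why the role and service account names can't be opened up freely without everything drifting out of alignment.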

Notes about IMDSv2

IMDSv2 is enforced here because all of the touch points between GitLab and AWS are now handled via IRSA. As a result the pods no longer need access to IMDS, so it is locked down accordingly: a hop limit of 1 blocks pods from receiving any IMDS responses, as the extra network hop into the pod network namespace exceeds the limit.
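
As a sketch of the relevant launch template settings (the resource name here is an assumption), IMDSv2 enforcement and the hop limit look like this in Terraform:

```hcl
resource "aws_launch_template" "node_pool" {
  name_prefix = "gitlab-eks-node-"

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # IMDSv2 only - token-less (IMDSv1) requests are rejected
    http_put_response_hop_limit = 1          # responses don't survive the extra hop into pod network namespaces
  }
}
```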

Testing confirms this works as intended: S3 access for the Webservice / Sidekiq pods (normal), Registry (container registry), and Toolbox (backups) all works as expected. Cluster Autoscaler was also confirmed to work.

In the same vein, IMDSv2 is also enabled for Omnibus nodes, as the only touch point there is S3. This is confirmed to work today in smoke testing, since Fog (updated in 13.7) handles it. However, testing for Omnibus continues.

Notes about Upgrades

We've employed every lever here to try and make the upgrade process as seamless as possible, but we're beholden to AWS. Due to AWS limitations, the EKS Node Pools will be rebuilt as a result of this change. Furthermore, the switch to IRSA won't complete until Ansible has been run to update the Charts.

As a result, this change will require downtime. However, this MR is intended to be future proof and should hopefully be a long-term solution that avoids requiring this again.

Related issues

Closes #366, #420, #699

Author's checklist

When ready for review, the Author applies the workflow::ready for review label and mentions @gl-quality/get-maintainers:

  • Merge request:
    • Corresponding Issue raised and reviewed by the GET maintainers team.
    • Merge Request Title and Description are up-to-date, accurate, and descriptive
    • MR targeting the appropriate branch
    • MR has a green pipeline
    • MR has no new security alerts in the widget from the Secret Detection and IaC Scan (SAST) jobs.
  • Code:
    • Check that the changed area works as expected. Consider testing it with different environment sizes (1k, 3k, 10k, etc.).
    • Documentation created/updated in the same MR.
    • If this MR adds an optional configuration - check that all permutations continue to work.
    • For Terraform changes: set up a previous version environment, then run a terraform plan with your new changes and ensure nothing will be destroyed. If anything will be destroyed and this can't be avoided please add a comment to the current MR.
  • Create any follow-up issue(s) to support the new feature across other supported cloud providers or advanced configurations. Create 1 issue for each provider/configuration. Contact the Quality Enablement team if unsure.
