fix: ensure provision works every time

Andrew Newdigate requested to merge fix-aws-hybrid-creation-timeouts into main

References https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/292

What does this MR do?

This is a workaround for the timeouts we experience when initially provisioning an AWS Hybrid EKS cluster:

│ Error: error waiting for EKS Node Group (itestfixconfigure:gitlab_webservice_pool) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ 	* i-0f6b83ff481f5a93e, i-04a040b3b9d75afa9, i-08e6e3dbf18d36b53: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster

│   with module.arch.aws_eks_node_group.gitlab_webservice_pool[0],
│   on ../get/terraform/modules/gitlab_ref_arch_aws/kubernetes.tf line 35, in resource "aws_eks_node_group" "gitlab_webservice_pool":
│   35: resource "aws_eks_node_group" "gitlab_webservice_pool" {

│ Error: error waiting for EKS Node Group (itestfixconfigure:gitlab_sidekiq_pool) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ 	* i-05fc9691725c66464, i-024c57c030b1013f3, i-0d240a07825393854: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster

│   with module.arch.aws_eks_node_group.gitlab_sidekiq_pool[0],
│   on ../get/terraform/modules/gitlab_ref_arch_aws/kubernetes.tf line 62, in resource "aws_eks_node_group" "gitlab_sidekiq_pool":
│   62: resource "aws_eks_node_group" "gitlab_sidekiq_pool" {

│ Error: error waiting for EKS Node Group (itestfixconfigure:gitlab_supporting_pool) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ 	* i-08defcea5d9721f19, i-03c1cbdb471995164, i-0e8b6740b6479f505: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster

│   with module.arch.aws_eks_node_group.gitlab_supporting_pool[0],
│   on ../get/terraform/modules/gitlab_ref_arch_aws/kubernetes.tf line 89, in resource "aws_eks_node_group" "gitlab_supporting_pool":
│   89: resource "aws_eks_node_group" "gitlab_supporting_pool" {

│ Error: unexpected EKS Add-On (itestfixconfigure:coredns) state returned during creation: timeout while waiting for state to become 'ACTIVE' (last state: 'DEGRADED', timeout: 20m0s)
│ [WARNING] Running terraform apply again will remove the kubernetes add-on and attempt to create it again effectively purging previous add-on configuration

│   with module.arch.aws_eks_addon.coredns[0],
│   on ../get/terraform/modules/gitlab_ref_arch_aws/kubernetes.tf line 189, in resource "aws_eks_addon" "coredns":
│  189: resource "aws_eks_addon" "coredns" {

│ Error: unexpected EKS Add-On (itestfixconfigure:vpc-cni) state returned during creation: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ 	* : ConfigurationConflict: Apply failed with 7 conflicts: conflicts with "kubectl-client-side-apply" using apps/v1:
│ - .spec.template.spec.containers[name="aws-node"].resources
│ - .spec.template.spec.containers[name="aws-node"].resources.requests
│ - .spec.template.spec.containers[name="aws-node"].resources.requests.cpu
│ - .spec.template.spec.containers[name="aws-node"].resources.requests
│ - .spec.template.spec.containers[name="aws-node"].resources.requests.cpu
│ - .spec.template.spec.containers[name="aws-node"].resources.requests.cpu
│ - .spec.template.spec.initContainers[name="aws-vpc-cni-init"].image

│ [WARNING] Running terraform apply again will remove the kubernetes add-on and attempt to create it again effectively purging previous add-on configuration

│   with module.arch.aws_eks_addon.vpc_cni[0],
│   on ../get/terraform/modules/gitlab_ref_arch_aws/kubernetes.tf line 198, in resource "aws_eks_addon" "vpc_cni":
│  198: resource "aws_eks_addon" "vpc_cni" {

How does it work?

There are implied dependencies between different parts of the EKS provisioning process, but these are not modelled in Terraform.

  1. The VPC CNI addon needs to run before the node pools are created
  2. The VPC CNI addon needs the aws_iam_openid_connect_provider to be configured
  3. The coredns addon needs to run after node pools are created

This MR adds explicit dependencies to the Terraform configuration so that everything happens in the right order.
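
A minimal sketch of that ordering (presumably expressed via depends_on or equivalent resource references) is shown below. The node group and add-on resource names come from the errors above; the cluster, OIDC provider, IAM role, and subnet references are hypothetical placeholders, and the count/[0] indexing used by the real module is omitted for brevity. The actual MR may wire the dependencies differently.

resource "aws_eks_addon" "vpc_cni" {
  cluster_name = aws_eks_cluster.gitlab_cluster.name
  addon_name   = "vpc-cni"

  # (2) The VPC CNI add-on needs the IAM OIDC provider to be configured first.
  depends_on = [aws_iam_openid_connect_provider.gitlab_cluster]
}

resource "aws_eks_node_group" "gitlab_webservice_pool" {
  cluster_name    = aws_eks_cluster.gitlab_cluster.name
  node_group_name = "gitlab_webservice_pool"
  node_role_arn   = aws_iam_role.gitlab_eks_node.arn
  subnet_ids      = aws_subnet.gitlab_vpc_subnet[*].id

  scaling_config {
    desired_size = 3
    max_size     = 3
    min_size     = 3
  }

  # (1) Nodes can only become healthy once the VPC CNI add-on is in place,
  # so create the add-on before the node pools (likewise for the sidekiq
  # and supporting pools).
  depends_on = [aws_eks_addon.vpc_cni]
}

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.gitlab_cluster.name
  addon_name   = "coredns"

  # (3) coredns pods need nodes to schedule onto, so wait for the node pools.
  depends_on = [
    aws_eks_node_group.gitlab_webservice_pool,
    aws_eks_node_group.gitlab_sidekiq_pool,
    aws_eks_node_group.gitlab_supporting_pool,
  ]
}

With the ordering explicit, Terraform no longer races the add-ons and node pools in parallel, which appears to be what produced the CREATE_FAILED and DEGRADED states and the 20-minute timeouts shown above.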

This also substantially speeds up the provisioning process.

Related issues

Author's checklist

When ready for review, the Author applies the workflow::ready for review label and mentions @gl-quality/get-maintainers:

  • Merge request:
    • Corresponding Issue raised and reviewed by the GET maintainers team.
    • Merge Request Title and Description are up to date, accurate, and descriptive
    • MR targeting the appropriate branch
    • MR has a green pipeline
  • Code:
Check that the changed area works as expected. Consider testing it in different environment sizes (1k, 3k, 10k, etc.).
    • Documentation created/updated in the same MR.
    • If this MR adds an optional configuration - check that all permutations continue to work.
For Terraform changes: set up an environment on the previous version, then run a terraform plan with your new changes and ensure nothing will be destroyed. If anything will be destroyed and this can't be avoided, please add a comment to the current MR.
  • Create any follow-up issue(s) to support the new feature across other supported cloud providers or advanced configurations. Create 1 issue for each provider/configuration. Contact the Quality Enablement team if unsure.
Edited by Andrew Newdigate
