Implement missing best practices on EKS 1.29 and document them

What is this about

In !3742 (merged) we introduced the connection to the EKS 1.29 cluster. We still need to improve our cluster infrastructure with the following, as recommended by @deriamis:

  • [Priority 1]: Set up node rebalancing. Having the autoscaler alone only configures the nodes to scale up, not to scale down.
    • We just need to add --balance-similar-node-groups to the autoscaler initialization command.
  • Migrate to role sessions for logging in to the EKS cluster, instead of using the usernames in the kube-system/aws-auth ConfigMap.
  • Set up Prometheus scrapers so that we collect Kubernetes metrics and send them to CloudWatch.
  • Set up a CloudWatch dashboard.
  • Decide how to handle alarms in CloudWatch. Lambdas could be useful to notify us on Slack.
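
To make the last item more concrete, here is a rough sketch of what a Slack-notifying Lambda could look like. It assumes the CloudWatch alarm publishes to an SNS topic that triggers the Lambda, and that Slack is reached through an incoming webhook. The SLACK_WEBHOOK_URL environment variable and the function names are hypothetical and are not part of the eks-cluster project today.

```python
import json
import os
import urllib.request


def build_slack_payload(sns_message: str) -> dict:
    """Turn the JSON body of a CloudWatch alarm SNS notification into a Slack message."""
    alarm = json.loads(sns_message)
    name = alarm.get("AlarmName", "unknown alarm")
    state = alarm.get("NewStateValue", "UNKNOWN")
    reason = alarm.get("NewStateReason", "")
    return {"text": f"[{state}] {name}: {reason}"}


def handler(event, context):
    """Lambda entry point: post one Slack message per SNS record."""
    # SLACK_WEBHOOK_URL is a hypothetical env var holding a Slack incoming-webhook URL.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    records = event.get("Records", [])
    for record in records:
        payload = build_slack_payload(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Wait for Slack to answer so that failures surface in the Lambda logs.
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return {"posted": len(records)}
```

Keeping the message formatting in its own function means it can be checked locally without AWS or Slack access. Whether we want this, or a different notification path, is still part of the decision above.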

We should try to implement these as part of https://gitlab.com/gitlab-org/distribution/infrastructure/eks-cluster wherever possible. If anything can't be implemented yet, we should at least document it so that we can follow the manual procedure to add it next time.
