Cluster provision flow issues - AWS
Summary
There are a couple of issues that can occur when using GitLab to provision an AWS EKS cluster.
The first is that Clusters::Aws::FinalizeCreationService can permanently fail to run if Clusters::Aws::VerifyProvisionStatusService takes longer than the temporary AWS STS token remains valid, or hits a transient error in its CloudFormation API calls.
The usual advice here is to delete the cluster and retry, but if every attempt exceeds the timeout, the cluster never finalizes and its state is left as "creating" permanently, even though the user can open their AWS console and see that the stack did eventually create successfully. This confuses users, who expect GitLab to detect the completed stack automatically.
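For context, the verification flow behaves roughly like the sketch below: a loop that polls the CloudFormation stack status and gives up after a fixed deadline. This is an illustrative sketch only, not the actual GitLab implementation; the method name, interval, and timeout values are assumptions.

```ruby
# Illustrative sketch only -- not the actual GitLab code. It shows the failure
# mode: polling is bounded by a deadline, and the STS credentials behind the
# CloudFormation client can expire before the stack reaches CREATE_COMPLETE.
require 'aws-sdk-cloudformation'

POLL_INTERVAL = 30       # seconds between status checks (illustrative)
TIMEOUT       = 30 * 60  # overall deadline (illustrative)

def wait_for_stack(client, stack_name)
  deadline = Time.now + TIMEOUT

  loop do
    status = client.describe_stacks(stack_name: stack_name)
                   .stacks.first.stack_status

    return true if status == 'CREATE_COMPLETE'
    raise "Stack failed: #{status}" if status.end_with?('FAILED')

    # If this deadline (or the STS token lifetime) is exceeded, the provider is
    # marked as errored even though CloudFormation may still finish the stack.
    raise 'Timed out waiting for stack creation' if Time.now > deadline

    sleep POLL_INTERVAL
  end
end
```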
The second issue is that a user may begin using the EKS cluster once they can see it is live, and may edit its aws-auth configmap to add more users, etc. However, when Clusters::Aws::FinalizeCreationService does eventually run, it expects that no aws-auth configmap pre-exists, and fails with an error if one does:
`Failed to run Kubeclient: configmaps "aws-auth" already exists`
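For reference, a bare create of the configmap raises a conflict if the resource already exists, whereas an update-or-create approach would be idempotent. The snippet below is a hypothetical sketch using the kubeclient gem (the helper name and the error-handling strategy are assumptions, not the GitLab service code):

```ruby
# Hypothetical sketch using the kubeclient gem -- not the GitLab service code.
require 'kubeclient'

def apply_aws_auth(kubeclient, config_map)
  # A plain create fails with HTTP 409 ("configmaps \"aws-auth\" already exists")
  # when a user has already created or edited the configmap on the cluster.
  kubeclient.create_config_map(config_map)
rescue Kubeclient::HttpError => e
  raise unless e.error_code == 409

  # Falling back to an update (or merging the existing entries) would make the
  # finalize step tolerate a pre-existing configmap instead of hard-failing.
  kubeclient.update_config_map(config_map)
end
```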
This second problem is rare and is likely a consequence of the first (which pushes users to start working with the provisioned EKS cluster directly).
While either issue persists, all Kubernetes monitoring is broken because no Clusters::Platforms::Kubernetes model exists (so there is no kubeclient), and the logs fill with the following error (a possible guard is sketched after the backtrace):
"exception.class": "NoMethodError",
"exception.message": "undefined method `kubeclient' for nil:NilClass",
"exception.backtrace": [
"app/models/clusters/cluster.rb:310:in `kubeclient'",
"lib/gitlab/metrics/instrumentation.rb:162:in `kubeclient'",
"lib/gitlab/kubernetes/node.rb:33:in `block in nodes_metrics_from_cluster'",
"lib/gitlab/kubernetes/kube_client.rb:112:in `graceful_request'",
"lib/gitlab/kubernetes/node.rb:51:in `graceful_request'",
"lib/gitlab/kubernetes/node.rb:33:in `nodes_metrics_from_cluster'",
"lib/gitlab/kubernetes/node.rb:14:in `all'",
"app/models/clusters/cluster.rb:258:in `calculate_reactive_cache'",
"lib/gitlab/metrics/instrumentation.rb:162:in `calculate_reactive_cache'",
"app/models/concerns/reactive_caching.rb:94:in `block (2 levels) in exclusively_update_reactive_cache!'",
Steps to reproduce
Reproducing this isn't straightforward, as it appears to depend on how quickly CloudFormation completes and on how often its API calls hit timeouts or other transient errors. It can be simulated, however, by injecting an error into Clusters::Aws::VerifyProvisionStatusService at the point where the stack status is checked in the interval loop (see the sketch below).
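One way to simulate the transient CloudFormation failure is to stub the API client so that describe_stacks raises a throttling error, using the aws-sdk-ruby response-stubbing facility. This is only a sketch of the idea, not an existing spec; how the stubbed client gets injected into the service is an assumption that depends on the GitLab version.

```ruby
# Sketch of simulating a transient CloudFormation error with aws-sdk-ruby
# response stubbing (not an existing GitLab spec).
require 'aws-sdk-cloudformation'

client = Aws::CloudFormation::Client.new(
  stub_responses: true,
  region: 'us-east-1'
)

# Every describe_stacks call now raises Aws::CloudFormation::Errors::Throttling,
# mimicking the intermittent API errors seen during the status-check loop.
client.stub_responses(:describe_stacks, 'Throttling')

# Injecting this client (or an equivalent stub) into
# Clusters::Aws::VerifyProvisionStatusService makes the interval loop error out
# before the stack ever reports CREATE_COMPLETE.
```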
Example Project
What is the current bug behavior?
The cluster fails to complete provisioning in GitLab, even though the CloudFormation backend eventually finishes creating the stack.
What is the expected correct behavior?
Cluster creation should keep rechecking the provisioning status, without a timeout failure, until the cluster is explicitly deleted. Alternatively, provide a way to manually trigger a recheck even after the timeout has been exceeded (a possible console workaround is sketched below).
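As an interim workaround, it may be possible to re-run the verification or finalization from a Rails console once the stack shows CREATE_COMPLETE in AWS. The sketch below assumes both services accept the cluster's AWS provider record via #execute, which should be confirmed against the GitLab version in use; CLUSTER_ID is a placeholder.

```ruby
# Rails console sketch (assumption: both services accept the AWS provider
# record via #execute -- verify against your GitLab version before running).
cluster  = Clusters::Cluster.find(CLUSTER_ID)  # CLUSTER_ID: placeholder
provider = cluster.provider                    # the AWS provider record

# Re-check the CloudFormation stack status and, if it has completed,
# kick off the finalization step that creates the Kubernetes platform.
Clusters::Aws::VerifyProvisionStatusService.new.execute(provider)
# or, if the stack is already CREATE_COMPLETE:
Clusters::Aws::FinalizeCreationService.new.execute(provider)
```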
Relevant logs and/or screenshots
Customer ticket at https://gitlab.zendesk.com/agent/tickets/201157 (internal) has some additional screenshots and logs for reference.
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)
(We will only investigate if the tests are passing.)