Race condition in fetching Kubernetes token causes missing `$KUBECONFIG`
Summary
This likely explains why many people are missing $KUBECONFIG
on group clusters in https://gitlab.com/gitlab-org/gitlab-ce/issues/55362 and similar issues.
This method https://gitlab.com/gitlab-org/gitlab-ee/blob/f6a5dda99a6986a266797ce518d516330a2d448b/app/services/clusters/gcp/kubernetes/create_or_update_namespace_service.rb#L13 has a race condition.
Basically, every call to create_project_service_account
triggers an async worker in Kubernetes that recreates the Secret
for the service account. If the time between create_project_service_account
and configure_kubernetes_token
is too short, the token will be nil
and we will persist the record with a missing token. Retrying the worker doesn't fix the problem either, because it triggers create_project_service_account again,
which seems to clear out the secret and create a new token.
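The race can be sketched with a small, self-contained simulation (the class and method names here are illustrative stand-ins, not GitLab or Kubernetes APIs): the secret's token only becomes readable some time after the service account is (re)created, so an immediate fetch returns nil while a later fetch succeeds.

```ruby
# Hypothetical sketch of the race condition. FakeKubernetes stands in for
# the cluster: after create_service_account, the token only becomes
# available once token_delay seconds have elapsed (simulating the async
# worker that populates the Secret).
class FakeKubernetes
  def initialize(token_delay:)
    @created_at = nil
    @token_delay = token_delay
  end

  # Recreates the service account; the token is not available immediately.
  def create_service_account
    @created_at = Time.now
  end

  # Returns the token, or nil if the async population hasn't finished yet.
  def fetch_token
    return nil if @created_at.nil? || (Time.now - @created_at) < @token_delay
    'secret-token'
  end
end

kube = FakeKubernetes.new(token_delay: 0.2)
kube.create_service_account
immediate = kube.fetch_token # fetched too soon: nil, and nil gets persisted
sleep 0.3
later = kube.fetch_token     # by now the token exists
```

With a low-latency connection the `fetch_token` call lands in the nil window, which is exactly the scenario described above.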
Steps to reproduce
I've not been able to reproduce this locally, but we reproduced it with a customer. The likely reason we don't see this often during development is the latency between GitLab and Kubernetes: you probably need a very low-latency connection to hit this race condition; otherwise Kubernetes will have finished creating the token before the second request arrives.
Example Project
What is the current bug behavior?
What is the expected correct behavior?
Possible fixes
We may want to add a retry mechanism around just the fetch_service_account_token
call. For example:
```ruby
def fetch_service_account_token
  3.times do
    token = Clusters::Gcp::Kubernetes::FetchKubernetesTokenService.new(
      platform.kubeclient,
      kubernetes_namespace.token_name,
      kubernetes_namespace.namespace
    ).execute

    return token if token

    sleep 2
  end

  nil # all attempts failed; without this, `3.times` would make the method return 3
end
```
I believe this would fix the problem. It may not be the neatest approach, but it's illustrative.
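A slightly more general shape for the same idea is a retry helper with backoff; the helper below (`retry_until_present` is a hypothetical name, not existing GitLab code) retries a block until it yields a truthy value, which is the behavior the fix above hardcodes:

```ruby
# Hypothetical retry helper: calls the block up to `attempts` times,
# returning the first truthy result, sleeping with a doubling delay
# between attempts, and returning nil if every attempt fails.
def retry_until_present(attempts: 3, base_delay: 0.01)
  attempts.times do |i|
    result = yield
    return result if result

    sleep(base_delay * (2**i)) # 0.01s, 0.02s, 0.04s, ...
  end

  nil
end

# Usage sketch: the block returns nil twice (token not ready yet),
# then succeeds on the third call.
calls = 0
token = retry_until_present(attempts: 5) do
  calls += 1
  calls >= 3 ? 'token-value' : nil
end
```

Wrapping the FetchKubernetesTokenService call in something like this keeps the retry policy in one place instead of inlining the loop into fetch_service_account_token.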