Race condition in fetching Kubernetes token causes missing `$KUBECONFIG`
Summary
This likely explains why many people are missing $KUBECONFIG
on group clusters in https://gitlab.com/gitlab-org/gitlab-ce/issues/55362 and similar issues.
This method https://gitlab.com/gitlab-org/gitlab-ee/blob/f6a5dda99a6986a266797ce518d516330a2d448b/app/services/clusters/gcp/kubernetes/create_or_update_namespace_service.rb#L13 has a race condition.
Basically, every call to create_project_service_account
triggers an async worker in Kubernetes that recreates the Secret
for the service account. If the time between create_project_service_account
and configure_kubernetes_token
is too short, the token will be nil
and we will persist the record with a missing token. Retrying the worker doesn't fix the problem either, because it triggers create_project_service_account again,
which seems to clear out the secret and create a new token.
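The race can be sketched with a small, self-contained simulation (the class and method names here are illustrative stand-ins, not GitLab or Kubernetes APIs): the secret's token only becomes readable some time after the service account is (re)created, so an immediate fetch returns nil while a later fetch succeeds.

```ruby
# Hypothetical sketch of the race condition. FakeKubernetes stands in for
# the cluster: after create_service_account, the token only becomes
# available once token_delay seconds have elapsed (simulating the async
# worker that populates the Secret).
class FakeKubernetes
  def initialize(token_delay:)
    @created_at = nil
    @token_delay = token_delay
  end

  # Recreates the service account; the token is not available immediately.
  def create_service_account
    @created_at = Time.now
  end

  # Returns the token, or nil if the async population hasn't finished yet.
  def fetch_token
    return nil if @created_at.nil? || (Time.now - @created_at) < @token_delay
    'secret-token'
  end
end

kube = FakeKubernetes.new(token_delay: 0.2)
kube.create_service_account
immediate = kube.fetch_token # fetched too soon: nil, and nil gets persisted
sleep 0.3
later = kube.fetch_token     # by now the token exists
```

With a low-latency connection the `fetch_token` call lands in the nil window, which is exactly the scenario described above.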
Steps to reproduce
I've not been able to reproduce this locally, but we reproduced it with a customer. The likely reason we don't see this often during development is the latency between GitLab and Kubernetes: you probably need a very low-latency connection to hit this race condition; otherwise Kubernetes will have finished creating the token before the second request arrives.
Example Project
What is the current bug behavior?
What is the expected correct behavior?
Possible fixes
We may want to add a retry mechanism around just the fetch_service_account_token
call. For example:
```ruby
def fetch_service_account_token
  3.times do
    token = Clusters::Gcp::Kubernetes::FetchKubernetesTokenService.new(
      platform.kubeclient,
      kubernetes_namespace.token_name,
      kubernetes_namespace.namespace
    ).execute

    return token if token

    sleep 2
  end

  nil # all attempts failed; without this, `3.times` would make the method return 3
end
```
I believe this would fix the problem. It may not be the neatest approach, but it's illustrative.
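A slightly more general shape for the same idea is a retry helper with backoff; the helper below (`retry_until_present` is a hypothetical name, not existing GitLab code) retries a block until it yields a truthy value, which is the behavior the fix above hardcodes:

```ruby
# Hypothetical retry helper: calls the block up to `attempts` times,
# returning the first truthy result, sleeping with a doubling delay
# between attempts, and returning nil if every attempt fails.
def retry_until_present(attempts: 3, base_delay: 0.01)
  attempts.times do |i|
    result = yield
    return result if result

    sleep(base_delay * (2**i)) # 0.01s, 0.02s, 0.04s, ...
  end

  nil
end

# Usage sketch: the block returns nil twice (token not ready yet),
# then succeeds on the third call.
calls = 0
token = retry_until_present(attempts: 5) do
  calls += 1
  calls >= 3 ? 'token-value' : nil
end
```

Wrapping the FetchKubernetesTokenService call in something like this keeps the retry policy in one place instead of inlining the loop into fetch_service_account_token.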