Possible race condition when checking application install completed
### Summary
When installing a cluster application, the installation eventually fails with:

```
Kubernetes error: 404
```
### Steps to reproduce

(As this is a race condition, it is not currently possible to reproduce it reliably.)

- Project > Kubernetes > Create new cluster
- Install Helm
- Install another application
### What is the current bug behavior?

The application is installed and the `install-<application>` pod is deleted. Another instance of the worker then comes along and fails with a 404, because the `install-<application>` pod is now gone.
### What is the expected correct behavior?

The application stays installed.
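The race can be sketched with a minimal, self-contained Ruby simulation (all names are hypothetical; this is not the actual service code):

```ruby
# Simulated shared state: the Kubernetes pods and the application record.
NotFound = Class.new(StandardError)

def check_installation!(pods, app)
  # Look up the install pod, as the progress-check worker does.
  phase = pods.fetch("install-runner") { raise NotFound, 'pods "install-runner" not found' }
  if phase == :succeeded
    app[:status] = :installed     # status 3 in the log
    pods.delete("install-runner") # cleanup removes the install pod
  end
rescue NotFound
  app[:status] = :update_errored  # status 6 in the log
  app[:status_reason] = "Kubernetes error: 404"
end

pods = { "install-runner" => :succeeded }
app  = { status: :installing, status_reason: nil }

check_installation!(pods, app)
status_after_first = app[:status]  # first worker instance: install succeeded, pod deleted

check_installation!(pods, app)
status_after_second = app[:status] # second instance: pod already gone -> 404
```

Run on its own, the second call flips the already-installed application to `update_errored`, which matches the status 3 → status 6 `UPDATE` statements in `development.log` below.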
### Relevant logs and/or screenshots

`kubernetes.log` (local):

```
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
```
`development.log` (local). Note that the runner application was first updated to status 3 (`installed`), then updated to status 6 (`update_errored`):
```
SQL (0.3ms) UPDATE "clusters_applications_runners" SET "status" = $1, "updated_at" = $2 WHERE "clusters_applications_runners"."id" = $3 [["status", 3], ["updated_at", "2019-01-31 09:03:00.186734"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:26
(5.7ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:26
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
(1.7ms) SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:20
(3.0ms) SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:20
(1.1ms) SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:21
(1.3ms) SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:21
(1.4ms) BEGIN
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.2ms) BEGIN
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
SQL (1.0ms) UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4 [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.208956"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.4ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
SQL (2.3ms) UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4 [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.210109"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.4ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
```
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/648968/
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/616804/
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/590672/
The Sentry errors I looked at are all of the form below (`install-prometheus`, `install-runner`, `install-helm`, etc.):
```
Kubeclient::ResourceNotFoundError: pods "install-prometheus" not found
```
They all occur on the 3rd retry:

```
queue: gcp_cluster:cluster_wait_for_app_installation,
queue_namespace: gcp_cluster,
retry: 3
```
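Sidekiq's `retry: 3` means one initial run plus up to three retries before the failure is surfaced (e.g. to Sentry), which would explain every sample failing on the 3rd retry. A rough, self-contained model (hypothetical names, not Sidekiq itself):

```ruby
PodMissing = Class.new(StandardError)

# Crude model of Sidekiq's retry loop for a worker declared with `retry: 3`:
# the job runs once, is retried up to three more times, then the final
# failure is reported.
def run_with_retries(max_retries)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue PodMissing
    retry if attempts <= max_retries
    raise
  end
end

attempts_seen = []
final_error = nil
begin
  run_with_retries(3) do |n|
    attempts_seen << n
    # The install pod was deleted after success, so every attempt 404s.
    raise PodMissing, 'pods "install-runner" not found'
  end
rescue PodMissing => e
  final_error = e # this is the failure Sentry would record
end
```

(Real Sidekiq also applies an exponential backoff between retries, which this model omits.)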
### Output of checks

This bug happens on GitLab.com (see the Sentry issues linked above).
### Possible fixes

The 404 is raised from `Clusters::Applications::CheckInstallationProgressService` (`check_installation_progress_service.rb:50` in the log above) when it looks up the `install-<application>` pod after another worker instance has already marked the application installed and deleted the pod.
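One possible direction, sketched below in plain Ruby (hypothetical names, not the actual service code): since `development.log` shows the application had already reached status 3 (`installed`) before the failing run, the progress check could return early, or treat a missing `install-<application>` pod as benign, once installation is already recorded as complete:

```ruby
NotFound = Class.new(StandardError)

# Hypothetical guard: if an earlier worker run already marked the
# application installed (and cleaned up the install pod), a later run
# should not reinterpret the missing pod as a failure.
def check_installation!(pods, app)
  return app[:status] if app[:status] == :installed # earlier run finished

  phase = pods.fetch("install-runner") { raise NotFound }
  if phase == :succeeded
    app[:status] = :installed
    pods.delete("install-runner")
  end
  app[:status]
rescue NotFound
  app[:status] = :update_errored
end

pods = { "install-runner" => :succeeded }
app  = { status: :installing }

check_installation!(pods, app) # first run: marks installed, deletes pod
check_installation!(pods, app) # retried run: early return, no spurious 404
```

A complete fix would also need an atomic state transition: the two near-simultaneous status-6 `UPDATE`s in the log suggest the early-return check alone could still race if both runs read the status before either commits.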