Possible race condition when checking whether an application install has completed
### Summary

When installing a cluster application, the installation eventually errors with:

```
Kubernetes error: 404
```

### Steps to reproduce

(As this is a race condition, it is not currently possible to reproduce it reliably.)

1. Project > Kubernetes > Create new cluster
1. Install Helm
1. Install another application

### Example Project

### What is the current *bug* behavior?

The application is marked `installed` and the `install-<application>` pod is deleted. Another worker instance then comes along and fails with a 404, because the `install-<application>` pod is now gone.

### What is the expected *correct* behavior?

The application stays `installed`.

### Relevant logs and/or screenshots

kubernetes.log (local):

```
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
```

development.log (local). Note that the runner application was first updated to status 3 (`installed`), then updated to status 6 (`update_errored`):

```
SQL (0.3ms) UPDATE "clusters_applications_runners" SET "status" = $1, "updated_at" = $2 WHERE "clusters_applications_runners"."id" = $3 [["status", 3], ["updated_at", "2019-01-31 09:03:00.186734"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:26
(5.7ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:26
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active!
excluded from capture: DSN not set
(1.7ms) SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:20
(3.0ms) SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:20
(1.1ms) SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:21
(1.3ms) SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1 [["cluster_id", 150]]
↳ app/services/clusters/applications/base_helm_service.rb:21
(1.4ms) BEGIN
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.2ms) BEGIN
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
SQL (1.0ms) UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4 [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.208956"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.4ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active!
excluded from capture: DSN not set
SQL (2.3ms) UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4 [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.210109"], ["id", 65]]
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
(0.4ms) COMMIT
↳ app/services/clusters/applications/check_installation_progress_service.rb:50
```

* https://sentry.gitlab.net/gitlab/gitlabcom/issues/648968/
* https://sentry.gitlab.net/gitlab/gitlabcom/issues/616804/
* https://sentry.gitlab.net/gitlab/gitlabcom/issues/590672/

The Sentry errors I looked at are all of the form below (install-prometheus, install-runner, install-helm, etc.):

```
Kubeclient::ResourceNotFoundError: pods "install-prometheus" not found
```

And they all occur on the 3rd retry. :thinking:

```
queue: gcp_cluster:cluster_wait_for_app_installation, queue_namespace: gcp_cluster, retry: 3
```

### Output of checks

(If you are reporting a bug on GitLab.com, write: This bug happens on GitLab.com)

#### Results of GitLab environment info

<details>
<summary>Expand for output related to GitLab environment info</summary>

<pre>
(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
</pre>
</details>

#### Results of GitLab application Check

<details>
<summary>Expand for output related to the GitLab application check</summary>

<pre>
(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)
</pre>
</details>

### Possible fixes

(If you can, link to the line of code that might be responsible for the problem.)
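One direction worth considering: make the progress check idempotent, so a duplicated or retried job that runs after the app has already been marked `installed` (and the install pod deleted) short-circuits instead of querying the missing pod. The sketch below is a hypothetical, heavily simplified stand-in — `App`, `check_progress`, and `PodGoneError` are illustrative names only, not GitLab's actual `CheckInstallationProgressService` code.

```ruby
# Simplified model of the application's state machine.
class App
  attr_reader :status

  def initialize
    @status = :installing
  end

  def make_installed!
    @status = :installed
  end

  def make_errored!(reason)
    @status = :update_errored
    @status_reason = reason
  end

  def installing?
    @status == :installing
  end
end

# Stand-in for Kubeclient::ResourceNotFoundError (the 404 in the logs above).
class PodGoneError < StandardError; end

# Hypothetical, simplified progress check. The guard on `app.installing?`
# makes a retried job a no-op once a previous run has already finished,
# instead of looking up the now-deleted install pod and erroring with 404.
def check_progress(app, pod_exists:)
  return app.status unless app.installing?

  raise PodGoneError, 'pods "install-runner" not found' unless pod_exists

  app.make_installed!
  app.status
end

app = App.new
check_progress(app, pod_exists: true)   # first job: pod present, app marked installed
check_progress(app, pod_exists: false)  # retried job: pod gone, guard short-circuits
puts app.status  # => installed
```

Without the guard, the second call would raise and the app would be flipped to `update_errored`, which matches the 3 → 6 status transition in the development.log excerpt.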