Possible race condition when checking whether an application install has completed

Summary

When installing a cluster application, the installation eventually errors with:

Kubernetes error: 404

Steps to reproduce

(As this is a race condition, it cannot currently be reproduced reliably.)

  1. Project > Kubernetes > Create new cluster
  2. Install Helm
  3. Install another application

Example Project

What is the current bug behavior?

The application is installed and its install-<application> pod is deleted. Another worker instance then comes along and fails with a 404, because the install-<application> pod is now gone.
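The sequence above can be sketched as a minimal, self-contained Ruby simulation. All class names and the in-memory pod store below are hypothetical stand-ins for illustration, not GitLab's actual implementation:

```ruby
# Hypothetical stand-in for Kubeclient::ResourceNotFoundError.
class FakeNotFoundError < StandardError; end

# In-memory stand-in for the cluster's Kubernetes API.
class FakeCluster
  def initialize
    @pods = { 'install-runner' => 'Succeeded' }
  end

  def pod_phase(name)
    @pods.fetch(name) { raise FakeNotFoundError, %(pods "#{name}" not found) }
  end

  def delete_pod(name)
    @pods.delete(name)
  end
end

# Rough sketch of what the progress check appears to do: read the install
# pod's phase, persist the outcome, and remove the pod once it succeeded.
class CheckProgress
  def initialize(cluster, app)
    @cluster = cluster
    @app = app
  end

  def execute
    if @cluster.pod_phase('install-runner') == 'Succeeded'
      @app[:status] = :installed
      @cluster.delete_pod('install-runner') # pod is gone after this point
    end
  rescue FakeNotFoundError
    @app[:status] = :errored                # a duplicate worker lands here
    @app[:status_reason] = 'Kubernetes error: 404'
  end
end

cluster = FakeCluster.new
app = { status: :installing }

CheckProgress.new(cluster, app).execute # worker A: installed, pod deleted
CheckProgress.new(cluster, app).execute # worker B (duplicate/retry): 404
app[:status] # => :errored, even though the install itself succeeded
```

The second `execute` models the duplicate or retried worker: by the time it looks up the pod, worker A has already deleted it, so the lookup raises and the status is overwritten.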

What is the expected correct behavior?

Application stays installed

Relevant logs and/or screenshots

kubernetes.log (local)

{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}

development.log (local). Note that the runner application's status was first updated to 3 (installed), then updated to 6 (update_errored):

  SQL (0.3ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "updated_at" = $2 WHERE "clusters_applications_runners"."id" = $3  [["status", 3], ["updated_at", "2019-01-31 09:03:00.186734"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:26
   (5.7ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:26
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
   (1.7ms)  SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:20
   (3.0ms)  SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:20
   (1.1ms)  SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:21
   (1.3ms)  SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:21
   (1.4ms)  BEGIN
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.2ms)  BEGIN
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
  SQL (1.0ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4  [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.208956"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.4ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
  SQL (2.3ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4  [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.210109"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.4ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50

The Sentry errors I looked at are all of the form below (install-prometheus, install-runner, install-helm, etc.):

Kubeclient::ResourceNotFoundError: pods "install-prometheus" not found

And they all occur on the 3rd retry. 🤔

queue: gcp_cluster:cluster_wait_for_app_installation, 
queue_namespace: gcp_cluster, 
retry: 3


Possible fixes

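One possible direction, sketched below under the same hypothetical names as above (this is not a reference to actual GitLab code): make the progress check idempotent by treating a 404 on the install pod as benign when the application has already been marked installed, so a late duplicate worker does not overwrite the status.

```ruby
# Hypothetical stand-in for Kubeclient::ResourceNotFoundError.
class FakeNotFoundError < StandardError; end

# Sketch of an idempotent progress check; the guard clause in the rescue
# block is the proposed fix.
class CheckProgress
  def initialize(app, pod_exists:)
    @app = app
    @pod_exists = pod_exists
  end

  def execute
    raise FakeNotFoundError, 'pods "install-runner" not found' unless @pod_exists

    @app[:status] = :installed
  rescue FakeNotFoundError
    # If another worker already recorded success, the missing pod is
    # expected -- leave the status alone instead of flipping to errored.
    return if @app[:status] == :installed

    @app[:status] = :errored
    @app[:status_reason] = 'Kubernetes error: 404'
  end
end

app = { status: :installed } # worker A already finished and deleted the pod
CheckProgress.new(app, pod_exists: false).execute
app[:status] # => still :installed
```

Alternatively, deduplicating the scheduled check jobs (so only one worker polls a given install) would remove the race at the source rather than tolerating it.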

Edited by Thong Kuah