Possible race condition when checking whether an application install has completed

Summary

When installing a cluster application, the installation eventually errors with:

Kubernetes error: 404

Steps to reproduce

(As this is a race condition, it cannot currently be reproduced reliably.)

  1. Project > Kubernetes > Create new cluster
  2. Install Helm
  3. Install another application

Example Project

What is the current bug behavior?

The application is installed and its install-<application> pod is deleted. Another worker instance then comes along and fails with a 404, because the install-<application> pod is now gone.
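The sequence above can be sketched as a minimal, self-contained Ruby simulation. All class names and the in-memory pod store below are hypothetical stand-ins for illustration, not GitLab's actual implementation:

```ruby
# Hypothetical stand-in for Kubeclient::ResourceNotFoundError.
class FakeNotFoundError < StandardError; end

# In-memory stand-in for the cluster's Kubernetes API.
class FakeCluster
  def initialize
    @pods = { 'install-runner' => 'Succeeded' }
  end

  def pod_phase(name)
    @pods.fetch(name) { raise FakeNotFoundError, %(pods "#{name}" not found) }
  end

  def delete_pod(name)
    @pods.delete(name)
  end
end

# Rough sketch of what the progress check appears to do: read the install
# pod's phase, persist the outcome, and remove the pod once it succeeded.
class CheckProgress
  def initialize(cluster, app)
    @cluster = cluster
    @app = app
  end

  def execute
    if @cluster.pod_phase('install-runner') == 'Succeeded'
      @app[:status] = :installed
      @cluster.delete_pod('install-runner') # pod is gone after this point
    end
  rescue FakeNotFoundError
    @app[:status] = :errored                # a duplicate worker lands here
    @app[:status_reason] = 'Kubernetes error: 404'
  end
end

cluster = FakeCluster.new
app = { status: :installing }

CheckProgress.new(cluster, app).execute # worker A: installed, pod deleted
CheckProgress.new(cluster, app).execute # worker B (duplicate/retry): 404
app[:status] # => :errored, even though the install itself succeeded
```

The second `execute` models the duplicate or retried worker: by the time it looks up the pod, worker A has already deleted it, so the lookup raises and the status is overwritten.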

What is the expected correct behavior?

Application stays installed

Relevant logs and/or screenshots

kubernetes.log (local)

{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}
{"severity":"ERROR","time":"2019-01-31T09:03:01.203Z","correlation_id":"ZgmvredQ5d8","exception":"Kubeclient::ResourceNotFoundError","error_code":404,"service":"Clusters::Applications::CheckInstallationProgressService","app_id":65,"project_ids":[54],"group_ids":[],"message":"pods \"install-runner\" not found"}

development.log (local). Note that the runner application's status was first updated to 3 (installed), then updated to 6 (update_errored):

  SQL (0.3ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "updated_at" = $2 WHERE "clusters_applications_runners"."id" = $3  [["status", 3], ["updated_at", "2019-01-31 09:03:00.186734"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:26
   (5.7ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:26
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
   (1.7ms)  SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:20
   (3.0ms)  SELECT "projects".id FROM "projects" INNER JOIN "cluster_projects" ON "projects"."id" = "cluster_projects"."project_id" WHERE "cluster_projects"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:20
   (1.1ms)  SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:21
   (1.3ms)  SELECT "namespaces".id FROM "namespaces" INNER JOIN "cluster_groups" ON "namespaces"."id" = "cluster_groups"."group_id" WHERE "namespaces"."type" IN ('Group') AND "cluster_groups"."cluster_id" = $1  [["cluster_id", 150]]
  ↳ app/services/clusters/applications/base_helm_service.rb:21
   (1.4ms)  BEGIN
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.2ms)  BEGIN
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
  SQL (1.0ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4  [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.208956"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.4ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
A copy of Gitlab::Metrics::Transaction has been removed from the module tree but is still active! excluded from capture: DSN not set
  SQL (2.3ms)  UPDATE "clusters_applications_runners" SET "status" = $1, "status_reason" = $2, "updated_at" = $3 WHERE "clusters_applications_runners"."id" = $4  [["status", 6], ["status_reason", "Kubernetes error: 404"], ["updated_at", "2019-01-31 09:03:01.210109"], ["id", 65]]
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50
   (0.4ms)  COMMIT
  ↳ app/services/clusters/applications/check_installation_progress_service.rb:50

The Sentry errors I looked at are all of the form below (install-prometheus, install-runner, install-helm, etc.):

Kubeclient::ResourceNotFoundError: pods "install-prometheus" not found

And they all occur on the 3rd retry. 🤔

queue: gcp_cluster:cluster_wait_for_app_installation, 
queue_namespace: gcp_cluster, 
retry: 3


Possible fixes

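One possible direction, sketched below under the same hypothetical names as above (this is not a reference to actual GitLab code): make the progress check idempotent by treating a 404 on the install pod as benign when the application has already been marked installed, so a late duplicate worker does not overwrite the status.

```ruby
# Hypothetical stand-in for Kubeclient::ResourceNotFoundError.
class FakeNotFoundError < StandardError; end

# Sketch of an idempotent progress check; the guard clause in the rescue
# block is the proposed fix.
class CheckProgress
  def initialize(app, pod_exists:)
    @app = app
    @pod_exists = pod_exists
  end

  def execute
    raise FakeNotFoundError, 'pods "install-runner" not found' unless @pod_exists

    @app[:status] = :installed
  rescue FakeNotFoundError
    # If another worker already recorded success, the missing pod is
    # expected -- leave the status alone instead of flipping to errored.
    return if @app[:status] == :installed

    @app[:status] = :errored
    @app[:status_reason] = 'Kubernetes error: 404'
  end
end

app = { status: :installed } # worker A already finished and deleted the pod
CheckProgress.new(app, pod_exists: false).execute
app[:status] # => still :installed
```

Alternatively, deduplicating the scheduled check jobs (so only one worker polls a given install) would remove the race at the source rather than tolerating it.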

Edited by Thong Kuah