500 error when removing certificate-based Kubernetes clusters
https://sentry.gitlab.net/gitlab/gitlabcom/issues/3190282/?referrer=gitlab_plugin
PG::QueryCanceled: ERROR: canceling statement due to statement timeout
CONTEXT: SQL statement "UPDATE ONLY "public"."deployments" SET "cluster_id" = NULL WHERE $1 OPERATOR(pg_catalog.=) "cluster_id""
lib/gitlab/database/load_balancing/connection_proxy.rb:119:in `block in write_using_load_balancer'
connection.send(...)
lib/gitlab/database/load_balancing/load_balancer.rb:112:in `block in read_write'
yield connection
lib/gitlab/database/load_balancing/load_balancer.rb:179:in `retry_with_backoff'
return yield
lib/gitlab/database/load_balancing/load_balancer.rb:110:in `read_write'
retry_with_backoff do
lib/gitlab/database/load_balancing/connection_proxy.rb:118:in `write_using_load_balancer'
@load_balancer.read_write do |connection|
...
(210 additional frame(s) were not displayed)
ActiveRecord::QueryCanceled: PG::QueryCanceled: ERROR: canceling statement due to statement timeout
CONTEXT: SQL statement "UPDATE ONLY "public"."deployments" SET "cluster_id" = NULL WHERE $1 OPERATOR(pg_catalog.=) "cluster_id""
Summary
When attempting to remove a certificate-based Kubernetes cluster, customers have reported 500 errors. After gaining permission to remove the cluster on the customers' behalf, support was able to replicate this behavior via the console. The issue appears to stem from the number of deployments associated with the cluster: the statement that sets cluster_id to NULL on the deployments table exceeds the statement timeout. Details can be found in the following internal escalations:
As we are deprecating this feature in GitLab 15.0, customers need to migrate to the GitLab agent for Kubernetes. Being unable to remove a cluster could be seen as a blocker for that migration as we near the release of GitLab 15.0.
Steps to reproduce
Attempt to remove a cluster with a large number of deployments and observe the 500 error.
What is the current bug behavior?
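Removing a certificate-based Kubernetes cluster that has a large number of deployments returns a 500 error. The request fails with ActiveRecord::QueryCanceled because the UPDATE on deployments hits the statement timeout.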
What is the expected correct behavior?
Customers should be able to remove a certificate-based Kubernetes cluster without error.
Relevant logs and/or screenshots
json.exception.class: ActiveRecord::QueryCanceled
json.exception.message: PG::QueryCanceled: ERROR: canceling statement due to statement timeout CONTEXT: SQL statement "UPDATE ONLY "public"."deployments" SET "cluster_id" = NULL WHERE $1 OPERATOR(pg_catalog.=) "cluster_id""
Output of checks
This bug happens on GitLab.com
"Workaround"
As the certificate-based cluster feature is being deprecated in 15.0, and the inability to remove a cluster should not be a blocker for migration to KAS, the recommendation is to disable the cluster (instead of removing it) and proceed with the migration.
Proposal
Update the two places where cluster records are destroyed (Clusters::DestroyService and Clusters::Cleanup::ServiceAccountService) to first update/delete associated deployments and deployment_clusters in batches, instead of relying on cascading foreign keys.
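The batching idea in the proposal can be illustrated with a minimal, self-contained sketch. This is not the GitLab implementation: the Deployment struct, the nullify_cluster_in_batches helper, and the batch sizes are all hypothetical, and an in-memory Array stands in for the deployments table. The point is only that each iteration touches a bounded number of rows, so no single statement has to cover every deployment for the cluster.

```ruby
# Hypothetical in-memory stand-in for the deployments table. In GitLab this
# would be a database table accessed through ActiveRecord, not an Array.
Deployment = Struct.new(:id, :cluster_id)

# Clears cluster_id on rows that reference the given cluster, one bounded
# batch at a time. Each loop iteration corresponds to one short UPDATE
# statement rather than a single UPDATE over all matching rows.
# Returns the number of batches that were processed.
def nullify_cluster_in_batches(deployments, cluster_id, batch_size: 1000)
  batches = 0
  loop do
    # Pick at most batch_size rows that still reference the cluster.
    batch = deployments.select { |d| d.cluster_id == cluster_id }.first(batch_size)
    break if batch.empty?

    batch.each { |d| d.cluster_id = nil }
    batches += 1
  end
  batches
end

# Example data: 2500 deployments, half of which belong to cluster 42.
deployments = (1..2500).map { |i| Deployment.new(i, i.even? ? 42 : 7) }
batches = nullify_cluster_in_batches(deployments, 42, batch_size: 500)
```

Once no deployment references the cluster, deleting the cluster record no longer triggers a large cascading update. In a real service the same loop would presumably be applied to deployment_clusters as well, with deletes instead of updates where appropriate.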