`gstg` deployment failing due to helm being stuck
What happened
Earlier today, a gstg deployment failed with this error:
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
More logs
```
Error: identified at least one change, exiting with non-zero exit code (detailed-exitcode parameter enabled)
Error: plugin "diff" exited with error
helm.go:86: 2025-02-03 04:27:26.38603708 +0000 UTC m=+11.257154310 [debug] plugin "diff" exited with error
Comparing release=gitlab, chart=../../vendor/charts/gitlab/gstg, namespace=gitlab
history.go:56: 2025-02-03 04:27:27.101998148 +0000 UTC m=+0.134467833 [debug] getting history for release gitlab
upgrade.go:164: 2025-02-03 04:27:27.684416333 +0000 UTC m=+0.716886028 [debug] preparing upgrade for gitlab
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
helm.go:86: 2025-02-03 04:27:27.974052107 +0000 UTC m=+1.006521802 [debug] another operation (install/upgrade/rollback) is in progress
```

This error occurred only in deployments to the us-east1-c and us-east1-d zonal GKE clusters. The other zonal cluster (us-east1-b) and the regional cluster (us-east1) were not affected.
The output of `helm list --all` showed only that the upgrade was pending. That status did not change after the CI job failed, as the release history shows:
$ ./linux-amd64/helm history gitlab
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
2102 Fri Jan 31 02:22:01 2025 superseded gitlab-8.8.1 master Upgrade complete
2103 Fri Jan 31 06:03:59 2025 superseded gitlab-8.8.1 master Upgrade complete
2104 Fri Jan 31 08:57:03 2025 superseded gitlab-8.8.1 master Upgrade complete
2105 Fri Jan 31 10:43:40 2025 superseded gitlab-8.8.1 master Upgrade complete
2106 Fri Jan 31 12:37:38 2025 superseded gitlab-8.8.1 master Upgrade complete
2107 Fri Jan 31 14:59:47 2025 superseded gitlab-8.8.1 master Upgrade complete
2108 Fri Jan 31 16:36:41 2025 superseded gitlab-8.8.1 master Upgrade complete
2109 Fri Jan 31 18:13:48 2025 superseded gitlab-8.8.1 master Upgrade complete
2110 Fri Jan 31 20:25:26 2025 deployed gitlab-8.8.1 master Upgrade complete
2111 Mon Feb 3 01:56:41 2025 pending-upgrade gitlab-8.8.1 master Preparing upgrade
This error happened after the k-ctl upgrade command was run.
To resolve the issue, we rolled back the Helm release to the last successfully deployed revision:
helm rollback gitlab 2110
Once the rollback was completed, we retried the job. This fixed the issue and the job succeeded.
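The manual step above (find the last `deployed` revision behind a stuck `pending-upgrade` one) could be sketched as a small helper. This is only an illustration, not part of the existing runbook or tooling: the function names `rollback_target` and `fetch_history` are hypothetical, and the sketch assumes that `helm history <release> --output json` returns a list of objects with `revision` and `status` fields.

```python
import json
import subprocess

# Helm statuses that indicate an operation was started but never finished.
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def rollback_target(history):
    """Return the revision to roll back to, or None if the release is not stuck.

    `history` is a list of dicts with "revision" and "status" keys, which is
    the shape `helm history <release> --output json` is assumed to produce.
    """
    if not history:
        return None
    ordered = sorted(history, key=lambda entry: entry["revision"])
    if ordered[-1]["status"] not in PENDING_STATUSES:
        return None  # latest revision is not pending, nothing to unstick
    deployed = [e["revision"] for e in ordered if e["status"] == "deployed"]
    return deployed[-1] if deployed else None


def fetch_history(release, namespace):
    """Run `helm history` and parse its JSON output (requires helm on PATH)."""
    out = subprocess.run(
        ["helm", "history", release, "--namespace", namespace, "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

For the history shown above, `rollback_target` would return 2110, which matches the revision we passed to `helm rollback` by hand. The helper only suggests a revision; running the rollback itself should stay a deliberate manual step.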
Since this is already resolved, I'm going to use this issue for deployment blocker tracking.
Past occurrences
We resolved the issue by rolling back the Helm release to an older revision. Searching Slack turned up similar situations in the past:
- https://gitlab.slack.com/archives/C0139MAV672/p1736869593105059?thread_ts=1736867637.288879&cid=C0139MAV672
  - The incident referred to here is gitlab-com/gl-infra/production#19116 (closed).
  - The solution was to apply the steps in the runbook, as directed on Slack.
- https://gitlab.slack.com/archives/C8PKBH3M5/p1734581337351889?thread_ts=1734571715.065369&cid=C8PKBH3M5
  - This was a clearer case with many pods crashing due to a missing secret.
  - The problem was resolved by rolling back to a previous release.
  - The underlying secret issue was fixed with a PR: gitlab-com/gl-infra/k8s-workloads/gitlab-com!4038 (merged)
Proposal
A runbook already covers what to do when we encounter the error above. Should we also add an entry to our release docs so that release managers can find it more easily? It could simply point to the helm stuck runbook.


