`gstg` deployment failing due to helm being stuck

What happened

Earlier today, a gstg deployment failed with this error:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
More logs ``` Error: identified at least one change, exiting with non-zero exit code (detailed-exitcode parameter enabled) Error: plugin "diff" exited with error helm.go:86: 2025-02-03 04:27:26.38603708 +0000 UTC m=+11.257154310 [debug] plugin "diff" exited with error Comparing release=gitlab, chart=../../vendor/charts/gitlab/gstg, namespace=gitlab history.go:56: 2025-02-03 04:27:27.101998148 +0000 UTC m=+0.134467833 [debug] getting history for release gitlab upgrade.go:164: 2025-02-03 04:27:27.684416333 +0000 UTC m=+0.716886028 [debug] preparing upgrade for gitlab Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress helm.go:86: 2025-02-03 04:27:27.974052107 +0000 UTC m=+1.006521802 [debug] another operation (install/upgrade/rollback) is in progress ```

This error was happening with deployments in the us-east1-c and the us-east1-d zonal GKE clusters only. The other zonal cluster in us-east1-b and the regional cluster in us-east1 did not face any issue.

The output of helm list --all only said that the upgrade was pending. Once the CI job failed, this upgrade's status did not change.

$ ./linux-amd64/helm history gitlab
REVISION        UPDATED                         STATUS          CHART           APP VERSION     DESCRIPTION
2102            Fri Jan 31 02:22:01 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2103            Fri Jan 31 06:03:59 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2104            Fri Jan 31 08:57:03 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2105            Fri Jan 31 10:43:40 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2106            Fri Jan 31 12:37:38 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2107            Fri Jan 31 14:59:47 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2108            Fri Jan 31 16:36:41 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2109            Fri Jan 31 18:13:48 2025        superseded      gitlab-8.8.1    master          Upgrade complete
2110            Fri Jan 31 20:25:26 2025        deployed        gitlab-8.8.1    master          Upgrade complete
2111            Mon Feb  3 01:56:41 2025        pending-upgrade gitlab-8.8.1    master          Preparing upgrade

This error happened after the k-ctl upgrade command was run.

In order to resolve the issue, we rolled back the Helm chart to the previous deployed revision:

helm rollback gitlab 2110

Once the rollback was completed, we retried the job. This fixed the issue and the job succeeded.

Since this is already resolved. I'm going to use this for deployment blocker tracking.

Past occurrences

What we did to resolve the issue is to roll back the helm chart to an older revision. By searching on Slack we found similar situations before:

Proposal

There's a runbook on what we need to do when we encounter the error above, but should we also add a similar one to our release docs so that it's easier for release managers to find it (it can also just point to the helm stuck runbook)?

Edited by Siddharth Kannan