Operator loops pausing/unpausing webservice deployment, for a very long time, while doing minor or major upgrades
Summary
When upgrading the chart across minor or major versions, the operator can spend a very long time in a loop trying to pause/unpause webservice deployments.
Steps to reproduce
- GitLab instance running under Operator 0.13.2, GitLab Chart version 6.5.2. Web interface claims that "Gitlab 15.5.2" is running.
- Perform upgrade to operator 0.14.1. Get warning that chart version 6.5.2 is not supported.
- Upgrade the chart to version 6.5.6 in the GitLab custom resource.
- Wait for the GitLab custom resource to fully reconcile.
(as far as we can tell, this has happened for every upgrade, minor or major, for both our GitLab instances using the operator on AKS)
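For reference, the chart version bump in the third step can be applied as a merge patch against the GitLab resource. This is a minimal sketch, not necessarily how the upgrade was performed here: it assumes the resource is named gitlab in the gitlab-system namespace (as in the configuration used), that the resource is addressable as gitlabs.apps.gitlab.com, and the function name set_chart_version is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the operator): bump spec.chart.version
# on a GitLab custom resource with a JSON merge patch.
# Assumption: the CRD is served as gitlabs.apps.gitlab.com.
# Usage: set_chart_version <namespace> <name> <chart-version>
set_chart_version() {
  local namespace="$1" name="$2" version="$3"
  kubectl --namespace "$namespace" patch gitlabs.apps.gitlab.com "$name" \
    --type merge \
    --patch "{\"spec\":{\"chart\":{\"version\":\"${version}\"}}}"
}

# Example (commented out so that sourcing this file changes nothing):
# set_chart_version gitlab-system gitlab 6.5.6
```

A merge patch only touches spec.chart.version, so the values block is left as it is.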
Configuration used
apiVersion: apps.gitlab.com/v1beta1
kind: GitLab
metadata:
  name: gitlab
  namespace: gitlab-system
spec:
  chart:
    version: 6.5.6
    values:
      certmanager:
        install: false
      certmanager-issuer:
        email: REDACTED
      prometheus:
        install: false
      gitlab:
        webservice:
          monitoring:
            ipWhitelist: REDACTED
        gitaly:
          persistence:
            size: 64Gi
        toolbox:
          backups:
            cron:
              enabled: true
              schedule: 00 01 * * *
            objectStorage:
              config:
                secret: swh-minio-credentials
                key: s3cfg-config
      global:
        edition: ce
        hosts:
          domain: softwareheritage.org
          externalIP:
          https: true
          registry:
            name: container-registry.softwareheritage.org
        appConfig:
          backups:
            bucket: backup-gitlab
          artifacts:
            bucket: artifacts
          dependency_proxy:
            bucket: dependency-proxy
          external_diffs:
            bucket: external-diffs
          lfs:
            bucket: lfs-objects
          object_store:
            connection:
              secret: swh-azure-storage-credentials
            enabled: true
            proxy_download: true
          packages:
            bucket: packages
          pages:
            bucket: pages
          registry:
            bucket: registry
          terraform_state:
            bucket: terraform
          uploads:
            bucket: uploads
        email:
          display_name: Gitlab
          from: REDACTED
          reply-to: REDACTED
        gitaly:
          metrics.enabled: true
        ingress:
          annotations:
            kubernetes.io/tls-acme: true
          configureCertmanager: true
          tls:
            enabled: true
        minio:
          enabled: false
        smtp:
          address: REDACTED
          authentication: plain
          enabled: true
          password:
            secret: swh-smtp-password
          port: 465
          tls: true
          user_name: REDACTED
      registry:
        storage:
          secret: swh-azure-storage-registry-credentials
          key: config
Current behavior
The reconcile operation takes more than an hour and loops more than ten thousand times.
It eventually succeeds, but it's not clear what, if anything, changes for that to happen (we've tried scaling the number of nodes, deleting all jobs, scaling the deployments manually, none of which seem to reliably change anything).
Expected behavior
The reconcile operation completes as soon as the deployments of the new version of the webservice container succeed. It does not retry ten thousand times.
Versions
- Operator: 0.14.1
- Platform:
  - Cloud: AKS
  - Kubernetes: (kubectl version)
    - Client Version: v1.25.4 (not actually used during this upgrade?)
    - Kustomize Version: v4.5.7
    - Server Version: v1.22.15
Relevant logs
Full log since a rollout-restart of the operator deployment that happened after the update of the GitLab custom resource: gitlab-controller-manager.log.gz
$ kubectl --context=euwest-gitlab-production -n gitlab-system logs deployment/gitlab-controller-manager --container=manager --since '2h' | grep controllers.GitLab | cut -c '50-' | sort | uniq -c
1 Changing replica count of deployment with HPA {"deployment": "gitlab-system/gitlab-gitlab-shell", "replicas": 2}
1 Changing replica count of deployment with HPA {"deployment": "gitlab-system/gitlab-kas", "replicas": 2}
1 Changing replica count of deployment with HPA {"deployment": "gitlab-system/gitlab-registry", "replicas": 2}
1 Changing replica count of deployment with HPA {"deployment": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "replicas": 1}
1 Changing replica count of deployment with HPA {"deployment": "gitlab-system/gitlab-webservice-default", "replicas": 2}
7860 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v1.Deployment", "reference": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "outcome": "updated"}
8633 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v1.Deployment", "reference": "gitlab-system/gitlab-webservice-default", "outcome": "updated"}
1 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-migrations-1-9f5", "outcome": "created"}
1 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-migrations-1-9f5-pre", "outcome": "created"}
2 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-shared-secrets-1-bkz", "outcome": "created"}
2 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-gitlab-shell", "outcome": "updated"}
1 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-kas", "outcome": "updated"}
1 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-registry", "outcome": "updated"}
4 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "outcome": "updated"}
4 CreateOrPatch {"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-webservice-default", "outcome": "updated"}
10254 createOrUpdate result {"gitlab": "gitlab-system/gitlab", "type": "*v1.Issuer", "reference": "gitlab-system/gitlab-issuer", "result": "updated"}
63 ensuring Sidekiq Deployments are running {"gitlab": "gitlab-system/gitlab"}
8618 ensuring Sidekiq Deployments are unpaused {"gitlab": "gitlab-system/gitlab"}
7843 ensuring Webservice Deployments are running {"gitlab": "gitlab-system/gitlab"}
10238 ensuring Webservice Deployments are unpaused {"gitlab": "gitlab-system/gitlab"}
10261 Reconciling GitLab {"gitlab": "gitlab-system/gitlab"}
2 reconciling post migrations {"gitlab": "gitlab-system/gitlab"}
10246 reconciling pre migrations {"gitlab": "gitlab-system/gitlab"}
8 reconciling Sidekiq Deployments {"gitlab": "gitlab-system/gitlab", "pause": false}
10246 reconciling Sidekiq Deployments {"gitlab": "gitlab-system/gitlab", "pause": true}
8 reconciling Webservice Deployments {"gitlab": "gitlab-system/gitlab", "pause": false}
10246 reconciling Webservice Deployments {"gitlab": "gitlab-system/gitlab", "pause": true}
8 running all migrations {"gitlab": "gitlab-system/gitlab"}
10254 self-signed certificates job skipped, not needed per configuration {"gitlab": "gitlab-system/gitlab"}
1 Using batch/v1beta1 for CronJob
1 Using batch/v1 for CronJob
1 Using cert-manager.io/v1
1 Using monitoring.coreos.com/v1
8 version information {"gitlab": "gitlab-system/gitlab", "upgrade": false, "current version": "6.5.6", "desired version": "6.5.6"}
10253 version information {"gitlab": "gitlab-system/gitlab", "upgrade": true, "current version": "6.5.2", "desired version": "6.5.6"}
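To make the pause/unpause imbalance easier to spot, the same log stream can be reduced to just the webservice reconcile messages, split by the value of the pause flag. This is a rough sketch that assumes the raw log lines contain the message text and the "pause" key exactly as they appear in the summary above; count_webservice_pause is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read operator log lines on stdin and count the
# "reconciling Webservice Deployments" messages per value of "pause".
count_webservice_pause() {
  awk '
    /reconciling Webservice Deployments/ {
      if ($0 ~ /"pause": *true/)       paused++
      else if ($0 ~ /"pause": *false/) unpaused++
    }
    END { printf "pause=true %d\npause=false %d\n", paused, unpaused }
  '
}

# Example (commented out; needs cluster access):
# kubectl --context=euwest-gitlab-production -n gitlab-system \
#   logs deployment/gitlab-controller-manager --container=manager --since '2h' \
#   | count_webservice_pause
```

On the window summarized above, the two numbers should correspond to the 10246 and 8 counts for the webservice lines.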
I'm at a loss as to what other diagnostics would help here. When running kubectl describe on the webservice deployment, it seems to flip between paused=false and paused=true at each iteration of the loop.
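One way to put a number on that flipping is to sample .spec.paused on the deployment and count the transitions. The sketch below is only an illustration: count_pause_flips is a hypothetical helper, the deployment name is taken from the log output above, and sampling once per second will miss flips that happen faster than that.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one paused value per line on stdin
# ("true", "false", or empty) and print how many times it changed.
# An empty line is treated as "false", because Kubernetes omits
# .spec.paused from the serialized object when it is false.
count_pause_flips() {
  awk '
    { cur = ($0 == "true") ? "true" : "false" }
    NR > 1 && cur != prev { flips++ }
    { prev = cur }
    END { printf "%d\n", flips }
  '
}

# Example (commented out; needs cluster access). Sample the flag once a
# second for a minute and count the transitions:
# for _ in $(seq 60); do
#   kubectl -n gitlab-system get deployment gitlab-webservice-default \
#     -o jsonpath='{.spec.paused}{"\n"}'
#   sleep 1
# done | count_pause_flips
```

A count close to the number of reconcile iterations in the same interval would support the idea that every loop toggles the flag.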