Operator loops pausing/unpausing webservice deployments for a very long time during minor or major upgrades

Summary

When upgrading the chart across minor or major versions, the operator can spend a very long time in a loop trying to pause/unpause webservice deployments.

Steps to reproduce

  1. GitLab instance running under Operator 0.13.2, GitLab Chart version 6.5.2. The web interface reports that "GitLab 15.5.2" is running.
  2. Perform the upgrade to operator 0.14.1; get a warning that chart version 6.5.2 is not supported.
  3. Upgrade the chart to version 6.5.6 in the GitLab CRD (one way to do this from the command line is sketched after this list).
  4. Wait for the GitLab CRD to fully reconcile.

(As far as we can tell, this has happened on every upgrade, minor or major, for both of our GitLab instances running under the operator on AKS.)
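
For reference, a minimal sketch of how steps 3 and 4 can be done from the command line, assuming the manifest below is saved in a file named gitlab.yaml (a hypothetical name for this example) and that the custom resource is exposed as "gitlab" in the gitlab-system namespace:

$ # step 3: bump spec.chart.version to 6.5.6 in the manifest, then apply it
$ kubectl --context=euwest-gitlab-production -n gitlab-system apply -f gitlab.yaml
$ # step 4: watch the GitLab resource until the operator reports the reconcile as complete
$ kubectl --context=euwest-gitlab-production -n gitlab-system get gitlab gitlab --watch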

Configuration used

apiVersion: apps.gitlab.com/v1beta1
kind: GitLab
metadata:
  name: gitlab
  namespace: gitlab-system
spec:
  chart:
    version: 6.5.6
    values:
      certmanager:
        install: false
      certmanager-issuer:
        email: REDACTED
      prometheus:
        install: false
      gitlab:
        webservice:
          monitoring:
            ipWhitelist: REDACTED
        gitaly:
          persistence:
            size: 64Gi
        toolbox:
          backups:
            cron:
              enabled: true
              schedule: 00 01 * * *
            objectStorage:
              config:
                secret: swh-minio-credentials
                key: s3cfg-config
      global:
        edition: ce
        hosts:
          domain: softwareheritage.org
          externalIP: 
          https: true
          registry:
            name: container-registry.softwareheritage.org
        appConfig:
          backups:
            bucket: backup-gitlab
          artifacts:
            bucket: artifacts
          dependency_proxy:
            bucket: dependency-proxy
          external_diffs:
            bucket: external-diffs
          lfs:
            bucket: lfs-objects
          object_store:
            connection:
              secret: swh-azure-storage-credentials
            enabled: true
            proxy_download: true
          packages:
            bucket: packages
          pages:
            bucket: pages
          registry:
            bucket: registry
          terraform_state:
            bucket: terraform
          uploads:
            bucket: uploads
        email:
          display_name: Gitlab
          from: REDACTED
          reply-to: REDACTED
        gitaly:
          metrics.enabled: true
        ingress:
          annotations:
            kubernetes.io/tls-acme: true
          configureCertmanager: true
          tls:
            enabled: true
        minio:
          enabled: false
        smtp:
          address: REDACTED
          authentication: plain
          enabled: true
          password:
            secret: swh-smtp-password
          port: 465
          tls: true
          user_name: REDACTED
      registry:
        storage:
          secret: swh-azure-storage-registry-credentials
          key: config

Current behavior

The reconcile operation takes more than an hour and loops more than ten thousand times.

It eventually succeeds, but it's not clear what, if anything, changes to make that happen (we've tried scaling the number of nodes, deleting all jobs, and scaling the deployments manually, none of which reliably changes anything).
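
For reference, a sketch of commands that can be used while the operator is looping to check whether the new webservice pods themselves have actually rolled out (the app=webservice label selector is an assumption based on the chart's default labels):

$ # check the rollout of the new webservice ReplicaSet directly
$ kubectl --context=euwest-gitlab-production -n gitlab-system rollout status deployment/gitlab-webservice-default
$ # list the webservice pods and their readiness
$ kubectl --context=euwest-gitlab-production -n gitlab-system get pods -l app=webservice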

Expected behavior

The reconcile operation completes as soon as the rollouts of the new webservice container version succeed. It does not retry ten thousand times.

Versions

  • Operator: 0.14.1
  • Platform:
    • Cloud: AKS
  • Kubernetes: (kubectl version)
    • Client Version: v1.25.4 (not actually used during this upgrade?)
    • Kustomize Version: v4.5.7
    • Server Version: v1.22.15

Relevant logs

Full log since a rollout restart of the operator deployment, performed after the GitLab CRD was updated: gitlab-controller-manager.log.gz

$ kubectl --context=euwest-gitlab-production -n gitlab-system logs deployment/gitlab-controller-manager --container=manager --since '2h' | grep controllers.GitLab | cut -c '50-' | sort | uniq -c
      1 	Changing replica count of deployment with HPA	{"deployment": "gitlab-system/gitlab-gitlab-shell", "replicas": 2}
      1 	Changing replica count of deployment with HPA	{"deployment": "gitlab-system/gitlab-kas", "replicas": 2}
      1 	Changing replica count of deployment with HPA	{"deployment": "gitlab-system/gitlab-registry", "replicas": 2}
      1 	Changing replica count of deployment with HPA	{"deployment": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "replicas": 1}
      1 	Changing replica count of deployment with HPA	{"deployment": "gitlab-system/gitlab-webservice-default", "replicas": 2}
   7860 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Deployment", "reference": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "outcome": "updated"}
   8633 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Deployment", "reference": "gitlab-system/gitlab-webservice-default", "outcome": "updated"}
      1 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-migrations-1-9f5", "outcome": "created"}
      1 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-migrations-1-9f5-pre", "outcome": "created"}
      2 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Job", "reference": "gitlab-system/gitlab-shared-secrets-1-bkz", "outcome": "created"}
      2 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-gitlab-shell", "outcome": "updated"}
      1 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-kas", "outcome": "updated"}
      1 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-registry", "outcome": "updated"}
      4 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-sidekiq-all-in-1-v2", "outcome": "updated"}
      4 	CreateOrPatch	{"gitlab": "gitlab-system/gitlab", "type": "*v2beta1.HorizontalPodAutoscaler", "reference": "gitlab-system/gitlab-webservice-default", "outcome": "updated"}
  10254 	createOrUpdate result	{"gitlab": "gitlab-system/gitlab", "type": "*v1.Issuer", "reference": "gitlab-system/gitlab-issuer", "result": "updated"}
     63 ensuring Sidekiq Deployments are running	{"gitlab": "gitlab-system/gitlab"}
   8618 ensuring Sidekiq Deployments are unpaused	{"gitlab": "gitlab-system/gitlab"}
   7843 ensuring Webservice Deployments are running	{"gitlab": "gitlab-system/gitlab"}
  10238 ensuring Webservice Deployments are unpaused	{"gitlab": "gitlab-system/gitlab"}
  10261 Reconciling GitLab	{"gitlab": "gitlab-system/gitlab"}
      2 reconciling post migrations	{"gitlab": "gitlab-system/gitlab"}
  10246 reconciling pre migrations	{"gitlab": "gitlab-system/gitlab"}
      8 reconciling Sidekiq Deployments	{"gitlab": "gitlab-system/gitlab", "pause": false}
  10246 reconciling Sidekiq Deployments	{"gitlab": "gitlab-system/gitlab", "pause": true}
      8 reconciling Webservice Deployments	{"gitlab": "gitlab-system/gitlab", "pause": false}
  10246 reconciling Webservice Deployments	{"gitlab": "gitlab-system/gitlab", "pause": true}
      8 running all migrations	{"gitlab": "gitlab-system/gitlab"}
  10254 self-signed certificates job skipped, not needed per configuration	{"gitlab": "gitlab-system/gitlab"}
      1 Using batch/v1beta1 for CronJob
      1 Using batch/v1 for CronJob
      1 Using cert-manager.io/v1
      1 Using monitoring.coreos.com/v1
      8 	version information	{"gitlab": "gitlab-system/gitlab", "upgrade": false, "current version": "6.5.6", "desired version": "6.5.6"}
  10253 	version information	{"gitlab": "gitlab-system/gitlab", "upgrade": true, "current version": "6.5.2", "desired version": "6.5.6"}

I'm a bit at a loss as to what other diagnostics would be useful here. When describing the webservice deployment, its paused field seems to flip between false and true on each iteration of the loop.
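
A minimal sketch of how that flip can be watched without repeatedly running describe (note that spec.paused is omitted from the object when false, so the line prints empty in that case):

$ # poll the webservice Deployment's spec.paused field every few seconds
$ while true; do kubectl --context=euwest-gitlab-production -n gitlab-system get deployment gitlab-webservice-default -o jsonpath='{.spec.paused}{"\n"}'; sleep 5; done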
