Random registry push failures

Summary

I've installed the GitLab Helm chart on a 2-node DigitalOcean cluster with a custom installation of the ingress-nginx Helm chart from https://kubernetes.github.io/ingress-nginx. I'm getting frequent, seemingly random registry errors when uploading blobs (with buildah or genuinetools/img):

error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]<REGISTRY_IMG_URL>" to "REGISTRY_IMG_URL": Error writing blob: Error uploading layer chunked: received unexpected HTTP status: 504 Gateway Time-out

or

error pushing image "REGISTRY_IMG_URL" to "docker://REGISTRY_IMG_URL": error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]REGISTRY_IMG_URL" to "docker://REGISTRY_IMG_URL": Error trying to reuse blob sha256:c95d2191d7773c6e29188f92922bc9547e1f0b6130e85dfc2f5e4eae13137c7c at destination: Requesting bear token: invalid status code from registry 500 (Internal Server Error)

Note that these errors are random and occur about 50% of the time. Simply rerunning the GitLab CI job often resolves the issue.
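Since a rerun usually succeeds, a stopgap is to retry the push inside the job itself. Below is a minimal sketch of that idea, not a fix for the underlying problem. The helper name `push_with_retry`, the attempt count, and the delays are my own assumptions; the `run` and `sleep` parameters exist only so the logic can be exercised without a real registry.

```python
import subprocess
import time


def push_with_retry(cmd, attempts=3, base_delay=2.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or attempts are exhausted.

    Returns the number of the attempt that succeeded (1-based) and
    raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"push failed after {attempts} attempts: {cmd}")


# Hypothetical usage in a CI job (image names are placeholders):
# push_with_retry(["buildah", "push", "REGISTRY_IMG_URL", "docker://REGISTRY_IMG_URL"])
```

This only hides the symptom; the 504 and 500 responses still need to be explained.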

Steps to reproduce

Deploy the GitLab Helm chart to a managed DigitalOcean cluster. I have a DigitalOcean Load Balancer set up via the ingress-nginx Helm chart from https://kubernetes.github.io/ingress-nginx. I also use a custom installation of the cert-manager chart from https://charts.jetstack.io.

Proxy Protocol is disabled on the Load Balancer (I don't know whether that is related). Enabling it was tricky with cert-manager, so I left it disabled for now.

Configuration used


certmanager:
  install: false
gitlab:
  task-runner:
    backups:
      cron:
        enabled: true
        schedule: 0 5 * * *
      objectStorage:
        config:
          key: config
          secret: gitlab-s3cmd-secret
    enabled: true
  webservice:
    ingress:
      tls:
        secretName: gitlab-gitlab-tls
gitlab-runner:
  install: true
  runners:
    config: |
      [[runners]]
        [runners.kubernetes]
          cpu_request = "1.5"
          memory_request = "2.5Gi"
          service_cpu_limit = "1"
          service_memory_limit = "2Gi"
          privileged = true
          poll_timeout = 600
          [runners.kubernetes.pod_labels]
            component = "gitlab-ci-job"
          [runners.kubernetes.pod_annotations]
            "container.apparmor.security.beta.kubernetes.io/build" = "unconfined"
            "container.seccomp.security.alpha.kubernetes.io/build" = "unconfined"
          [runners.kubernetes.node_selector]
            role = "ci-runner"
          [runners.kubernetes.node_tolerations]
            "dedicated_role=ci-runner" = "NoSchedule"
          [[runners.kubernetes.volumes.empty_dir]]
            name = "buildah"
            mount_path = "/var/lib/containers/"
global:
  appConfig:
    artifacts:
      bucket: company-gitlab-artifacts
      connection:
        secret: gitlab-rails-s3-secret
    backups:
      bucket: company-gitlab-backups
      tmpBucket: company-gitlab-backups-tmp
    dependencyProxy:
      bucket: company-gitlab-dependency-proxy
      connection:
        secret: gitlab-rails-s3-secret
    externalDiffs:
      bucket: company-gitlab-external-diffs
      connection:
        secret: gitlab-rails-s3-secret
    initialDefaults:
      signupEnabled: false
    lfs:
      bucket: company-gitlab-lfs
      connection:
        secret: gitlab-rails-s3-secret
    omniauth:
      allowSingleSignOn: true
      autoLinkLdapUser: true
      blockAutoCreatedUsers: false
      enabled: true
      providers:
      - secret: gitlab-azuread-oauth
    packages:
      bucket: company-gitlab-packages
      connection:
        secret: gitlab-rails-s3-secret
    pseudonymizer:
      bucket: company-gitlab-pseudonymizer
      connection:
        secret: gitlab-rails-s3-secret
    terraformState:
      bucket: company-gitlab-terraform-state
      connection:
        secret: gitlab-rails-s3-secret
    uploads:
      bucket: company-gitlab-uploads
      connection:
        secret: gitlab-rails-s3-secret
  hosts:
    domain: git.company.com
    gitlab:
      name: git.company.com
    hostSuffix: bas
  ingress:
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      kubernetes.io/tls-acme: true
    class: nginx
    configureCertmanager: false
  minio:
    enabled: false
  operator:
    enabled: false
  registry:
    bucket: company-gitlab-registry
nginx-ingress:
  enabled: false
prometheus:
  install: false
registry:
  enabled: true
  ingress:
    tls:
      secretName: gitlab-registry-tls
  relativeurls: true
  storage:
    key: config
    secret: gitlab-registry-space

Current behavior

Pushes fail at random with one of these errors:

error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]<REGISTRY_IMG_URL>" to "REGISTRY_IMG_URL": Error writing blob: Error uploading layer chunked: received unexpected HTTP status: 504 Gateway Time-out

or

error pushing image "REGISTRY_IMG_URL" to "docker://REGISTRY_IMG_URL": error copying layers and metadata from "containers-storage:[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]REGISTRY_IMG_URL" to "docker://REGISTRY_IMG_URL": Error trying to reuse blob sha256:c95d2191d7773c6e29188f92922bc9547e1f0b6130e85dfc2f5e4eae13137c7c at destination: Requesting bear token: invalid status code from registry 500 (Internal Server Error)

In the gitlab-workhorse pod logs I see a lot of status 500 errors, but no further explanation as to why:

{"content_type":"text/html; charset=utf-8","correlation_id":"01EZ73N23DQR4JCXG1QD8HQJB1","duration_ms":41,"host":"git.company.com","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"10.244.0.88:44428","remote_ip":"10.244.0.88","route":"","status":500,"system":"http","time":"2021-02-23T09:31:18Z","ttfb_ms":41,"uri":"/jwt/auth?account=gitlab-ci-token\u0026scope=repository%3Acompany%2Ffoundation%2Fmelodic%3Apull%2Cpush\u0026service=container_registry","user_agent":"Buildah/1.19.6","written_bytes":2926}
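To get a feel for how often this happens and on which endpoints, the workhorse access log can be summarized by path and status. The following is a rough sketch that assumes every relevant entry is a single-line JSON object shaped like the one above (with `msg`, `status`, and `uri` fields); the function name `count_failed_requests` is my own.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_failed_requests(lines, min_status=500):
    """Count access-log entries with status >= min_status, keyed by (path, status)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict) or entry.get("msg") != "access":
            continue  # only access-log entries carry status and uri
        status = entry.get("status")
        if isinstance(status, int) and status >= min_status:
            # Drop the query string so requests to the same endpoint group together.
            path = urlsplit(entry.get("uri", "")).path
            counts[(path, status)] += 1
    return counts
```

Feeding it the saved output of `kubectl logs` for the workhorse container (for example via `count_failed_requests(open("workhorse.log"))`, where the file name is a placeholder) shows whether the 500s are concentrated on `/jwt/auth`, as in the entry above.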

Expected behavior

I should be able to push images reliably, without intermittent 500 or 504 errors.

Versions

  • Chart: 4.9.0 (also happens with 4.8.4)
  • Platform:
    • Cloud: DigitalOcean
  • Kubernetes: (kubectl version)
    • Client: 1.20.2
    • Server: 1.20.2
  • Helm: (helm version)
    • Client: v3.5.2
    • Server: -

Relevant logs

See above.