AWS NLB results in many RST packets / 5xx Errors

Summary

Our GitLab instance is deployed in AWS behind a Layer 4 Network Load Balancer (NLB), because the default Layer 7 Application Load Balancer (ALB) does not support ingress on ports other than 80/443, and GitLab Shell requires port 22 to be open. Over time, we've noticed occasional periods when users get a 500 or another timeout-related error. Looking at the LB metrics, we see many RST packets being sent: client resets, target resets (from the target group), and LB resets, meaning all three points are sources of RST packets. We have already configured TCP keepalives on the worker nodes and in the pods to be less than 350s (the idle timeout fixed by AWS for NLBs), so we're not sure what else could be causing these.
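For reference, this is roughly how we check (and, as root, applied) the keepalive settings on the worker nodes; the specific values are ours, and the only hard requirement is that probing starts before the NLB's fixed 350-second idle timeout:

```shell
# Read the effective TCP keepalive settings (works unprivileged).
for k in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
  printf '%s=%s\n' "$k" "$(sysctl -n "net.ipv4.$k")"
done

# Applied (as root) on each node and in the pods; 300s < the NLB's 350s idle timeout.
# sysctl -w net.ipv4.tcp_keepalive_time=300
# sysctl -w net.ipv4.tcp_keepalive_intvl=60
# sysctl -w net.ipv4.tcp_keepalive_probes=5
```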

Steps to reproduce

  1. Deploy GitLab via the Helm chart on Kubernetes and expose the nginx-ingress controller using an NLB.

Configuration used

gitlab:
  global:
    time_zone: "America/New_York"
    hosts:
      https: true
      domain: domain.com
      gitlab:
        name: app.domain.com
    ingress:
      enabled: false
      configureCertmanager: false
      tls:
        enabled: false
    smtp:
      enabled: true
      address: "email-smtp.us-east-1.amazonaws.com"
      port: 587
      domain: domain.com
      user_name: xxxxxxxxxxxx
      password:
        secret: smtp-password
        key: password
      starttls_auto: true
      tls: false
    email:
      from: notify@domain.com
      display_name: Gitlab Notifications
      reply_to: noreply@domain.com
      subject_suffix: "Haven Life"
    minio:
      enabled: false
    registry:
      bucket: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-registry
    appConfig:
      omniauth:
        enabled: true
        allowSingleSignOn: ["google_oauth2"]
        blockAutoCreatedUsers: true
        providers:
          - secret: gitlab-google-oauth2
            key: provider
      lfs:
        bucket: xxxxxxxxxxxxxx-lfs
        connection:
          secret: rails-storage
          key: connection
      artifacts:
        bucket: xxxxxxxxxxxxxx-artifacts
        connection:
          secret: rails-storage
          key: connection
      uploads:
        bucket: xxxxxxxxxxxxxx-uploads
        connection:
          secret: rails-storage
          key: connection
      packages:
        bucket: xxxxxxxxxxxxxx-packages
        connection:
          secret: rails-storage
          key: connection
      externalDiffs:
        enabled: true
        bucket: xxxxxxxxxxxxxx-externaldiffs
        connection:
          secret: rails-storage
          key: connection
      pseudonymizer:
        bucket: xxxxxxxxxxxxxx-pseudonymizer
        connection:
          secret: rails-storage
          key: connection
      backups:
        bucket: xxxxxxxxxxxxxx-backups
        tmpBucket: xxxxxxxxxxxxxx-tmp-backups
        objectStorage:
          backend: s3
    psql:
      host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      database: gitlab
      password:
        secret: postgresql-password
        key: postgres-password
    redis: &redis
      host: xxxxxxxxxxxxxx
      password:
        enabled: false
  # Effectively turn off kubernetes runners since you cannot disable the chart
  gitlab-runner:
    namespace: gitlab
    concurrent: 1
    runners:
      tags: kubernetes-runner
  redis:
    enabled: false
  postgresql:
    install: false
  nginx-ingress:
    enabled: true
    # NOTE: GitLab's Chart exposes port 22 by default
    controller:
      name: gitlab
      config:
        use-forwarded-headers: "true"
        client-header-timeout: "420"
        proxy-stream-timeout: "200s"
      headers:
        X-Forwarded-Ssl: "on"
      configMapNamespace: gitlab
      tcp:
        configMapNamespace: gitlab
      service:
        externalTrafficPolicy: Local
        enableHttp: false
        enableHttps: true
        targetPorts:
          https: http
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: nlb
          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
          external-dns.alpha.kubernetes.io/hostname: app.domain.com.
          # NOTE: the following are no-ops until we upgrade to kube 1.15+; they add the TLS / SSL specific annotations:
          service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "xxx"
          service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-2016-08"
          service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
  certmanager:
    install: false
  gitlab:
    gitlab:
      hostname: app.domain.com
    unicorn:
      ingress:
        enabled: true
        tls:
          enabled: false
        hpa:
          targetAverageValue: 250m
        annotations: &uni-reg-annot
          nginx.ingress.kubernetes.io/force-ssl-redirect: false
          nginx.ingress.kubernetes.io/enable-cors: true
          nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
          nginx.ingress.kubernetes.io/proxy-send-timeout: "45"
      service:
        type: NodePort
      replicaCount: 4
      redis:
        <<: *redis
    task-runner:
      persistence:
        enabled: true
        storageClass: efs-persist-task-runner-prod
        accessMode: ReadWriteMany
        subPath: backups
      resources:
        requests:
          memory: 512M
          cpu: 150m
      backups:
        objectStorage:
          config:
            secret: backups-storage
            key: config
        cron:
          enabled: true
          extraArgs: "--skip registry --skip lfs --skip artifacts --skip uploads --skip packages"
          persistence:
            enabled: true
            size: 48Gi
          schedule: "30 5 * * *"
    gitlab-shell:
      tcpExternalConfig: true
      service:
        type: NodePort
      replicaCount: 4
      annotations:
        <<: *uni-reg-annot
    gitaly:
      persistence:
        storageClass: efs-persist-gitaly-prod
        accessMode: ReadWriteMany
        subPath: data
    sidekiq:
      replicas: 6
      redis:
        <<: *redis
  registry:
    relativeurls: true
    hpa:
      cpu:
        targetAverageUtilization: 30
      minReplicas:
      maxReplicas:
    log:
      level: debug
      fields:
        service: registry
    ingress:
      enabled: true
      tls:
        enabled: false
      proxyReadTimeout: 45
      annotations:
        <<: *uni-reg-annot
    service:
      type: NodePort
    storage:
      secret: registry-storage
      key: config

Current behavior

The LB metrics show a high number of TCP resets originating from the client, the target group, and the LB itself, occasionally surfacing as 5xx or timeout errors for end users.
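The reset counts we're describing come from the `AWS/NetworkELB` CloudWatch namespace; a sketch of how to pull them for the last hour (the `LoadBalancer` dimension value is a placeholder; substitute the `net/<name>/<id>` suffix of your NLB's ARN):

```shell
# Sum each of the three reset metrics in 5-minute buckets over the last hour.
for metric in TCP_Client_Reset_Count TCP_Target_Reset_Count TCP_ELB_Reset_Count; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/NetworkELB \
    --metric-name "$metric" \
    --dimensions Name=LoadBalancer,Value=net/gitlab-ingress/0123456789abcdef \
    --statistics Sum \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
done
```

Requires the AWS CLI with credentials that can read CloudWatch; non-zero values for all three metrics is what we observe.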

Expected behavior

We should see no resets and no 5xx errors.

Versions

  • Chart: 2.6.7
  • Platform:
    • Cloud: EKS
  • Kubernetes: (kubectl version)
    • Client: 1.15.5
    • Server: 1.14
  • Helm: (helm version)
    • Client: 2.15.1
    • Server: 2.15.1