# AWS NLB results in many RST packets / 5xx Errors

## Summary
Our GitLab instance is deployed in AWS behind an L4 Network Load Balancer (NLB) rather than the default L7 Application Load Balancer (ALB), because the ALB does not support ingress on ports other than 80/443 and GitLab Shell requires port 22 to be open for SSH. Over time, we've noticed occasions when users get a 500 or another timeout-related error. Looking at the LB metrics, we see many RST packets being sent — Client Resets, Target Resets (from the target group), and LB Resets — meaning all three points are sources of RSTs. We have already configured the keepalives on the worker nodes and in the pods to fire in under 350s (the idle timeout AWS fixes for NLBs), so we're not sure what else could be causing these.
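The 350s constraint can be stated mechanically. A small sketch of the check we apply when picking keepalive values (the 350s figure is AWS's documented fixed idle timeout for NLB TCP flows; the helper name and the sample numbers are ours, for illustration):

```shell
#!/bin/sh
# AWS fixes the NLB idle timeout at 350s and silently drops idle flows;
# the next packet on a dropped flow is answered with an RST. A TCP
# keepalive probe refreshes the flow, so the first probe must fire
# before the 350s deadline.
nlb_keepalive_ok() {
  # $1 = net.ipv4.tcp_keepalive_time (seconds of idle before first probe)
  if [ "$1" -lt 350 ]; then
    echo "ok"
  else
    echo "too-late"
  fi
}

# The kind of value we use on worker nodes and in pods:
nlb_keepalive_ok 240     # -> ok
# The Linux default (7200s) leaves flows idle far past the NLB timeout:
nlb_keepalive_ok 7200    # -> too-late
```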
## Steps to reproduce

- Deploy GitLab via Kubernetes and expose the nginx-ingress using an NLB.
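For reproduction, the part that matters is provisioning an NLB for the bundled ingress. A minimal values fragment (extracted from our full configuration below; the annotation key is the standard in-tree AWS cloud-provider one):

```yaml
# Minimal override to get an NLB instead of the default Classic ELB.
nginx-ingress:
  controller:
    service:
      externalTrafficPolicy: Local
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
```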
## Configuration used

```yaml
gitlab:
  global:
    time_zone: "America/New_York"
    hosts:
      https: true
      domain: domain.com
      gitlab:
        name: app.domain.com
    ingress:
      enabled: false
      configureCertmanager: false
      tls:
        enabled: false
    smtp:
      enabled: true
      address: "email-smtp.us-east-1.amazonaws.com"
      port: 587
      domain: domain.com
      user_name: xxxxxxxxxxxx
      password:
        secret: smtp-password
        key: password
      starttls_auto: true
      tls: false
    email:
      from: notify@domain.com
      display_name: Gitlab Notifications
      reply_to: noreply@domain.com
      subject_suffix: "Haven Life"
    minio:
      enabled: false
    registry:
      bucket: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-registry
    appConfig:
      omniauth:
        enabled: true
        allowSingleSignOn: ["google_oauth2"]
        blockAutoCreatedUsers: true
        providers:
          - secret: gitlab-google-oauth2
            key: provider
      lfs:
        bucket: xxxxxxxxxxxxxx-lfs
        connection:
          secret: rails-storage
          key: connection
      artifacts:
        bucket: xxxxxxxxxxxxxx-artifacts
        connection:
          secret: rails-storage
          key: connection
      uploads:
        bucket: xxxxxxxxxxxxxx-uploads
        connection:
          secret: rails-storage
          key: connection
      packages:
        bucket: xxxxxxxxxxxxxx-packages
        connection:
          secret: rails-storage
          key: connection
      externalDiffs:
        enabled: true
        bucket: xxxxxxxxxxxxxx-externaldiffs
        connection:
          secret: rails-storage
          key: connection
      pseudonymizer:
        bucket: xxxxxxxxxxxxxx-pseudonymizer
        connection:
          secret: rails-storage
          key: connection
      backups:
        bucket: xxxxxxxxxxxxxx-backups
        tmpBucket: xxxxxxxxxxxxxx-tmp-backups
      objectStorage:
        backend: s3
    psql:
      host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      database: gitlab
      password:
        secret: postgresql-password
        key: postgres-password
    redis: &redis
      host: xxxxxxxxxxxxxx
      password:
        enabled: false
  # Effectively turn off kubernetes runners since you cannot disable the chart
  gitlab-runner:
    namespace: gitlab
    concurrent: 1
    runners:
      tags: kubernetes-runner
  redis:
    enabled: false
  postgresql:
    install: false
  nginx-ingress:
    enabled: true
    # NOTE: GitLab's Chart exposes port 22 by default
    controller:
      name: gitlab
      config:
        use-forwarded-headers: "true"
        client-header-timeout: "420"
        proxy-stream-timeout: "200s"
      headers:
        X-Forwarded-Ssl: "on"
      configMapNamespace: gitlab
      tcp:
        configMapNamespace: gitlab
      service:
        externalTrafficPolicy: Local
        enableHttp: false
        enableHttps: true
        targetPorts:
          https: http
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: nlb
          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
          external-dns.alpha.kubernetes.io/hostname: app.domain.com.
          # NOTE: the following are no-ops until upgrade to kube 1.15+, add in TLS / SSL specific annotations:
          service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "xxx"
          service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-2016-08"
          service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
  certmanager:
    install: false
  gitlab:
    gitlab:
      hostname: app.domain.com
    unicorn:
      ingress:
        enabled: true
        tls:
          enabled: false
      hpa:
        targetAverageValue: 250m
      annotations: &uni-reg-annot
        nginx.ingress.kubernetes.io/force-ssl-redirect: false
        nginx.ingress.kubernetes.io/enable-cors: true
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "45"
      service:
        type: NodePort
      replicaCount: 4
      redis:
        <<: *redis
    task-runner:
      persistence:
        enabled: true
        storageClass: efs-persist-task-runner-prod
        accessMode: ReadWriteMany
        subPath: backups
      resources:
        requests:
          memory: 512M
          cpu: 150m
      backups:
        objectStorage:
          config:
            secret: backups-storage
            key: config
        cron:
          enabled: true
          extraArgs: "--skip registry --skip lfs --skip artifacts --skip uploads --skip packages"
          persistence:
            enabled: true
            size: 48Gi
          schedule: "30 5 * * *"
    gitlab-shell:
      tcpExternalConfig: true
      service:
        type: NodePort
      replicaCount: 4
      annotations:
        <<: *uni-reg-annot
    gitaly:
      persistence:
        storageClass: efs-persist-gitaly-prod
        accessMode: ReadWriteMany
        subPath: data
    sidekiq:
      replicas: 6
      redis:
        <<: *redis
  registry:
    relativeurls: true
    hpa:
      cpu:
        targetAverageUtilization: 30
      minReplicas:
      maxReplicas:
    log:
      level: debug
      fields:
        service: registry
    ingress:
      enabled: true
      tls:
        enabled: false
      proxyReadTimeout: 45
      annotations:
        <<: *uni-reg-annot
    service:
      type: NodePort
    storage:
      secret: registry-storage
      key: config
```
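For anyone comparing against their own setup: the keepalive tuning mentioned in the summary also has an ingress-side knob. A sketch of the relevant controller ConfigMap keys (key names are from the upstream ingress-nginx ConfigMap documentation; whether `upstream-keepalive-timeout` is available depends on the controller version the chart bundles, and the values shown are illustrative, not from our running config):

```yaml
nginx-ingress:
  controller:
    config:
      # keep client-facing keepalive shorter than the NLB's fixed 350s idle timeout
      keep-alive: "240"
      # same idea for upstream connections, where the bundled controller supports it
      upstream-keepalive-timeout: "240"
```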
## Current behavior

The LB metrics show many resets coming from the client, the target group, and the LB itself, occasionally surfacing as 5xx or timeout errors for end users.
## Expected behavior

No resets in the LB metrics, and no 5xx or timeout errors for end users.
## Versions

- Chart: 2.6.7
- Platform:
  - Cloud: EKS
- Kubernetes: (`kubectl version`)
  - Client: 1.15.5
  - Server: 1.14
- Helm: (`helm version`)
  - Client: 2.15.1
  - Server: 2.15.1