Investigate NGINX Controller Pod Failures

❗ This is a moved issue that started as an incident ❗

See Incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5165

Sometimes newly created NGINX Controller Pods that come online throw a large number of "Address not available" errors when attempting to connect to the Service IP address associated with the Kubernetes Service object gitlab-webservice-api.
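For context, the "Address not available" string NGINX logs is the kernel errno EADDRNOTAVAIL surfaced through connect(2). A quick way to confirm the mapping (Python, on Linux):

```python
import errno
import os

# "Address not available" in the NGINX logs corresponds to the kernel's
# EADDRNOTAVAIL, returned by connect(2) when no ephemeral source port is
# available for the destination (see the connect(2) ERRORS section).
print(errno.EADDRNOTAVAIL)               # 99 on Linux
print(os.strerror(errno.EADDRNOTAVAIL))  # the kernel's message string
```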

Use this issue to investigate why this happens.

GCP Support Case: https://console.cloud.google.com/support/cases/detail/28560646?project=gitlab-production

Current State

2021-07-28

Lowered severity due to low impact: the issue occurs roughly 3 times per day. The last set of occurrences has not resulted in any pages to the on-call: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_637907890

This is still a kubernetes-migration-blocker for the web migration, however.

2021-07-30

A mitigation is in place for 3 of our production clusters. We will continue to monitor for recurrences of this issue on those clusters to ensure the viability of this mitigation: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_637810254

At the same time, we'll monitor gprd-us-east1-d for recurrences of this issue and observe any log detail from our initContainer.

2021-08-03

The prior mitigation has been removed. Recent investigations and information from GCP have shown that the networking is appropriately configured on the Pod at startup. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_640954832

A new mitigation is now in place which increases the number of ephemeral ports available to the Pod. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_640169447
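A minimal sketch of what such a change can look like (a hypothetical manifest for illustration; the actual change is in the linked note). net.ipv4.ip_local_port_range is a namespaced "safe" sysctl in Kubernetes, so it can be widened per Pod via the securityContext:

```yaml
# Illustrative sketch only -- names and image are placeholders.
# Widens the Pod's ephemeral source-port range; the sysctl is
# namespaced, so it applies only to this Pod's network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ingress-controller
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range
        value: "1024 65535"
  containers:
    - name: controller
      image: registry.k8s.io/ingress-nginx/controller:v1.0.0  # placeholder
```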

We should now continue monitoring for recurrences of this problem and see if we can learn anything from: delivery#1922 (closed)

We continue to work this alongside GCP support.

2021-08-05

The root cause has been settled on, as investigated in delivery#1922 (closed)

Will continue to monitor until Monday high load has passed.

The GCP support case and this issue are due to close on Monday if no changes or recurrences of this issue crop up.

Current Theories

  • Ruled out: Race condition between a newly created node's Calico networking and the NGINX Controller Pod: delivery#1921 (closed)
  • Connection loss between Calico Pods and Calico Typha - delivery#1905 (moved)
  • NGINX configuration tuning leading to poor behavior due to port exhaustion - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_640169447
  • Ruled out: Botched NGINX configuration - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_636668786
  • ➡ Traffic imbalance from HAProxy: delivery#1922 (closed) ⬅

Mitigation Strategies

  • Insert an initContainer with a sleep to slow down a newly created nginx controller from coming online with a potential bad network configuration - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_637810254
  • Enable Keepalive nginx ingress configuration - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_636668932
  • ➡ Increase ephemeral port range - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13882#note_640169447 ⬅
  • Ruled out: Resolve calico issues - delivery#1905 (moved)
  • Remove nginx in front of our services all-together - delivery#1924 (closed)
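The initContainer mitigation listed above (since removed, per the 2021-08-03 update) can be sketched roughly as follows; the names, image, and delay value are illustrative, and the actual change is in the linked note:

```yaml
# Illustrative sketch only -- not the exact manifest from the linked note.
# Delays the controller's start so the node's networking (Calico) has
# time to converge before the Pod begins accepting traffic.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ingress-controller
spec:
  initContainers:
    - name: wait-for-network
      image: busybox:1.36                       # placeholder image
      command: ["sh", "-c", "sleep 30"]         # illustrative delay
  containers:
    - name: controller
      image: registry.k8s.io/ingress-nginx/controller:v1.0.0  # placeholder
```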

Results

Traffic imbalance has been identified as the root cause. Details can be found in the issue description: delivery#1922 (closed). In summary:

We utilize an NGINX Ingress Controller to route our API traffic. The Service that accepts this traffic has an External Traffic Policy of Local and sits behind an internal Google load balancer (ILB). With this configuration, only nodes on which the Pods are running are registered with the ILB.

When this issue (incident) first occurred, we saw an influx of failed requests with NGINX reporting "Address not available". This is the Linux kernel reporting EADDRNOTAVAIL, which pointed towards ephemeral port exhaustion. It was happening because of an imbalance of traffic across the Pods, which in turn is due to how the ILB routes traffic: it balances per node, not per Pod.

When we scaled up a new node to schedule a Pod that didn't fit on the existing nodes, this 1 Pod would see as much traffic as an entire node's worth of Pods elsewhere. For example, if we had 9 Pods equally spread across 3 nodes, and the HPA wanted a new Pod which forced a 4th node to come online, the ILB would send 25% of the traffic to each of the 4 nodes. This seems fine until you look at the per-Pod split: 9 of the 10 Pods would each see about 8% of traffic, while the newest Pod would see 25% of all traffic. This Pod was effectively overloaded. The ultimate resolution was increasing the number of ephemeral ports available to that Pod.
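The arithmetic above can be sketched as follows (illustrative only, assuming the ILB splits traffic evenly per registered node and kube-proxy delivers only to local Pods):

```python
# Per-Pod traffic share under externalTrafficPolicy: Local.
# The ILB balances per *node*; kube-proxy then only delivers to Pods
# local to that node, so a fresh node with one Pod gets a full
# node-share of all traffic.

def per_pod_share(pods_per_node):
    """pods_per_node: list of Pod counts, one entry per backend node.
    Returns the fraction of total traffic each Pod receives."""
    node_share = 1.0 / len(pods_per_node)   # ILB splits evenly per node
    shares = []
    for count in pods_per_node:
        shares.extend([node_share / count] * count)
    return shares

# 3 nodes x 3 Pods each, then the HPA adds a 10th Pod on a fresh 4th node:
shares = per_pod_share([3, 3, 3, 1])
print([round(s * 100, 1) for s in shares])
# the 9 original Pods each get ~8.3%; the new Pod gets 25%
```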

An issue to see if we want to try and resolve the imbalance at the Pod level is open but currently not prioritized; there are a lot of caveats we would need to test further. delivery#1937 (closed)

Reference:

  • EADDRNOTAVAIL - https://man7.org/linux/man-pages/man2/connect.2.html#ERRORS
  • Traffic Imbalance Investigation - delivery#1922 (closed)
  • Ephemeral Port Expansion - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1075 (merged)
  • NGINX Explainer on Ephemeral Ports - https://www.nginx.com/blog/overcoming-ephemeral-port-exhaustion-nginx-plus/
Edited Aug 10, 2021 by John Skarbek