Investigate traffic balancing between HAProxy and NGINX Ingress Controllers
As can be seen in the following image:
PromQL:
sum by (pod) (
  rate(
    nginx_ingress_controller_requests:labeled{env="gprd", type="api", stage="main", cluster="gprd-us-east1-d"}[5m]
  )
)
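One way to pick out unusually busy Pods from the per-Pod rates this query returns is to compare each rate against the median. The sketch below is illustrative only: the `flag_hot_pods` helper, the 2x-median cutoff, and the sample rates are all assumptions, not values taken from this investigation.

```python
from statistics import median


def flag_hot_pods(rates, factor=2.0):
    """Return Pods whose request rate exceeds `factor` times the median rate.

    `rates` maps Pod name -> requests per second. The default `factor`
    is a hypothetical cutoff, not a threshold used in the investigation.
    """
    if not rates:
        return []
    cutoff = factor * median(rates.values())
    return sorted(pod for pod, rps in rates.items() if rps > cutoff)


# Hypothetical per-Pod rates (req/s), shaped like the query result per `pod` label.
sample = {"pod-a": 210.0, "pod-b": 190.0, "pod-c": 205.0, "pod-d": 640.0}
hot = flag_hot_pods(sample)
print(hot)
```

The median is used rather than the mean so that a single overloaded Pod does not pull the baseline up with it.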
Misbehaving Pods
The following Pods were misbehaving during the time period this investigation targets:
- gitlab-nginx-ingress-controller-7b77bd6754-98cmr - ~12:00 - 12:05 - on instance gke-gprd-us-east1-d-default-2-f2b6f604-fx69
- gitlab-nginx-ingress-controller-7b77bd6754-sc2n9 - ~13:00 - 13:04 - on instance gke-gprd-us-east1-d-default-2-f2b6f604-lvfx
- gitlab-nginx-ingress-controller-7b77bd6754-f967w - ~13:10 - 13:13 - on instance gke-gprd-us-east1-d-default-2-f2b6f604-5j0s
As highlighted by this chart (subset of the above):
It appears that traffic is not well balanced between HAProxy and the NGINX Ingress Controller Pods. Let's learn how GCP handles traffic and determine whether this is a problem at the GLB level or whether kube-proxy is responsible. Let's also look into the available configuration options to see if any tweaks could help prevent our Ingresses from being overloaded.
Milestones
- Determine if EVERY fresh Pod sees this traffic behavior
- Determine if every Pod and every NEW node sees this traffic behavior
- Investigate tuning the kube-proxy
- Investigate tuning the GLB
- Determine potential mitigation if any
Results
Traffic is imbalanced because we use an Internal Load Balancer (ILB) that is unaware of the number of Pods running on a given node. The ILB spreads traffic evenly across all nodes, so any new Pod on a new node may be subject to higher traffic than the rest of the Pods. Further analysis is gathered in this thread: #1922 (comment 643339784)
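The arithmetic behind this can be sketched with a small model. It assumes that traffic arriving at a node is shared only among the Pods on that node, which this issue does not state explicitly. The `per_pod_rps` helper, the Pod and node names, and the request rate are hypothetical.

```python
from collections import Counter


def per_pod_rps(total_rps, pod_to_node):
    """Estimate per-Pod request rate when traffic is split evenly across nodes.

    Assumption (not confirmed by the investigation): each node's share is
    divided equally among the Pods running on that node only.
    """
    pods_per_node = Counter(pod_to_node.values())
    per_node_rps = total_rps / len(pods_per_node)
    return {
        pod: per_node_rps / pods_per_node[node]
        for pod, node in pod_to_node.items()
    }


# Hypothetical placement: three Pods share node-1, one fresh Pod sits alone on node-2.
placement = {
    "pod-a": "node-1",
    "pod-b": "node-1",
    "pod-c": "node-1",
    "pod-d": "node-2",
}
shares = per_pod_rps(1200.0, placement)
for pod, rps in sorted(shares.items()):
    print(f"{pod}: {rps:.0f} req/s")
```

Under this model, the lone Pod on the new node receives its node's entire share, three times what each Pod on the shared node receives.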
As a result of this finding, an issue covering alternative options for relieving high pressure on new Pods has been opened for future work: #1937 (closed)
We also attempted to validate whether Pods receiving a much higher amount of traffic become stressed enough to slow traffic down. This was analyzed here: #1922 (comment 644132615), and no negative impact was found.

