Load Balancing in front of Omnibus instances

Summary

When a load balancer is placed in front of multiple Omnibus machines/VMs, nginx accepts incoming connections too eagerly, even when the instance is too busy to process new requests. This is especially painful when load is not evenly distributed across the cluster (because some requests take much longer than others).

Steps to reproduce

Issue requests in this manner:

What is the current bug behavior?

In this setup, because nginx eagerly accepts connections, a request can be queued on a busy instance even when other instances have capacity to process it immediately. Ordinarily this isn't a serious problem; users only see that their requests are slow. In exceptional situations where GitLab takes more than a couple of seconds to respond (because of large repos or MRs), the frontmost load balancer decides that the request has taken too long and times it out.

In the worst case, the queued request is a health check probe. If the probes keep timing out for long enough, the instance is killed even though it is merely busy, not unhealthy.
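To make the failure mode concrete, here is a minimal simulation sketch of the behavior described above. It is not GitLab or nginx code. It assumes two instances with a single worker each, a balancer that assigns requests round robin, and a hypothetical 5-second health check timeout; the request timings are made up for illustration.

```ruby
# Each request is [arrival_time, service_time] in seconds.
# Assumptions (illustrative only): one worker per instance, and nginx
# accepts every connection immediately, so requests queue on the instance.
HEALTH_CHECK_TIMEOUT = 5.0 # hypothetical balancer setting

def eager_round_robin(requests, instance_count)
  free_at = Array.new(instance_count, 0.0) # when each worker is next idle
  requests.each_with_index.map do |(arrival, service), i|
    instance = i % instance_count          # balancer picks blindly
    start = [arrival, free_at[instance]].max
    free_at[instance] = start + service
    free_at[instance] - arrival            # latency seen by the balancer
  end
end

requests = [
  [0.0, 30.0], # slow request (e.g. a large MR), lands on instance 0
  [0.0, 0.1],  # quick request, lands on instance 1
  [1.0, 0.1],  # health check probe, lands on instance 0 behind the slow request
  [1.0, 0.1],  # quick request, lands on instance 1
]

latencies = eager_round_robin(requests, 2)
probe_timed_out = latencies[2] > HEALTH_CHECK_TIMEOUT

puts latencies.map { |l| l.round(2) }.inspect
puts "probe timed out: #{probe_timed_out}"
```

In this sketch the probe waits roughly 29 seconds behind the slow request even though instance 1 is idle for most of that time, which is the situation that gets a busy but healthy instance killed.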

What is the expected correct behavior?

Requests should be queued at the frontmost load balancer (ELB/HAProxy/whatever), so that it can make more intelligent decisions about which instance should handle a request and avoid instances that are overloaded.
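As a rough illustration of the expected behavior, the sketch below models a single FIFO queue held at the load balancer, with each request dispatched to whichever instance becomes free first. This is a simulation under simplifying assumptions (two instances, one worker each, made-up timings), not a description of how any particular balancer is implemented; a per-server connection cap such as HAProxy's `maxconn` on a `server` line achieves something comparable.

```ruby
# Each request is [arrival_time, service_time] in seconds, sorted by arrival.
# Assumption (illustrative only): one worker per instance, and the balancer
# holds requests itself until some instance is idle.
def central_queue(requests, instance_count)
  free_at = Array.new(instance_count, 0.0) # when each worker is next idle
  requests.map do |arrival, service|
    # Pick the instance that becomes idle earliest.
    instance = free_at.each_index.min_by { |i| free_at[i] }
    start = [arrival, free_at[instance]].max
    free_at[instance] = start + service
    free_at[instance] - arrival            # latency seen by the client
  end
end

requests = [
  [0.0, 30.0], # slow request (e.g. a large MR)
  [0.0, 0.1],  # quick request
  [1.0, 0.1],  # health check probe
  [1.0, 0.1],  # quick request
]

central_latencies = central_queue(requests, 2)
puts central_latencies.map { |l| l.round(2) }.inspect
```

With the same four requests, the slow one still takes 30 seconds, but the probe and the quick requests all finish in a fraction of a second because none of them is parked behind it.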

Relevant logs

None.

Details of package version

We're using gitlab-ee 11.6.3.

Provide the package version installation details

||/ Name                              Version               Architecture          Description
+++-=================================-=====================-=====================-=======================================================================
un  gitlab-ce                         <none>                <none>                (no description available)
ii  gitlab-ee                         11.6.3-ee.0           amd64                 GitLab Enterprise Edition (including NGINX, Postgres, Redis)

Environment details

Operating System: Ubuntu 16.04
Installation Target, remove incorrect values:
- VM: AWS
- In theory this would also affect other kinds of installs.
Installation Type, remove incorrect values:
- Should apply regardless of installation type.
Is there any other software running on the machine: Nothing exceptional.
Is this a single or multiple node installation? Multiple nodes.
Resources
- CPU: In our case, multiple c5.xlarge nodes (4 vCPUs each)
- Memory total: 8 GB RAM per node

Configuration details

nginx configurations:

nginx['listen_port'] = 80
nginx['listen_https'] = false
nginx['proxy_set_headers'] = {
  'Host' => '$http_host_with_default',
  'X-Forwarded-For' => '$proxy_add_x_forwarded_for',
  'X-Forwarded-Proto' => '$http_x_forwarded_proto',
  'X-Forwarded-Ssl' => 'on',
  'Upgrade' => '$http_upgrade',
  'Connection' => '$connection_upgrade'
}
nginx['real_ip_trusted_addresses'] = ['ip-range', 'ip-range']  # our VPC subnet ranges
nginx['real_ip_recursive'] = 'on'
nginx['custom_gitlab_server_config'] = <<-NGINX
    if ($http_x_forwarded_proto = 'http') {
        return 301 https://$host$request_uri;
    }
  NGINX

Edited Jan 14, 2019 by Joel Low