Consider the current state of nginx in Gitlab and define a roadmap moving forward
Problem
The Gitlab.com environment differs from other Gitlab installations in that we have a haproxy layer in front of our Kubernetes deployments, which serves a similar function to the Kubernetes ingress typically deployed in a Kubernetes installation of Gitlab. During the migration of Gitlab.com from VMs to Kubernetes, we have been able to avoid using Kubernetes ingress objects, as they aren't needed in our setup and add an extra, unneeded layer that has caused us issues.
With the migration of the API service, and the upcoming migration of the web service, we identified that in omnibus installations these services have nginx in front of them and use nginx-specific configuration settings, which we determined might be needed for the health of Gitlab.com.
To move what was deployed on the api VMs to Kubernetes with the quickest possible iteration, we leveraged ingress-nginx to place an nginx installation back in front of our webservice pods, and then used some of its implementation-specific configuration options to mimic the nginx setup we had on VMs.
While this solution works, it technically abuses the Kubernetes ingress specification (which is designed to simply route traffic and be implementation agnostic) by tying us to a specific ingress implementation (ingress-nginx). It also relies on an implementation detail of ingress-nginx (we create multiple ingress objects for the same host, and they are "merged" together in a way that conveniently works for us) that might change in the future, breaking what currently works.
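For illustration, the coupling described above looks roughly like the following. This is a hedged sketch rather than our actual chart output: the resource names, paths, and annotation values are placeholders, but the pattern is the same, with multiple ingress objects for the same host carrying nginx.ingress.kubernetes.io/* annotations and only behaving as intended because ingress-nginx merges them into one server configuration.

```yaml
# Illustrative sketch only: two ingress objects for the same host, each carrying
# ingress-nginx specific annotations. Names, paths, and values are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-webservice-api
  annotations:
    # Only ingress-nginx understands these; other controllers silently ignore them.
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: gitlab.example.com
      http:
        paths:
          - path: /api/
            pathType: Prefix
            backend:
              service:
                name: gitlab-webservice-api
                port:
                  number: 8181
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-webservice-web
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: gitlab.example.com   # same host: ingress-nginx "merges" these rules
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitlab-webservice-web
                port:
                  number: 8181
```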
In addition to this, it raises the question of the current status of nginx in both Omnibus and Kubernetes installations. We see that nginx might provide some value in the stack; however, there is no guarantee that customers deploying Gitlab in Kubernetes will use ingress-nginx (they might use a different implementation, such as one from their cloud provider), and even if they do use ingress-nginx, they aren't using the same configurations we currently believe are needed for Gitlab.com.
Next steps
- For Gitlab.com, we continue to avoid using Kubernetes ingress objects as much as possible, instead having our haproxy VM layer serve as our "ingress". In the future, we will likely migrate the haproxy service to Kubernetes; at that point, we can leverage Kubernetes ingress objects and decide on an ingress implementation that makes the most sense. Our current setup is not a blocker for the migration of web to Kubernetes (though this might turn out to be wrong later), but it will likely be changed/improved when we migrate our haproxy layer to Kubernetes.
- Get a full review of the nginx configuration currently in our omnibus installations, especially the settings for specific URL paths, and determine whether they are still needed for all installations or only for Gitlab.com. We should also determine whether they are only relevant when nginx is in use.
- If any configurations are identified as needed on all installations of Gitlab, determine the optimal path for making sure they exist in Kubernetes installations of Gitlab, considering that there is no guarantee a customer is using ingress-nginx in their environment (and thus no guarantee of nginx being in their environment at all). A sketch of what an implementation-agnostic ingress could look like follows this list.
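As a rough illustration of the end state we would be aiming for if those configurations live elsewhere in the stack, the ingress object our chart ships could be completely implementation agnostic. This is a hedged sketch only; the resource name, host, service name, and port below are placeholders rather than our chart's actual output.

```yaml
# Hypothetical, implementation-agnostic ingress: no controller-specific
# annotations, so any conformant ingress controller (cloud provider or
# otherwise) can satisfy it. Names, host, and port are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-webservice
spec:
  rules:
    - host: gitlab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitlab-webservice
                port:
                  number: 8181
```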
Possible solutions if nginx configurations are needed
Option 1: Removing ingress-nginx and moving the configurations into haproxy (this solves the problem for Gitlab.com only)
While we are trying not to push more complexity or settings into haproxy (as we know we want to migrate it to Kubernetes and potentially change it), the settings themselves are related to ingress, i.e. how traffic moves from clients to different services, and we wish to control them per URL path. This means our ingress layer (which we conceptually agree is our haproxy VMs) is the appropriate place to implement this, and when we migrate haproxy later, we can re-evaluate these configurations. This could be a short-term solution.
Option 2: Removing ingress-nginx and moving the configurations downstream into workhorse
If we determine that the configurations are needed on all installations, regardless of VM or Kubernetes and regardless of ingress controller, then we should put them in a place in our stack that we have full control over and that is not dependent on a specific ingress implementation (ingress-nginx). As workhorse is the next component behind the ingress layer, making workhorse perform what nginx did previously makes the most sense.
Option 3: Removing ingress-nginx and making our webservice pods run a custom nginx container/configuration
This mimics how omnibus installations currently work, where we place nginx in front of every workhorse install with the configuration settings we specifically need. It is similar to option two (and has the benefit of resembling omnibus while not requiring coding work on workhorse); however, it increases resource utilisation, especially for small environments, and can be confusing for users who do use ingress-nginx as their ingress controller (as there will then be two nginx instances inside their installation).
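A rough sketch of what this could look like follows. It is not a worked-out design: the container names, images, port numbers, and the idea of mounting the nginx settings from a ConfigMap are all placeholders/assumptions.

```yaml
# Hypothetical webservice pod with an nginx sidecar in front of workhorse.
# Images, ports, and the ConfigMap contents are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: webservice-nginx-config
data:
  nginx.conf: |
    # Only the Gitlab-specific settings we actually need (e.g. proxy buffering
    # for specific URL paths) would live here.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitlab-webservice
  template:
    metadata:
      labels:
        app: gitlab-webservice
    spec:
      containers:
        - name: nginx            # sidecar terminating traffic before workhorse
          image: nginx:1.25      # placeholder image/tag
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
        - name: workhorse
          image: gitlab-workhorse:placeholder   # placeholder image reference
          ports:
            - containerPort: 8181
      volumes:
        - name: nginx-config
          configMap:
            name: webservice-nginx-config
```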
Old Text:
As discovered in #1731 (closed), our nginx configurations do not currently allow us to specify and route traffic based on fleets of services as we do with haproxy. This led to a configuration error where healthchecks for our API service were going to the GIT service instead. Prior to migrating the web fleet, we need to solve this problem, because what will soon change is that all healthchecks will go to the API service.
This is a blocker for migrating our web fleets into Kubernetes, as no traffic would end up going to any web deployment with the way nginx is currently configured.
From @ggillies in gitlab-com/gl-infra/k8s-workloads/gitlab-com!856 (comment 573167148):
So I have reviewed this and from the perspective of being the quickest solution to get a fix in place, it makes sense.
However I am not sure how I feel about the overall approach, and we might need to take a step back and think about our overall strategy with ingress, nginx, different fleets, etc.
The problem is we are twisting the ingress specification slightly beyond what it was designed for, which was simple routing of traffic to services, and focusing too much on one specific implementation of the ingress specification (ingress-nginx).
The reality is we shouldn't be relying on a specific implementation of the ingress specification at all. A majority of customers should be able to use the built-in ingress controller from their cloud/platform provider (thus requiring nothing extra) to implement certs, domain and path routing etc. (so the ingress objects in our chart are simple, with no implementation-specific annotations needed), and if we require a specific piece of software in front of our pods to do specific configurations (e.g. proxy buffering), then we add our own deployment of nginx between the ingress implementation and the pods, or have it as a small container as part of the webservice pod spec.
Slightly related, our current implementation is also lacking in that we split everything out into different webservice pods (git-https, api, web), but then bind them all behind a single nginx deployment (well, we don't for gitlab.com, but we do for the upstream chart, though I guess the upstream chart has no separate fleets out of the box). This still means we essentially have a single bottleneck/point of failure that everything goes through (if you do use ingress-nginx for multiple fleets). This is why, when I deployed kas, I used the GKE inbuilt ingress: kas connections are long-lived (potentially forever), so I didn't want to tie up ingress-nginx resources (shared by other services) with kas connections. This means kas is entirely isolated from impacting other workloads, and because I pushed to have any http/s settings moved into kas itself, we can use any ingress implementation we want (which is the whole point of the ingress specification!).
A few options we have for correcting this long term are:
- Make it so that when you deploy multiple different instances of webservice pods, you also deploy an ingress-nginx controller for those pods and those pods only. This wouldn't really work for environments without our haproxy layer, so it might be something we have to do ourselves? This would give us independent nginx pods for each of our services and allow us to do haproxy healthchecks correctly.
- Move away from ingress-nginx altogether. Make nginx an actual deployment object, with whatever configurations we need, and place a generic ingress object in front of it, with the webservice pods behind it as needed. This is how the ingress specification is meant to work.
- Similar to the above, move nginx to a container as part of each webservice pod. This does mean you might be running more nginx containers than necessary (1 per pod), but that's what we did back when we had VMs, and it simplifies the stack somewhat. This would also greatly improve the ability to debug (traffic from nginx, to workhorse, to puma/rails would all be within a single pod on the same host). Then the ingress object becomes simple and implementation-independent.
- My preferred idea: moving all the specific configuration we actually need from nginx (proxy buffering etc.) into workhorse, where we can control it all in our own codebase, removing ingress-nginx altogether, and keeping a very simple ingress in front of everything. This ingress can then just leverage the cloud provider ingress, or optionally people can install their own (and we could still provide ingress-nginx for that case, but no annotations would be needed at all). For Gitlab.com we could just remove the ingress while we have our haproxy layer in front (so healthchecks go to the right services), and if we choose to remove haproxy later, bring the ingress back.
All of this is definitely out of scope for now, but I suspect we may need to talk about/have a strategy for this once the web migration happens, as knowing our luck it's going to have its own set of specific requirements that will make the gitlab-extras release even more convoluted.