Decide what to do about NAT in production
NAT background
We put most gprd infrastructure behind a Cloud NAT gateway. Cloud NAT takes a pool of IP addresses, each of which provides 65536 TCP (and as many UDP) ports for outbound connections. As a first approximation, NAT ports are reserved for individual VMs based on the `min_ports_per_vm` setting (it's a little more complicated than that - see the background link below). A NAT port is required for each concurrent connection from a VM to the same destination address on the same protocol; concurrent connections to different destinations can reuse the same NAT port. The more concurrent connections a VM must make to a single destination, the more NAT ports that VM needs, and the more NAT IPs we need overall.
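As a rough mental model of that accounting, here's a small Python sketch; the function and the example numbers are illustrative only, and the real reservation behaviour is as described in the background link below:

```python
from collections import Counter

def required_nat_ports(concurrent_connections):
    """Rough estimate of the NAT source ports one VM needs.

    `concurrent_connections` is an iterable of (destination_address, protocol)
    pairs, one per connection open at the same time. Connections to the same
    destination each need their own NAT port; connections to different
    destinations can share ports, so demand is driven by the busiest destination.
    """
    per_destination = Counter(concurrent_connections)
    return max(per_destination.values(), default=0)

# Example: 300 concurrent connections to one object-storage endpoint dominate,
# even if the VM talks to many other destinations at the same time.
conns = [("storage.googleapis.com", "tcp")] * 300 + [("smtp.example.com", "tcp")] * 5
print(required_nat_ports(conns))  # 300 -> min_ports_per_vm must be at least this
```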
We are currently using 14 of the 16 contiguous static IPs provisioned by Google in our project for the NAT gateway. We advertise this range here: https://docs.gitlab.com/ee/user/gitlab_com/#ip-range. We are using 2048 NAT ports per VM; until recently we were using 1024 ports per VM, with 7 IPs.
(65536 / 2048) * 14 = 448: we have capacity for 448 VMs behind the NAT, and currently have about 280.
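The same arithmetic as a small sketch, using the numbers above (the per-IP port figure is the one used in this issue; the background link has the exact accounting):

```python
PORTS_PER_NAT_IP = 65536   # per-IP figure used in this issue; see the background link
MIN_PORTS_PER_VM = 2048    # current min_ports_per_vm setting
NAT_IPS = 14               # static IPs currently assigned to the gateway

vm_capacity = (PORTS_PER_NAT_IP // MIN_PORTS_PER_VM) * NAT_IPS
print(vm_capacity)         # 448 VMs behind the NAT
print(vm_capacity - 280)   # 168: rough headroom at the current fleet size
```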
Problem
A Cloud NAT translation error is analogous to a dropped IP packet. If the baseline rate is sufficiently low, then higher-layer protocols such as TCP should compensate for the unreliability, and so these errors are not 1:1 with user-visible errors. We do see a non-zero baseline error rate most of the time: https://dashboards.gitlab.net/d/nat-main/nat-cloud-nat?orgId=1&refresh=30s&from=now-3h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-gateway=gitlab-gke
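For quantifying that baseline outside the dashboard, here's a minimal sketch that reads the Cloud NAT drop counters straight from Cloud Monitoring; it assumes the `google-cloud-monitoring` client library, and the project ID and lookback window are placeholders:

```python
# Minimal sketch: read Cloud NAT dropped-packet counts from Cloud Monitoring.
# "gitlab-production" is a placeholder project ID.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": "projects/gitlab-production",
        "filter": 'metric.type = "router.googleapis.com/nat/dropped_sent_packets_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    # The "reason" label distinguishes OUT_OF_RESOURCES (port exhaustion)
    # from endpoint-independence conflicts.
    reason = series.metric.labels.get("reason", "unknown")
    total = sum(point.value.int64_value for point in series.points)
    print(f"{reason}: {total} dropped packets in the last hour")
```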
When Cloud NAT was originally rolled out in production (&97), we did not observe any user-facing errors attributed to it. Recently we've tightened our queue-specific error SLO alerting and have attributed an incident to Cloud NAT errors (production#2309 (closed)).
Depending on the outcome of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10627, we might have to scale up the NAT IP count - but we can't, because we are already using almost all of the contiguous range provisioned by Google in our project. That issue may also conclude that we should funnel certain outbound connections through proxies, in which case this issue becomes less relevant.
Options
Some of these options are mutually exclusive; some are not.
- Reduce the demand for NAT ports, e.g. by funnelling outbound connections through proxies or by tuning keepalive settings so connections are reused (see the sketch after this list): follow-ons to https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10627
- Ask Google to move our unused, and rather large, contiguous range in CI (https://console.cloud.google.com/networking/addresses/list?project=gitlab-ci-155816) to the production project (or move part of that range). We are not planning to roll out Cloud NAT in CI any time soon. If we want to go down this road, we should start asking now - it might take a while to fulfil. We'd also need to update the advertised range in the docs, and I'm not sure how much notice we'd have to give customers so they can tweak their firewalls.
- Roll back Cloud NAT in production and use public IPs. This would undo &97 and would potentially upset customers.
- Investigate the use of a self-managed GCE NAT instance instead of Cloud NAT (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9336). This may or may not help us.
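On the first option above: per-VM port demand is driven by concurrent connections to the same destination, so anything that reuses connections (HTTP keepalive, connection pooling, or a shared egress proxy) directly shrinks the per-VM NAT footprint. A minimal illustration of the idea with a pooled `requests` session - the pool sizes and URL are illustrative, not a recommendation:

```python
# Illustration only: a pooled, keepalive-enabled HTTP session reuses a handful
# of TCP connections for many requests, instead of burning a fresh source port
# (and therefore a NAT port) per request.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Keep up to 20 reusable connections per host in the pool.
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=20))

for _ in range(100):
    # 100 sequential requests share one keepalive connection to this host.
    session.get("https://storage.googleapis.com/")
```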
Background: https://cloud.google.com/nat/docs/ports-and-addresses#ports
For context, @hphilipps and I worked on the initial deployment of Cloud NAT: &97
@dawsmith is this ~"team::Core-Infra"?
cc @AnthonySandoval @hphilipps