Initial investigation into running NAT gateways in GCE
I spent a little time looking into what it would take to run our own NAT gateways in GCE instead of using Cloud NAT, depending on the outcome of https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9280.
Ansible for the NAT boxes: https://gitlab.com/craigfurman/infrastructure-playground/-/blob/master/ansible/gce_hat_nat.yml
Note that the repo I've linked is my infra-as-code sandbox repo and contains some other unrelated stuff.
This was based on https://cloud.google.com/vpc/docs/special-configurations#multiple-natgateways.
What I learned / what we still need to check
If I did not restrict the NAT gateway internet route to a list of instance tags, I was unable to SSH into the NAT boxes themselves. I don't know whether this would also have broken certain traffic to/from other machines with public IPs in that network, as there weren't any. I also didn't check whether this could be worked around using a bastion host. If we do need to tag most machines, this will require recreating machines that are created from instance templates.
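For reference, a tag-restricted route looks roughly like this, following the pattern in the GCP doc linked above. This is a sketch only; the network, instance, zone, tag, and priority values are placeholders:

```sh
# Route internet-bound traffic from instances tagged "no-ip" through one
# NAT gateway instance. All names here are placeholders.
gcloud compute routes create nat-route-gw1 \
  --network my-network \
  --destination-range 0.0.0.0/0 \
  --next-hop-instance nat-gateway-1 \
  --next-hop-instance-zone us-east1-b \
  --tags no-ip \
  --priority 800
```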
Simply running `curl https://api.ipify.org` (a service that responds with your public IP) demonstrated the ECMP load balancing at work: successive requests returned a different IP, one for each of the 2 gateways. This occurred even though the test machine was in the same zone as 1 of the 2 gateways. In production we'd run more than 1 gateway per zone.
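For illustration, the test was just a loop along these lines, run from a tagged instance behind the gateways:

```sh
# With two gateways behind equal-priority routes, the reported egress IP
# alternates between them across requests.
for i in $(seq 1 10); do curl -s https://api.ipify.org; echo; done
```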
I didn't bother setting up managed instance groups + health checks, but this is something we should do in production in order to replace faulty gateways automatically. I didn't check what happens to ECMP route selection when a route points at a faulty instance.
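A rough sketch of the pieces involved, assuming an instance template named nat-gateway-template exists and omitting the firewall rule the health-check probes would need. How recreated instances get wired back into the static routes is a detail we'd still have to work out:

```sh
# Health check on SSH as a basic liveness signal; names/zones are placeholders.
gcloud compute health-checks create tcp nat-health-check --port 22

# One managed instance group per zone, sized for redundancy.
gcloud compute instance-groups managed create nat-gateway-group-b \
  --zone us-east1-b \
  --size 2 \
  --template nat-gateway-template

# Recreate instances that fail the health check.
gcloud compute instance-groups managed set-autohealing nat-gateway-group-b \
  --zone us-east1-b \
  --health-check nat-health-check \
  --initial-delay 120
```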
This setup might have different saturation characteristics to Cloud NAT. Cloud NAT reserves a certain number of TCP and UDP ports per machine using it (docs, which interestingly appear to have gained a lot more info since I last read them). I am not knowledgeable about this, but my current mental model of iptables masquerade plus conntrack (the standard Linux NAT setup in GCE) is that there is no such reservation: a NAT'ed connection simply takes up an entry in the conntrack table. I don't know whether Linux can reuse ephemeral ports for multiple NAT connections to different destination addresses.
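For context, the Linux NAT gateway setup in the GCP doc linked above boils down to roughly this (assuming eth0 is the primary interface; the instance also has to be created with --can-ip-forward):

```sh
# Forward packets between interfaces and rewrite the source address of
# outbound traffic to the gateway's own address.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```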
One saturation point we should monitor is the conntrack table, which node exporter exposes (`node_nf_conntrack_entries / node_nf_conntrack_entries_limit`). I don't know about the viability of monitoring for ephemeral port exhaustion.
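For a quick check on the box itself, the raw counters behind those node exporter metrics are available under /proc:

```sh
cat /proc/sys/net/netfilter/nf_conntrack_count  # current entries
cat /proc/sys/net/netfilter/nf_conntrack_max    # table size limit
```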
We can also monitor bandwidth through the gateways: GCE caps egress at 2 Gb/s per vCPU.
Cloud NAT has a non-configurable 2 minute delay on reuse of the same {source host+port, dest host+port, protocol} tuple. I don't know if conntrack has similar properties, configurable or not, or if such a delay/keepalive is desirable.
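The closest conntrack knob I'm aware of (an assumption on my part, not verified) is the TCP TIME_WAIT timeout, which happens to default to 120s; whether it actually governs tuple reuse the way Cloud NAT's delay does is something we'd need to check:

```sh
# How long a closed TCP connection's entry lingers in the conntrack table.
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
```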
I only spent about an hour on this, as I didn't want to invest much time into something that probably won't be used, but wanted to jot down what I found for future reference.
Mainly of interest to @hphilipps and @ansdval.