Initial investigation into running NAT gateways in GCE
I spent a little time looking into what it would take to run our own NAT gateways in GCE instead of using Cloud NAT, depending on the outcome of https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9280.
Ansible for the NAT boxes: https://gitlab.com/craigfurman/infrastructure-playground/-/blob/master/ansible/gce_hat_nat.yml
Note that the repo I've linked is my infra-as-code sandbox repo and contains some other unrelated stuff.
This was based on https://cloud.google.com/vpc/docs/special-configurations#multiple-natgateways.
What I learned / what we still need to check
If I did not restrict the NAT gateway internet route to a list of instance tags, I was unable to SSH into the NAT boxes themselves. I don't know whether this would also have broken certain traffic to/from other machines with public IPs in that network, as there weren't any. I also didn't check whether this could be worked around using a bastion host. If we do need to tag most machines, this will require recreating machines that are created from instance templates.
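For reference, a tag-restricted route looks roughly like this, following the pattern in the GCP doc linked above. This is a sketch only; the network, instance, zone, tag, and priority values are placeholders:

```sh
# Route internet-bound traffic from instances tagged "no-ip" through one
# NAT gateway instance. All names here are placeholders.
gcloud compute routes create nat-route-gw1 \
  --network my-network \
  --destination-range 0.0.0.0/0 \
  --next-hop-instance nat-gateway-1 \
  --next-hop-instance-zone us-east1-b \
  --tags no-ip \
  --priority 800
```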
Simply running `curl https://api.ipify.org` (a service that responds with your public IP) demonstrated the ECMP load balancing at work: successive requests returned a different IP, one for each of the 2 gateways. This occurred even though the test machine was in the same zone as 1 of the 2 gateways. In production we'd run more than 1 gateway per zone.
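For illustration, the test was just a loop along these lines, run from a tagged instance behind the gateways:

```sh
# With two gateways behind equal-priority routes, the reported egress IP
# alternates between them across requests.
for i in $(seq 1 10); do curl -s https://api.ipify.org; echo; done
```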
I didn't bother setting up managed instance groups + health checks, but this is something we should do in production in order to replace faulty gateways automatically. I didn't check what happens to ECMP route selection when a route points at a faulty instance.
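A rough sketch of the pieces involved, assuming an instance template named nat-gateway-template exists and omitting the firewall rule the health-check probes would need. How recreated instances get wired back into the static routes is a detail we'd still have to work out:

```sh
# Health check on SSH as a basic liveness signal; names/zones are placeholders.
gcloud compute health-checks create tcp nat-health-check --port 22

# One managed instance group per zone, sized for redundancy.
gcloud compute instance-groups managed create nat-gateway-group-b \
  --zone us-east1-b \
  --size 2 \
  --template nat-gateway-template

# Recreate instances that fail the health check.
gcloud compute instance-groups managed set-autohealing nat-gateway-group-b \
  --zone us-east1-b \
  --health-check nat-health-check \
  --initial-delay 120
```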
This setup might have different saturation characteristics to Cloud NAT. Cloud NAT reserves a certain number of TCP and UDP ports per machine using it (docs, which interestingly appear to have gained a lot more info since I last read them). I am not knowledgeable about this, but my current mental model of iptables masquerade plus conntrack (the standard Linux NAT setup in GCE) is that there is no such reservation: a NAT'ed connection simply takes up an entry in the conntrack table. I don't know whether Linux can reuse ephemeral ports for multiple NAT connections to different destination addresses.
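For context, the Linux NAT gateway setup in the GCP doc linked above boils down to roughly this (assuming eth0 is the primary interface; the instance also has to be created with --can-ip-forward):

```sh
# Forward packets between interfaces and rewrite the source address of
# outbound traffic to the gateway's own address.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```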
One saturation point we should monitor is the conntrack table, which node exporter exposes (`node_nf_conntrack_entries / node_nf_conntrack_entries_limit`). I don't know about the viability of monitoring for ephemeral port exhaustion.
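For a quick check on the box itself, the raw counters behind those node exporter metrics are available under /proc:

```sh
cat /proc/sys/net/netfilter/nf_conntrack_count  # current entries
cat /proc/sys/net/netfilter/nf_conntrack_max    # table size limit
```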
We can also monitor bandwidth through the gateways: GCE caps egress at 2 Gb/s per vCPU.
Cloud NAT has a non-configurable 2 minute delay on reuse of the same {source host+port, dest host+port, protocol} tuple. I don't know if conntrack has similar properties, configurable or not, or if such a delay/keepalive is desirable.
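The closest conntrack knob I'm aware of (an assumption on my part, not verified) is the TCP TIME_WAIT timeout, which happens to default to 120s; whether it actually governs tuple reuse the way Cloud NAT's delay does is something we'd need to check:

```sh
# How long a closed TCP connection's entry lingers in the conntrack table.
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
```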
I only spent about an hour on this, as I didn't want to invest much time into something that probably won't be used, but wanted to jot down what I found for future reference.
Mainly of interest to @hphilipps and @ansdval.