Investigate number of concurrent connections made by some of our services to the same external addresses
We put most gprd infrastructure behind a Cloud NAT gateway. Cloud NAT takes a pool of IP addresses, each of which provides 64,512 usable TCP source ports (and the same number of UDP ports) for outbound connections, i.e. 65,536 minus the 1,024 well-known ports. As a first approximation, NAT ports are reserved for individual VMs based on the min_ports_per_vm setting (it's a little more complicated than that; see the background link). A VM needs a separate NAT port for each concurrent connection to the same destination (IP address, port, and protocol). Concurrent connections to different destinations can reuse the same NAT port. The more concurrent connections a VM makes to a single destination, the more NAT ports that VM needs, and the more NAT IPs we need overall.
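To make the relationship between min_ports_per_vm and NAT IPs concrete, here is a rough back-of-the-envelope sketch. It assumes static port allocation, that every VM reserves exactly min_ports_per_vm ports, and 64,512 usable ports per NAT IP per protocol (65,536 minus the 1,024 well-known ports). The function names and the fleet sizes in the usage lines are made up for illustration; they are not measurements from gprd.

```python
import math

# Assumption: usable source ports per NAT IP per protocol (65536 - 1024).
USABLE_PORTS_PER_NAT_IP = 64512


def max_vms_per_nat_ip(min_ports_per_vm, ports_per_ip=USABLE_PORTS_PER_NAT_IP):
    """How many VMs one NAT IP can serve if each VM reserves min_ports_per_vm."""
    return ports_per_ip // min_ports_per_vm


def nat_ips_needed(vm_count, min_ports_per_vm, ports_per_ip=USABLE_PORTS_PER_NAT_IP):
    """First-approximation count of NAT IPs needed for vm_count VMs."""
    return math.ceil(vm_count / max_vms_per_nat_ip(min_ports_per_vm, ports_per_ip))


if __name__ == "__main__":
    # Hypothetical fleet sizes, for illustration only.
    for ports in (64, 256, 1024):
        print(ports, max_vms_per_nat_ip(ports), nat_ips_needed(800, ports))
```

Raising min_ports_per_vm lets each VM hold more concurrent connections to one destination, but it cuts the number of VMs each NAT IP can serve, which is the trade-off this issue is about.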
We are currently using 14 of the 16 contiguous static IPs provisioned by Google in our project for the NAT gateway. We advertise this range here: https://docs.gitlab.com/ee/user/gitlab_com/#ip-range.
We should investigate the number of concurrent connections to unique destinations from our VMs. A good starting point would be the sidekiq-catchall VM fleet, as we've seen NAT errors that look as if they originate from the mailer queue (production#2309 (closed)). The output of this issue should be recommendations on minimum NAT ports per VM, and/or follow-on issues to reduce this demand (e.g. by using proxies to funnel outbound traffic through fewer source VMs).
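One possible way to get a first measurement on a single VM is to count established TCP connections per destination from the output of `ss -tn`. The sketch below assumes the usual column layout of that command (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port) and only counts connections in the ESTAB state, so it will under-count ports still held by connections that are opening or closing. The function name and the sample addresses are hypothetical.

```python
from collections import Counter


def count_connections_by_destination(ss_output):
    """Count ESTAB connections per peer address:port in `ss -tn` style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an established connection.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        peer = fields[4]  # "Peer Address:Port" column
        counts[peer] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample; on a VM this text would come from running `ss -tn`.
    sample = """State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0  0  10.0.0.5:44312  203.0.113.10:25
ESTAB  0  0  10.0.0.5:44313  203.0.113.10:25
ESTAB  0  0  10.0.0.5:51000  198.51.100.7:443
"""
    for peer, n in count_connections_by_destination(sample).most_common():
        print(peer, n)
```

The highest count for any single destination is the figure to compare against min_ports_per_vm; sampling it over time across the fleet would give a better picture than a single snapshot.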
Background: https://cloud.google.com/nat/docs/ports-and-addresses#ports
@AnthonySandoval @dawsmith I think this is quite an urgent corrective action for production#2309 (closed), and we should assign people to it soon. I'm not sure whether it's o11y or core infra, so I labelled this reliability.