
Cross-AZ network egress/ingress when migrating the Git fleet to Kubernetes

After creating diagrams of the network topology before and after the Git HTTPS transition to Kubernetes, it is clear that we are going to be billed more for cross-AZ network traffic. This could potentially be a significant cost increase, since the Git HTTPS service represents a large amount of ingress/egress for Git operations.

Current VM infrastructure

After the GKE migration

  • There are 3 points of cross-AZ network traffic, compared to 1 point when running on VMs
  • We are also introducing a new network connection between nginx and webservice (workhorse), where on VMs we use a unix socket. This connection is not encrypted; that is tracked in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1151
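To get a feel for what going from 1 to 3 cross-AZ points could mean on the bill, here is a rough sketch. The per-GB rate and the monthly traffic volume are assumptions for illustration, not figures from this issue, and the model treats every byte as crossing a zone at each point, so it is an upper bound.

```python
# Rough upper-bound model: cross-AZ cost scales with the number of
# points where traffic crosses a zone boundary.

# Assumption: inter-zone rate in USD per GB; check current GCP pricing.
CROSS_AZ_RATE_USD_PER_GB = 0.01


def monthly_cross_az_cost(traffic_gb, cross_az_points, rate=CROSS_AZ_RATE_USD_PER_GB):
    """Cost if all traffic crosses a zone at every cross-AZ point."""
    return traffic_gb * cross_az_points * rate


# Hypothetical monthly Git HTTPS traffic, for illustration only.
traffic_gb = 100_000

vm_cost = monthly_cross_az_cost(traffic_gb, cross_az_points=1)
gke_cost = monthly_cross_az_cost(traffic_gb, cross_az_points=3)

print(f"VMs: ${vm_cost:,.2f}/month")
print(f"GKE: ${gke_cost:,.2f}/month")
print(f"Increase: {gke_cost / vm_cost:.1f}x")
```

In practice only a fraction of requests will cross a zone at each point, so the real multiplier depends on how traffic is balanced across zones.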

Mitigations

Option 1: Regional cluster with NGINX sidecar

One way to reduce the cross-AZ network traffic is to move the NGINX ingress into the webservice pod: gitlab-org/charts/gitlab#2264 (closed)

Option 2: Split the regional cluster into 3 zonal clusters

If we switch from a regional cluster to multiple zonal clusters, we can have much more control over cross-AZ network traffic.

Downsides:

  • We would need to migrate to this new configuration and do a little bit of refactoring in our Terraform config
  • We currently set maxReplica=1 for database-throttled, which handles database migrations, so we would need to figure out how to keep a single pod across three different clusters. We currently don't have hierarchical configuration in helmfile that spans multiple environments, and we would probably use a single helmfile environment for the three zones
  • We could possibly have a situation where some zones are upgraded and others are not

Benefits:

  • Lower cost due to reduced cross-AZ network traffic
  • Safer cluster upgrades
  • This makes zonal outages much clearer for us, and we could eventually extend this to multiple regions
  • One caveat: higher cluster cost due to going from 1 cluster to 3. Pricing is $0.10 per cluster per hour, so I don't think this is significant.
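As a quick sanity check on the cluster fee, the arithmetic can be sketched as below. The $0.10 per cluster per hour figure comes from this issue; the 730 hours per month is an assumed average, and the function name is only illustrative.

```python
# Extra fee from running 3 clusters instead of 1.

FEE_USD_PER_CLUSTER_HOUR = 0.10  # from the issue description
HOURS_PER_MONTH = 730            # assumption: average month length


def monthly_cluster_fee(clusters, fee=FEE_USD_PER_CLUSTER_HOUR, hours=HOURS_PER_MONTH):
    """Per-cluster hourly fee summed over one month for all clusters."""
    return clusters * fee * hours


extra = monthly_cluster_fee(3) - monthly_cluster_fee(1)
print(f"Extra fee for 3 clusters vs 1: ${extra:.2f}/month")
```

Under these assumptions the two additional clusters come to roughly $146 per month.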
Edited by John Jarvis