
Terminate TLS at the edge for subdomain connections

Summary

We are seeing cases where internal traffic is hitting TLS rate limits on pages. Instead of introducing bypass options, we should consider eliminating the current TLS connection bottleneck.

Problem

This came up in the context of #681 (closed) and gitlab#381312 (closed).

We are seeing cases where internal traffic is hitting TLS rate limits on pages. We introduced the TLS rate limit because TLS handshakes can consume a lot of CPU time on pages pods, and the limit lets us throttle connection establishment.

We need to terminate TLS on the pods in order to support custom domains. We perform TCP-level proxying all the way to the pods, and the pods then terminate the TLS connection.

This leaves us at the mercy of clients who might establish many short-lived connections, which increases the work we need to do on the TLS stack. That is the reason why we rate-limit incoming connections.

This limit is difficult to calibrate, however, since many clients may share a single NATed IP and therefore count against the same limit.
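To make the calibration problem concrete, here is a minimal sketch of a token-bucket limiter, assuming the limit is keyed on source IP (as the NAT concern suggests). The class name, the rate and burst values, and the injectable clock are all hypothetical and are not taken from the pages implementation.

```python
import time
from collections import defaultdict


class PerIPTokenBucket:
    """Token-bucket limiter keyed by source IP (illustrative, not the pages code)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second (assumed value)
        self.burst = burst            # bucket capacity (assumed value)
        self.clock = clock            # injectable so the example is deterministic
        # Every source IP starts with a full bucket.
        self._tokens = defaultdict(lambda: float(burst))
        self._last = {}

    def allow(self, source_ip):
        """Return True if a new TLS handshake from source_ip is admitted."""
        now = self.clock()
        last = self._last.get(source_ip, now)
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, self._tokens[source_ip] + (now - last) * self.rate)
        self._last[source_ip] = now
        if tokens >= 1:
            self._tokens[source_ip] = tokens - 1
            return True
        self._tokens[source_ip] = tokens
        return False


def count_allowed(limiter, source_ip, attempts):
    """Count how many of `attempts` back-to-back handshakes are admitted."""
    return sum(limiter.allow(source_ip) for _ in range(attempts))
```

With a burst of 5 and a frozen clock, a single client opening five connections from its own address is fully admitted, while ten clients behind one NATed address opening one connection each only get five handshakes through. The limiter cannot tell the two situations apart, which is why any single threshold is either too loose for one or too strict for the other.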

Proposal

We may be able to optimize the network path and offload TLS termination away from the pods in a large set of cases.

In the custom domain case this is not possible (at least not without custom domain support at the Cloudflare or GCP level), because the application needs to fetch the TLS cert for each domain. But for the wildcard subdomain case (*.gitlab.io) we can provision a wildcard cert at the load balancer and offload TLS termination there.

The load balancer (possibly at the edge) will handle incoming TLS connections at the frontend but maintain a pool of long-lived TLS connections to pages at the backend.

This alleviates the bottleneck of TLS handshakes and may allow us to eliminate the rate limit entirely in the wildcard subdomain case.
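As a rough illustration of why a pool of long-lived backend connections removes the handshake bottleneck, the toy model below counts the handshakes the pods would perform when the load balancer reuses connections. It is a sketch under assumptions: the `BackendPool` class, the pool size, and the batch-style concurrency are hypothetical and do not describe how Cloudflare or the GCP load balancer actually manage backend connections.

```python
class BackendPool:
    """Toy model of a load balancer's pool of backend TLS connections."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.idle = 0          # established connections waiting for reuse
        self.in_use = 0        # connections currently carrying a request
        self.handshakes = 0    # TLS handshakes the pods had to perform

    def acquire(self):
        if self.idle > 0:
            self.idle -= 1             # reuse: no new handshake on the pod
        elif self.in_use < self.max_size:
            self.handshakes += 1       # open a new backend connection
        else:
            raise RuntimeError("pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1
        self.idle += 1                 # keep the connection open for reuse


def pod_handshakes(client_connections, concurrency, pool_size):
    """Pod-side handshakes when each short-lived client connection sends one request."""
    pool = BackendPool(pool_size)
    remaining = client_connections
    while remaining > 0:
        batch = min(concurrency, remaining)
        for _ in range(batch):
            pool.acquire()
        for _ in range(batch):
            pool.release()
        remaining -= batch
    return pool.handshakes
```

In this model, 10,000 short-lived client connections arriving 8 at a time cost the pods 8 handshakes, where the current TCP pass-through setup would cost them 10,000. The client-facing handshakes still happen, but at the load balancer rather than on the pods.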

Custom domain flow (same as current setup):

client -> GCP TCP LB -> (HAProxy TCP and PROXY protocol) -> pages pods (TLS termination)

Subdomain flow (new):

client -> Cloudflare edge HTTPS LB (TLS termination) -> GCP HTTPS LB -> pages pods

We'd need to deploy wildcard certs to the edge (Cloudflare) as well as the GCP LB.
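The split between the two flows can be sketched as a hostname check. This is only an illustration of which hostnames a `*.gitlab.io` wildcard cert could serve; the function name and return values are hypothetical, and in practice the split would more likely be made by DNS and load balancer configuration than by application code.

```python
WILDCARD_SUFFIX = ".gitlab.io"  # from the *.gitlab.io wildcard case above


def choose_flow(hostname):
    """Classify a requested hostname into one of the two flows (illustrative)."""
    host = hostname.strip().rstrip(".").lower()
    if host.endswith(WILDCARD_SUFFIX) and len(host) > len(WILDCARD_SUFFIX):
        label = host[: -len(WILDCARD_SUFFIX)]
        # A wildcard cert covers exactly one left-most label, so
        # "a.b.gitlab.io" would not match "*.gitlab.io".
        if "." not in label:
            return "subdomain"      # TLS terminated at the edge
    return "custom_domain"          # TCP proxied, TLS terminated on the pods
```

For example, `choose_flow("group.gitlab.io")` returns `"subdomain"`, while `choose_flow("docs.example.com")` returns `"custom_domain"`. Anything the wildcard cert cannot cover falls back to the existing pod-terminated path.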

Capabilities this unlocks:

  • TLS termination at the edge (improves latency)
  • Traffic introspection (we decrypt earlier)
  • Rate-limiting at edge / LB level
  • Caching

Note that we do not get these capabilities for custom domains, so we still need to support that flow, and it needs to keep its rate limits in place.