Static Egress IP Addresses for Cloud Run Based Runway Services

Problem Statement

Currently, Runway services deployed on Cloud Run (AI Gateway, DWS, glgo, PVS, etc.) send requests from a dynamic IP address pool. This creates challenges for downstream services that need to:

  1. Implement rate limiting while allowing legitimate Runway service traffic to pass through unrestricted
  2. Protect against DDoS attacks by allowlisting trusted traffic sources
  3. Secure billing-critical endpoints by restricting access to known, trusted sources only

Without static egress IPs, downstream services cannot reliably distinguish between legitimate Runway service requests and potentially malicious traffic.

Business Impact

  • Usage Billing Protection: Data Insights Platform (DIP) ingress needs to lock down access so that billing events are accepted only from the legitimate source, AI Gateway
  • Security Posture: WAF/Cloudflare rate limiting rules cannot effectively protect endpoints while allowing Runway services through
  • Operational Overhead: Current workarounds (high rate limit thresholds, request characteristic fingerprinting) are less reliable and harder to maintain than IP-based allowlisting

Proposed Solution

Provide static egress IP addresses for Cloud Run based Runway services:

  • One static IP for staging environment services
  • One static IP for production environment services
  • These IPs would be the source address for all outbound requests from: AI Gateway, DWS, glgo, PVS, and other Cloud Run based services

The IP addresses do not need to come from GitLab-owned blocks; GCP-provided static IPs are acceptable.

Implementation Approach

Based on Google Cloud documentation, this can be implemented using Cloud Router and Cloud NAT:

  1. Reserve static external IP addresses: Create google_compute_address resources for staging and production

  2. Use existing Terraform module: The terraform-google-modules/cloud-router/google module is already in use for GKE (see modules/runtimes/gcp_gke/network.tf)

  3. Configure Cloud NAT with manual IP allocation:

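    nats = [
      {
        # Use only the reserved addresses; no auto-allocated NAT IPs.
        nat_ip_allocate_option = "MANUAL_ONLY"
        # Reference to the google_compute_address reserved in step 1.
        nat_ips                = [google_compute_address.static.self_link]
      }
    ]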

  4. Configure Cloud Run services: Set up the VPC connector so that all egress traffic, not only traffic to private ranges, is routed through the VPC and therefore through the NAT gateway. This likely needs to happen in the Reconciler, i.e. in the runwayctl repository.
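
The four steps above can be sketched end to end as a single Terraform configuration applied once per environment. This is a minimal illustration under stated assumptions, not the Runway implementation: every variable name, every resource name, the example Cloud Run service, and the use of a pre-existing Serverless VPC Access connector are hypothetical. Only the cloud-router module source and the two `nats` keys are taken from the text above. In particular, step 4 may end up in the Reconciler rather than in a resource like the one shown here.

```hcl
# Sketch only: all variable and resource names below are hypothetical.
variable "project_id" { type = string }
variable "region" { type = string }
variable "network" { type = string }          # name or self link of the VPC
variable "environment" { type = string }      # e.g. "staging" or "production"
variable "vpc_connector_id" { type = string } # assumed pre-existing VPC connector

# Step 1: reserve one regional external address for this environment.
resource "google_compute_address" "static" {
  project      = var.project_id
  region       = var.region
  name         = "runway-egress-${var.environment}"
  address_type = "EXTERNAL"
}

# Steps 2 and 3: Cloud Router with a NAT that uses only the reserved address.
module "cloud_router" {
  source = "terraform-google-modules/cloud-router/google"
  # Pin a module version in real use; it is omitted here on purpose.

  name    = "runway-egress-${var.environment}"
  project = var.project_id
  region  = var.region
  network = var.network

  nats = [
    {
      name                   = "runway-egress-${var.environment}"
      nat_ip_allocate_option = "MANUAL_ONLY"
      nat_ips                = [google_compute_address.static.self_link]
    }
  ]
}

# Step 4 (illustrative): send all egress through the VPC so it reaches the NAT.
resource "google_cloud_run_v2_service" "example" {
  project  = var.project_id
  location = var.region
  name     = "example-runway-service"

  template {
    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello" # placeholder image
    }
    vpc_access {
      connector = var.vpc_connector_id
      egress    = "ALL_TRAFFIC" # default would only route private ranges
    }
  }
}

# The address downstream teams would allowlist.
output "egress_ip" {
  value = google_compute_address.static.address
}
```

For the NAT to apply, the connector has to sit in the same network and region as the Cloud Router. Whether the existing GKE router can be reused or a separate one is needed is not settled by this sketch.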

Timeline

Target: Mid-November 2025 to support Usage Billing launch and DIP security requirements.

Open Questions

Does this approach remain relevant given the discussion in gitlab-org/architecture/usage-billing/design-doc#11 about using NATS (the messaging system, not to be confused with Cloud NAT) to decouple Snowplow and AI Gateway? If NATS is used for usage metrics ingestion, would the Snowplow endpoint still need to be publicly reachable?

Success Criteria

  • Downstream services (DIP, Snowplow) can configure WAF/Cloudflare rules to allowlist specific static IPs
  • Rate limiting can be applied to all non-allowlisted traffic without impacting Runway services
  • Solution is in place before Usage Billing goes live in mid-November 2025
  • Implementation follows existing Runway infrastructure patterns and uses established Terraform modules
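
To make the first two criteria concrete, the following sketch shows one way the static IPs could be turned into filter expressions for Cloudflare rules. It is an assumption-laden illustration: the variable name, the local names, and the addresses (placeholders from the 203.0.113.0/24 documentation range) are all hypothetical, and no Cloudflare resources are declared. The actual rule definitions would live wherever the downstream teams manage their WAF configuration.

```hcl
# Sketch only: names and addresses are placeholders, not real Runway values.
variable "runway_egress_ips" {
  description = "Static egress IPs of Runway Cloud Run services, keyed by environment."
  type        = map(string)
  default = {
    staging    = "203.0.113.10"
    production = "203.0.113.20"
  }
}

locals {
  # Space-separated list, as used inside an inline set in Cloudflare expressions.
  runway_ip_set = join(" ", values(var.runway_egress_ips))

  # Matches requests that originate from a Runway static egress IP.
  allowlist_expression = "(ip.src in {${local.runway_ip_set}})"

  # Matches everything else; intended as the scope of a rate limiting rule.
  rate_limit_expression = "not ${local.allowlist_expression}"
}

output "allowlist_expression" {
  value = local.allowlist_expression
}

output "rate_limit_expression" {
  value = local.rate_limit_expression
}
```

Keeping the addresses in one variable means a future IP change only needs to be made in one place, provided the downstream rules consume these outputs rather than hard-coded values.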

Related Issues

  • gitlab-org/gitlab#571768+
  • gitlab-org/architecture/usage-billing/design-doc#11+

References

Edited by Florian Forster