Cell: Routing PoC: Cloudflare Worker

The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, gitlab.com), instead of requiring users to navigate to separate domains.

We need to decide what technology the routing service is written in. The choice depends on which language performs best and on how and where the routing layer is expected to be deployed. If the service is required to be multi-cloud, it might be necessary to deploy it to the CDN provider. In that case, the service needs to be written using a technology compatible with the CDN provider.

We also have a PoC from @andrewn https://gitlab.com/andrewn/stateless-router, see #408507 (comment 1379811355) for more context.

As an outcome of this issue, we expect:

  • Evaluate Cloudflare worker as the technology to implement the routing layer
  • Document decisions in the Cells blueprint
  • Create a plan with concrete issues to continue the development of the routing layer: discovery, single domain interaction, endpoints classification, GraphQL, etc.

Related to #408507 (closed).

Reference:

  • Cells blueprint Routing layer
  • Proposal: Stateless Router with buffering requests
  • Proposal: Stateless Router with learning routes

Requirements:

See https://docs.gitlab.com/ee/architecture/blueprints/cells/routing-service.html#requirements

Results

We deployed the router to https://rules-router.sxuereb.workers.dev/. The demo shows a single domain routing to two different GitLab instances.

Pros

  • Low Latency: Around 10-15ms, with a maximum of 50ms.
  • Global Deployment: We automatically deploy to multiple regions, which helps us move towards regional deployments.
  • At the Edge: We are at the edge of the request, having the ability to route to multiple Cells before reaching GCP's network.
  • Autoscaling: We don't have to scale up or down manually, everything is handled for us. Cloudflare is a trustworthy company we've been using and we know it can handle our load.
  • Multiple protocol support: Supports multiple protocols, such as WebSockets.
  • Meets requirements: Can meet all the requirements, with some work, since not everything is available out of the box.
  • Support for storage: Provides cache and KV as storage options that are global and scale up/down with workers.
  • Security: Since this is a hosted offering it has a strong security model to prevent problems.
  • Enforce Zero Trust: All subrequests need to go to public resources, but Cloudflare provides easy mTLS authentication.
  • Community Support: The documentation is informative, and they provide a forum and Discord for support (outside of enterprise support).
  • Deployment tooling: Can use our existing deployment/provisioning tooling like GitLab CI and Terraform
  • Developer experience: Easy to get set up locally and provides a good and fast experience.

Cons

  • Day 2 operations: There is no "out of the box" rolling deployment.
    • We also don't have this functionality with HAProxy
    • We can implement this with https://betterprogramming.pub/gradual-rollout-of-cloudflare-workers-9cc151ed23a8
  • Platform limits: Cloudflare's simultaneous connection limit can hurt performance.
  • Costs: Can be a bit costly to run.
  • Vendor lock-in: We would be locked into the Cloudflare ecosystem even more, and it doesn't provide a solution for our self-hosted customers.
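
The missing rolling deployment noted under Day 2 operations could be approximated with a deterministic percentage split between two router versions. The following is a minimal sketch under stated assumptions, not the method from the linked article or from the PoC: the names `rolloutBucket` and `chooseVersion`, the "stable"/"canary" labels, and the choice of hash are all hypothetical.

```typescript
// Hypothetical sketch: percentage-based rollout between two router versions.
// None of these names come from the PoC; they are for illustration only.

type Version = "stable" | "canary";

// Map a stable key (for example, a client identifier) to a bucket in [0, 100).
// A small polynomial rolling hash is used only because it is deterministic
// and stays within safe integer range; any stable hash would do.
function rolloutBucket(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
  }
  return hash % 100;
}

// Send roughly `canaryPercent` percent of keys to the new version.
// The same key always lands in the same bucket, so a client does not
// flip between versions while the percentage stays fixed.
function chooseVersion(key: string, canaryPercent: number): Version {
  return rolloutBucket(key) < canaryPercent ? "canary" : "stable";
}
```

Raising `canaryPercent` step by step from 0 to 100 would move traffic gradually. How the chosen version maps to an actual deployed worker is left open here.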

Action items

  • Modify Rails app for X-GitLab-Cell-Redirect, MR !137360 (closed)
  • Write Cloudflare worker for X-GitLab-Cell-Redirect, demo https://www.youtube.com/watch?v=xyTIF0dnkng, repo https://gitlab.com/tkuah/stateless-router-cloudflare
  • Extend Cloudflare worker to have cache
  • Measure latency for Cloudflare worker (local) with X-GitLab-Cell-Redirect. 👉 #433471 (comment 1692215842)
  • Write Cloudflare worker with cookie routing rules
  • Deploy Cloudflare worker live with two real GitLab sites 👉 https://rules-router.sxuereb.workers.dev/poc/gitlab-1
  • Record demo 👉 https://www.youtube.com/watch?v=taTBQEBiny4
  • Calculate latency overhead of a deployed Cloudflare worker 👉 #433471 (comment 1715862545)
    • Run tests from the US instead of locally 👉 #433471 (comment 1718150920)
    • Run tests hitting the worker vs the website 👉 #433471 (comment 1718373925)
  • Investigate if Cloudflare worker limits are suitable for Routing layer https://developers.cloudflare.com/workers/platform/limits/ 👉 #433471 (comment 1707918030)
    • Networking. #427726 (comment 1597077130) indicates that resources that are fetched need to be publicly accessible (aka no private networking).
  • Calculate costs of running this router with the current traffic 👉 #433471 (comment 1707800071)
  • Observability 👉 #433471 (comment 1717951790)
    • Logs
    • Metrics
    • Traces
  • Clean up Resources
    • Delete GCP machines: `gcloud --project eng-core-tenant-poc-bbc34148 compute instances delete gitlab-0 gitlab-1 k6`
    • Delete worker https://dash.cloudflare.com/1a7f12af9c5a32804050e7cfe7ca313e/workers/services/view/rules-router/production
    • Remove Cloudflare as nameserver for steveazz.xyz domain (@sxuereb owns this domain)
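
The cookie routing rules and the X-GitLab-Cell-Redirect handling from the action items can be sketched as plain decision functions. This is a hedged illustration, not the code in the linked repositories. The cookie name `cell`, the cell names, the example.com origins, and the assumption that the X-GitLab-Cell-Redirect header carries the name of the Cell to retry against are all hypothetical.

```typescript
// Hypothetical sketch of the routing decisions; not the PoC implementation.
// Cell names, origins, and the cookie name are assumptions for illustration.

const CELL_ORIGINS: { [cell: string]: string } = {
  "cell-1": "https://cell-1.example.com",
  "cell-2": "https://cell-2.example.com",
};
const DEFAULT_CELL = "cell-1";
const ROUTING_COOKIE = "cell"; // hypothetical cookie name

function isKnownCell(cell: string): boolean {
  return Object.prototype.hasOwnProperty.call(CELL_ORIGINS, cell);
}

// Extract one cookie value from a Cookie header, or null when absent.
function getCookie(header: string | null, name: string): string | null {
  if (!header) return null;
  const parts = header.split(";");
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i].trim();
    const eq = part.indexOf("=");
    if (eq > 0 && part.slice(0, eq) === name) return part.slice(eq + 1);
  }
  return null;
}

// Cookie routing rule: use the Cell named in the cookie when it is known,
// otherwise fall back to the default Cell.
function cellFromCookie(cookieHeader: string | null): string {
  const cell = getCookie(cookieHeader, ROUTING_COOKIE);
  return cell !== null && isKnownCell(cell) ? cell : DEFAULT_CELL;
}

// Redirect rule (assumed semantics): if the response from `currentCell`
// names a different, known Cell in X-GitLab-Cell-Redirect, return that Cell
// so the router can retry there; otherwise return null.
function redirectTarget(currentCell: string, redirectHeader: string | null): string | null {
  if (redirectHeader === null) return null;
  const cell = redirectHeader.trim();
  return isKnownCell(cell) && cell !== currentCell ? cell : null;
}

// Build the upstream URL for a Cell from the incoming path and query string.
function targetUrl(cell: string, pathAndQuery: string): string {
  return CELL_ORIGINS[cell] + pathAndQuery;
}
```

In a Worker, a fetch handler would call `cellFromCookie` on the incoming Cookie header, forward the request to `targetUrl(...)`, and retry once if `redirectTarget(...)` returns a Cell. Retrying a request with a body would require buffering or cloning it first, which this sketch does not cover.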

Resources

  • https://developers.cloudflare.com/workers/observability/local-development-and-testing/
Edited Jan 15, 2024 by Steve Xuereb