Cell: Routing PoC: Cloudflare Worker
The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, gitlab.com), so that users do not have to navigate to separate domains.
We need to decide what technology the routing service is written in. The choice depends on which language performs best and on how and where the routing layer is expected to be deployed. If the service must be multi-cloud, it might have to be deployed to the CDN provider, in which case it needs to be written using a technology compatible with that provider.
We also have a PoC from @andrewn: https://gitlab.com/andrewn/stateless-router. See #408507 (comment 1379811355) for more context.
As an outcome of this issue, we expect to:
- Evaluate Cloudflare worker as the technology to implement the routing layer
- Document decisions in the Cells blueprint
- Create a plan with concrete issues to continue the development of the routing layer: discovery, single domain interaction, endpoints classification, GraphQL, etc.
Related to #408507.
Reference:
- Cells blueprint Routing layer
- Proposal: Stateless Router with buffering requests
- Proposal: Stateless Router with learning routes
Requirements:
See https://docs.gitlab.com/ee/architecture/blueprints/cells/routing-service.html#requirements
Results
We deployed the router at https://rules-router.sxuereb.workers.dev/. The demo shows a single domain routing to two different GitLab instances.
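As a rough illustration of how a single domain can front two instances, here is a minimal sketch of cookie-based routing in a Worker. The cookie name `cell`, the cell names, and the `example.com` origins are hypothetical placeholders, not the values used in the PoC, and the PoC's actual rules may differ.

```typescript
// Hypothetical cookie name and cell origins (assumptions, not the PoC's values).
const CELL_COOKIE = "cell";
const CELL_ORIGINS: Record<string, string> = {
  "gitlab-0": "https://gitlab-0.example.com",
  "gitlab-1": "https://gitlab-1.example.com",
};
const DEFAULT_ORIGIN = CELL_ORIGINS["gitlab-0"];

// Return the value of one cookie from a Cookie header, or undefined if absent.
function getCookie(header: string | null, name: string): string | undefined {
  if (!header) return undefined;
  for (const part of header.split(";")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    if (part.slice(0, idx).trim() === name) return part.slice(idx + 1).trim();
  }
  return undefined;
}

// Pick the origin named by the cookie; unknown or missing values use the default.
function pickOrigin(cookieHeader: string | null): string {
  const wanted = getCookie(cookieHeader, CELL_COOKIE);
  if (wanted !== undefined && Object.prototype.hasOwnProperty.call(CELL_ORIGINS, wanted)) {
    return CELL_ORIGINS[wanted];
  }
  return DEFAULT_ORIGIN;
}

// Point the incoming URL at the chosen cell, keeping the path and query string.
function upstreamUrl(requestUrl: string, cookieHeader: string | null): string {
  const incoming = new URL(requestUrl);
  return pickOrigin(cookieHeader) + incoming.pathname + incoming.search;
}

// In a Worker this would be exported as `export default { fetch: handleCookieRouted }`.
async function handleCookieRouted(request: Request): Promise<Response> {
  const target = upstreamUrl(request.url, request.headers.get("Cookie"));
  // Copy method, headers, and body from the original request onto the new URL.
  return fetch(new Request(target, request));
}
```

Keeping the routing decision in pure functions (`pickOrigin`, `upstreamUrl`) makes the rules testable without a Worker runtime.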
Pros
- Low Latency: Around 10-15 ms, with a maximum of 50 ms.
- Global Deployment: We automatically deploy to multiple regions, which helps us move towards regional deployments.
- At the Edge: We are at the edge of the request, having the ability to route to multiple Cells before reaching GCP's network.
- Autoscaling: We don't have to scale up or down manually; everything is handled for us. Cloudflare is a trustworthy company we've been using and we know it can handle our load.
- Multiple protocol support: Supports multiple protocols, such as WebSockets.
- Meets requirements: Can meet all the requirements, with some work, since not everything is available out of the box.
- Support for storage: Provides a cache and KV as storage options that are global and scale up/down with workers.
- Security: Since this is a hosted offering it has a strong security model to prevent problems.
- Enforce Zero Trust: All subrequests need to go to public resources, but mTLS authentication is easy to set up.
- Community Support: The documentation is informative, and they provide a forum and Discord for support (outside of enterprise support).
- Deployment tooling: Can use our existing deployment/provisioning tooling like GitLab CI and Terraform
- Developer experience: Easy to get set up locally and provides a good and fast experience.
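A latency figure like the one in the "Low Latency" item can be estimated by timing the same path directly against a GitLab instance and through the worker, then comparing percentiles. The sketch below shows only that arithmetic; it is not the tooling used for the PoC, and the function names are hypothetical.

```typescript
// Time one full request in milliseconds, including reading the body.
async function timeRequest(url: string): Promise<number> {
  const start = Date.now();
  const response = await fetch(url);
  await response.arrayBuffer(); // drain the body so the timing covers the whole response
  return Date.now() - start;
}

// Nearest-rank percentile of latency samples (p between 0 and 100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Overhead added by the worker at a given percentile: via-worker minus direct.
function workerOverhead(direct: number[], viaWorker: number[], p: number): number {
  return percentile(viaWorker, p) - percentile(direct, p);
}
```

Comparing percentiles rather than averages keeps a few slow outliers from hiding the typical overhead; `workerOverhead(direct, viaWorker, 50)` gives the median case and `workerOverhead(direct, viaWorker, 100)` the worst observed one.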
Cons
- Day 2 operations: There is no "out of the box" rolling deployment.
  - We also don't have this functionality with HAProxy.
  - We can implement this with https://betterprogramming.pub/gradual-rollout-of-cloudflare-workers-9cc151ed23a8
- Platform limits: Cloudflare's simultaneous connection limit can hurt performance.
- Costs: Can be a bit costly to run.
- Vendor lock-in: We would be locked into the Cloudflare ecosystem even more, and it doesn't provide a solution for our self-hosted customers.
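For the missing rolling deployment, one common approach is to send a fixed percentage of traffic to a candidate version, chosen deterministically per client. The sketch below shows that general technique and is not necessarily what the linked article does; the choice of key (for example a client IP or session ID) and the function names are assumptions.

```typescript
// FNV-1a 32-bit hash: small, deterministic, and dependency-free.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a stable key to a bucket in the range 0..99.
function rolloutBucket(key: string): number {
  return fnv1a(key) % 100;
}

// True when this key falls inside the rollout percentage (0 to 100).
function useCandidate(key: string, rolloutPercent: number): boolean {
  return rolloutBucket(key) < rolloutPercent;
}
```

Because the bucket depends only on the key, a client stays on the same version across requests, and raising the percentage only ever moves clients from the current version to the candidate.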
Action items
- Modify Rails app for X-GitLab-Cell-Redirect, MR !137360 (closed)
- Write Cloudflare worker for X-GitLab-Cell-Redirect, demo https://www.youtube.com/watch?v=xyTIF0dnkng, repo https://gitlab.com/tkuah/stateless-router-cloudflare
- Extend Cloudflare worker to have cache
- Measure latency for Cloudflare worker (local) with X-GitLab-Cell-Redirect 👉 #433471 (comment 1692215842)
- Write Cloudflare worker with cookie routing rules
- Deploy Cloudflare worker live with two real GitLab sites 👉 https://rules-router.sxuereb.workers.dev/poc/gitlab-1
- Record demo 👉 https://www.youtube.com/watch?v=taTBQEBiny4
- Calculate latency overhead of a deployed Cloudflare worker 👉 #433471 (comment 1715862545)
  - Run tests from the US instead of locally 👉 #433471 (comment 1718150920)
  - Run tests hitting the worker vs the website 👉 #433471 (comment 1718373925)
- Investigate if Cloudflare worker limits are suitable for Routing layer https://developers.cloudflare.com/workers/platform/limits/ 👉 #433471 (comment 1707918030)
  - Networking. #427726 (comment 1597077130) indicates that resources that are fetched need to be accessible publicly (aka no private networking).
- Calculate costs of running this router with the current traffic 👉 #433471 (comment 1707800071)
- Observability 👉 #433471 (comment 1717951790)
  - Logs
  - Metrics
  - Traces
- Clean up Resources
  - Delete GCP machines: `gcloud --project eng-core-tenant-poc-bbc34148 compute instances delete gitlab-0 gitlab-1 k6`
  - Delete worker https://dash.cloudflare.com/1a7f12af9c5a32804050e7cfe7ca313e/workers/services/view/rules-router/production
  - Remove Cloudflare as nameserver for steveazz.xyz domain (@sxuereb owns this domain)
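The X-GitLab-Cell-Redirect and cache items can be sketched together as follows. This assumes the header value names the cell that owns the resource and that the router remembers the answer per path; the cell names, origins, and in-memory cache are hypothetical stand-ins for whatever the PoC uses. It only covers requests without a body, since retrying a request with a body would require buffering it first.

```typescript
const REDIRECT_HEADER = "X-GitLab-Cell-Redirect";

// Hypothetical cell names and origins (assumptions for illustration).
const CELLS: Record<string, string> = {
  "cell-1": "https://cell-1.example.com",
  "cell-2": "https://cell-2.example.com",
};
const FIRST_GUESS = "cell-1";

type Fetch = (url: string) => Promise<Response>;

// Learned path -> cell mapping. A plain object lives only as long as one Worker
// isolate; a shared store would be needed for a global cache.
const learned: Record<string, string> = {};

// True when the value is the name of a configured cell.
function isKnownCell(cell: string | null | undefined): cell is string {
  return typeof cell === "string" && Object.prototype.hasOwnProperty.call(CELLS, cell);
}

// Try the cached (or default) cell first; if it answers with a redirect header
// naming another known cell, remember that and retry once against it.
async function routeWithLearning(path: string, doFetch: Fetch): Promise<Response> {
  const cached = learned[path];
  const first = isKnownCell(cached) ? cached : FIRST_GUESS;
  const response = await doFetch(CELLS[first] + path);

  const redirect = response.headers.get(REDIRECT_HEADER);
  if (!isKnownCell(redirect) || redirect === first) return response;

  learned[path] = redirect;
  return doFetch(CELLS[redirect] + path);
}

// In a Worker this could be wired up as:
// export default { fetch: (req: Request) => routeWithLearning(new URL(req.url).pathname, fetch) };
```

Passing the fetch function in as a parameter keeps the learning logic testable with a fake upstream. The first request to an unknown path may cost two upstream calls; later requests for the same path go straight to the learned cell.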