Cell deployment topology
We want to use Dedicated to deploy and manage future Cells. Dedicated is only available on AWS today, while the rest of GitLab.com is on GCP. (A rough business case was calculated here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/172)
We should think about which cloud provider(s) additional Cells would need to be on, and whether that means extending Dedicated to GCP. That would be a significant undertaking, so it would be good to know whether we need to start that project sooner rather than later.
Cell topology outline
- Provider - GCP (decided)
- Database shared context - we will need close-to-zero latency when accessing that information.
- Single site - GitLab.com: there will be a single entry point (a load balancer), and our routing-layer service will route traffic to the correct Cell. Using different cloud providers would incur a double ingress cost: first from AWS to GCP, then from GCP to the client (via Cloudflare).
- Regions - they would still likely need to stay within a single cloud provider, as cross-zone rates are lower than rates between providers; in that case we would provide a load balancer close to each region (anycast?)
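As a rough illustration of the single-entry-point model above, the routing layer could resolve which Cell owns a request before proxying it. A minimal sketch; the cell map, names, and organization-prefix scheme are all hypothetical, not the actual GitLab.com design:

```python
# Hypothetical routing-layer sketch: a single entry point resolves the
# Cell that should serve each request. All identifiers are illustrative.

CELL_BY_ORG = {
    "gitlab-org": "cell-1",
    "example-org": "cell-2",
}
DEFAULT_CELL = "cell-1"  # fallback for unknown or legacy traffic


def resolve_cell(path: str) -> str:
    """Pick a Cell from the first path segment (the organization)."""
    org = path.lstrip("/").split("/", 1)[0]
    return CELL_BY_ORG.get(org, DEFAULT_CELL)
```

For example, `resolve_cell("/gitlab-org/gitlab/-/issues/1")` returns `"cell-1"`, while unknown organizations fall back to the default Cell.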
- Resilience
- Backup / Restore
- Git: Likely snapshots - continuing the work we are doing on .com today.
- DB: Cloud SQL snapshots
- Object storage: versioning, if needed.
- Regional failover
- Paid: Geo replication? Or potentially a new Cell-based service that can selectively replicate organizations between Cells.
- Free: Restore from backup as needed.
- Networking
- VPC & project per Cell? We are already splitting across multiple GCP projects, as we are hitting project-level rate limits.
- Dedicated Private Networking design - https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/epics/46
- May need a peering connection to the VPC with the shared database context, unless we can access it via a public API endpoint.
- Deployment
- 100k or 250k reference architecture? Large enough for bin-packing efficiencies, while still reducing the blast radius and fitting within GCP project rate limits. We can define this further as the DR WG proceeds with the number of Git GB per project.
- DB: Cloud SQL?
- Redis: Memorystore?
- Elastic: Elastic SaaS?
- Zoekt: GitLab-managed
- ClickHouse: GitLab-managed
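The deployment bullets above could be captured as a per-Cell manifest that records the sizing choice and which services are cloud-managed versus GitLab-managed. A minimal sketch; every field value here is an example reflecting the open questions above, not a decision:

```python
from dataclasses import dataclass

# Illustrative per-Cell deployment manifest. Values mirror the candidate
# services listed above; names like "cell-1" are hypothetical.


@dataclass
class CellSpec:
    name: str
    reference_architecture: str          # e.g. "100k" or "250k" users
    gcp_project: str                     # one project per Cell (rate limits)
    database: str = "cloud-sql"          # managed PostgreSQL
    redis: str = "memorystore"           # managed Redis
    search: str = "elastic-saas"         # Elastic SaaS
    zoekt: str = "gitlab-managed"        # code search, self-run
    clickhouse: str = "gitlab-managed"   # analytics, self-run


cell = CellSpec(
    name="cell-1",
    reference_architecture="100k",
    gcp_project="gitlab-cell-1",
)
```

Keeping this per Cell (rather than globally) matches the one-project-per-Cell split noted under Networking.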
Edited by Joshua Lambert