Steve Xuereb - Out of Office Back 2025-04-21
--- a/doc/architecture/blueprints/cells/global-service.md

+ 94

− 3
+++ b/doc/architecture/blueprints/cells/global-service.md

 @@ -5,6 +5,8 @@ description: 'Cells: Global Service'
 status: accepted
 ---
+<!-- vale gitlab.FutureTense = NO -->
 # Cells: Global Service
 This document describes design goals and architecture of Global Service
 @@ -19,7 +21,7 @@ Global Service, that can be deployed in many regions.
 1. **Technology.**
-    The Global Service will be written in [Golang](https://go.dev/)
+    The Global Service will be written in [Go](https://go.dev/)
    and expose API over [gRPC](https://grpc.io/).
 1. **Cells aware.**
 @@ -173,12 +175,101 @@ The original [Cells 1.0](iterations/cells-1.0.md) described [Primary Cell API](i
   by various services (HTTP Routing Service, SSH Routing Service, each Cell).
 1. As part of Cells 1.0 PoC we discovered that we need to provide robust classification API
   to support more workflows than anticipated. We need to classify various resources
-   (username for login, projects for ssh routing, etc.) to route to correct Cell.
+   (username for login, projects for SSH routing, etc.) to route to correct Cell.
   This would put a lot of dependency on resilience of the First Cell.
 1. It is our desire long-term to have Global Service for passing information across Cells.
   This does a first step towards long-term direction, allowing us to much easier perform
   additional functions.
+## Spanner
+[Spanner](https://cloud.google.com/spanner) will be a new data store introduced into the GitLab stack. The reasons we are going with Spanner are:
+1. It supports Multi-Regional read-write access with far less operational work than PostgreSQL, which helps with our [regional DR](../disaster_recovery/index.md).
+1. The data is read-heavy, not write-heavy.
+1. Spanner provides a [99.999%](https://cloud.google.com/spanner/sla) availability SLA when using Multi-Regional deployments.
+1. It provides consistency while still being globally distributed.
+1. Shards/[Splits](https://cloud.google.com/spanner/docs/schema-and-data-model#database-splits) are handled for us.
+The cons of using Spanner are:
+1. Vendor lock-in: our data will be hosted in a proprietary data store.
+    - How to prevent this: Global Service will use generic SQL.
+1. Not self-managed friendly, which matters when we want to make Global Service available to self-managed customers.
+    - How to prevent this: Spanner supports PostgreSQL dialect.
+1. It is a brand new data store that we need to learn to operate and develop with.
+### GoogleSQL vs PostgreSQL dialects
+Spanner supports two dialects: [GoogleSQL](https://cloud.google.com/spanner/docs/reference/standard-sql/overview) and [PostgreSQL](https://cloud.google.com/spanner/docs/reference/postgresql/overview).
+The dialect [doesn't change the performance characteristics of Spanner](https://cloud.google.com/spanner/docs/postgresql-interface#choose); it mostly affects how database schemas and queries are written.
+Choosing a dialect is a one-way door decision: to change the dialect we would have to go through a data migration process.
+We will use the `PostgreSQL` dialect and connect to the database with the official Spanner clients, so that we don't have to manage a [proxy layer](https://cloud.google.com/spanner/docs/pgadapter), because:
+1. GitLab is mostly familiar with PostgreSQL development.
+1. Global Service isn't affected by any of the [known issues](https://cloud.google.com/spanner/docs/reference/postgresql/known-issues-postgresql-interface) of the dialect.
+1. Global Service won't use any of the [unsupported data types](https://cloud.google.com/spanner/docs/reference/postgresql/data-types#unsupported-data-types).
+1. Global Service queries are [supported](https://cloud.google.com/spanner/docs/reference/postgresql/overview).
+1. The difference from a standard PostgreSQL setup will be smaller if we provide Global Service to self-managed customers.
+1. Both dialects expose the same type of data in the Information Schema ([GoogleSQL](https://cloud.google.com/spanner/docs/information-schema), [PostgreSQL](https://cloud.google.com/spanner/docs/information-schema-pg)).
+This also means that:
+1. Schemas will still be written specifically for Spanner. For example, a primary key schema will look as follows:
+    ```sql
+    CREATE TABLE Fans (
+      FanId varchar(36) DEFAULT spanner.generate_uuid(),
+      Name text,
+      PRIMARY KEY (FanId)
+    );
+    ```
+    [source](https://cloud.google.com/spanner/docs/primary-key-default-value#universally_unique_identifier_uuid)
+    This means the schema will not be fully compatible with PostgreSQL.
+1. We will connect to Spanner using the official [Spanner Client Libraries](https://cloud.google.com/spanner/docs/reference/libraries#client-libraries-install-go) for maximum compatibility.
+1. We will not set up the [`PGAdapter`](https://cloud.google.com/spanner/docs/pgadapter), to avoid adding another service to the architecture.
+1. We will use the [emulator](https://cloud.google.com/spanner/docs/emulator) for local development. Using a local PostgreSQL instance will not work for schema definition.
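As a minimal sketch of the connection points above: the official Go client's `spanner.NewClient` takes a fully qualified database name, and the client libraries connect to the emulator when the `SPANNER_EMULATOR_HOST` environment variable is set. The helper names and the project, instance, and database IDs below are hypothetical and only illustrate the shape. The sketch uses only the standard library so it runs without the Spanner dependency.

```go
package main

import (
	"fmt"
	"os"
)

// databasePath builds the fully qualified database name in the form
// projects/<project>/instances/<instance>/databases/<database>.
// This is the string that would be passed to spanner.NewClient.
func databasePath(project, instance, database string) string {
	return fmt.Sprintf("projects/%s/instances/%s/databases/%s", project, instance, database)
}

// usingEmulator reports whether SPANNER_EMULATOR_HOST is set. When it is,
// the official client libraries talk to the local emulator instead of GCP.
func usingEmulator() bool {
	return os.Getenv("SPANNER_EMULATOR_HOST") != ""
}

func main() {
	// Hypothetical IDs, chosen only for illustration.
	db := databasePath("example-project", "global-service", "claims")
	fmt.Println("database:", db)
	fmt.Println("emulator:", usingEmulator())
}
```

In a real service the returned path would be handed to the Spanner client together with a context. No `PGAdapter` process is involved in this flow.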
+Citations:
+1. Google (n.d.). _PostgreSQL interface for Spanner._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/postgresql-interface>
+1. Google (n.d.). _Dialect parity between GoogleSQL and PostgreSQL._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/reference/dialect-differences>
+### Multi-Regional
+Running Multi-Regional read-write is one of the biggest selling points of Spanner.
+When provisioning an instance you can choose a single-region or a multi-region configuration.
+After provisioning you can [move an instance](https://cloud.google.com/spanner/docs/move-instance) while it is running, but this is a manual process that requires assistance from GCP.
+We will provision a Multi-Regional Cloud Spanner instance because:
+1. It won't require a migration to Multi-Regional in the future.
+1. We have Multi-Regional from day 0, which cuts the scope of multi-region deployments at GitLab.
+This will, however, increase the cost considerably. Using public-facing numbers from GCP:
+1. [Regional](https://cloud.google.com/products/calculator?hl=en&dl=CiRlMjU0ZDQyMy05MmE5LTRhNjktYjUzYi1hZWE2MjQ4N2JkNDcQIhokOTlGQUM4RjUtNjdBRi00QTY1LTk5NDctNThCODRGM0ZFMERC): $1,716
+1. [Multi Regional](https://cloud.google.com/products/calculator?hl=en&dl=CiQzNjc2ODc5My05Y2JjLTQ4NDQtYjRhNi1iYzIzODMxYjRkYzYQIhokOTlGQUM4RjUtNjdBRi00QTY1LTk5NDctNThCODRGM0ZFMERC): $9,085
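To put the two calculator figures above side by side, the following sketch computes the absolute difference and the multiplier between them. The function name is hypothetical. The amounts are taken as-is from the list above, in whatever billing period the calculator links use.

```go
package main

import "fmt"

// costComparison returns the absolute difference and the ratio between the
// Multi-Regional and the Regional price, both given in whole dollars.
func costComparison(regional, multiRegional int) (delta int, ratio float64) {
	delta = multiRegional - regional
	ratio = float64(multiRegional) / float64(regional)
	return delta, ratio
}

func main() {
	// Figures from the GCP pricing calculator links above.
	delta, ratio := costComparison(1716, 9085)
	fmt.Printf("extra cost: $%d, multiplier: %.1fx\n", delta, ratio)
	// prints "extra cost: $7369, multiplier: 5.3x"
}
```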
+Citations:
+1. Google (n.d.). _Regional and multi-region configurations._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/instance-configurations>
+1. Google (n.d.). _Replication._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/replication>
+### Performance
+We haven't run any benchmarks ourselves because we don't have a full schema designed.
+However, according to the [performance documentation](https://cloud.google.com/spanner/docs/performance), both the read and write throughput of a Spanner instance scale linearly as you add more compute capacity.
+### Alternatives
+1. PostgreSQL: Having a multi-regional deployment requires a lot of operational work.
+1. ClickHouse: It's an `OLAP` database, not an `OLTP` one.
+1. Elasticsearch: Search and analytics document store.
 ## FAQ
 1. Does Global Service implement all services for Cells 1.0?
 @@ -192,7 +283,7 @@ The original [Cells 1.0](iterations/cells-1.0.md) described [Primary Cell API](i
 1. How we will push all existing claims from "First Cell" into Global Service?
    We would add `rake gitlab:cells:claims:create` task. Then we would configure First Cell
-    to use Global Service, and execute the rake task. That way First Cell would claim all new
+    to use Global Service, and execute the Rake task. That way First Cell would claim all new
    records via Global Service, and concurrently we would copy data over.
 1. How and where the Global Service will be deployed?