Document Google Spanner for Global Service

What
---
Add a new section in the Global Service blueprint about Google Spanner

Why
---
We only looked at Spanner briefly, so we needed to do more research on the platform.

Reference: #454056
Signed-off-by: Steve Xuereb <sxuereb@gitlab.com>

Verified Commit 860507ee authored 10 months ago by Steve Xuereb
--- a/doc/architecture/blueprints/cells/global-service.md
+++ b/doc/architecture/blueprints/cells/global-service.md
@@ -5,6 +5,8 @@ description: 'Cells: Global Service'
 status: accepted
 ---

+<!-- vale gitlab.FutureTense = NO -->
+
 # Cells: Global Service

 This document describes design goals and architecture of Global Service
@@ -19,7 +21,7 @@ Global Service, that can be deployed in many regions.

 1. **Technology.**

-    The Global Service will be written in [Golang](https://go.dev/)
+    The Global Service will be written in [Go](https://go.dev/)
    and expose API over [gRPC](https://grpc.io/).

 1. **Cells aware.**
@@ -173,12 +175,80 @@ The original [Cells 1.0](iterations/cells-1.0.md) described [Primary Cell API](i
   by various services (HTTP Routing Service, SSH Routing Service, each Cell).
 1. As part of Cells 1.0 PoC we discovered that we need to provide robust classification API
   to support more workflows than anticipated. We need to classify various resources
-   (username for login, projects for ssh routing, etc.) to route to correct Cell.
+   (username for login, projects for SSH routing, etc.) to route to correct Cell.
   This would put a lot of dependency on resilience of the First Cell.
 1. It is our desire long-term to have Global Service for passing information across Cells.
   This does a first step towards long-term direction, allowing us to much easier perform
   additional functions.

+## Spanner
+
+[Spanner](https://cloud.google.com/spanner) will be a new data store introduced into the GitLab stack. The reasons we are going with Spanner are:
+
+1. It supports multi-regional read-write access with far less operational work than PostgreSQL, which helps with our [regional DR](../disaster_recovery/index.md).
+1. The data is read-heavy, not write-heavy.
+1. Spanner provides a [99.999%](https://cloud.google.com/spanner/sla) availability SLA when using multi-regional deployments.
+1. It provides consistency whilst still being globally distributed.
+1. Shards/[Splits](https://cloud.google.com/spanner/docs/schema-and-data-model#database-splits) are handled for us.
+
+The cons of using Spanner are:
+
+1. Vendor lock-in: our data will be hosted in a proprietary data store.
+    - How to prevent this: Global Service will use generic SQL.
+1. Not self-managed friendly, which matters when we want to make Global Service available for self-managed customers.
+    - How to prevent this: Spanner supports a PostgreSQL dialect.
+1. It is a brand new data store that we need to learn to operate and develop with.
+
+### GoogleSQL vs PostgreSQL dialects
+
+Spanner supports two dialects: [GoogleSQL](https://cloud.google.com/spanner/docs/reference/standard-sql/overview) and [PostgreSQL](https://cloud.google.com/spanner/docs/reference/postgresql/overview).
+The dialect [doesn't change the performance characteristics of Spanner](https://cloud.google.com/spanner/docs/postgresql-interface#choose); it mostly affects how database schemas and queries are written.
+Choosing a dialect is a one-way door decision: to change the dialect we'll have to go through a data migration process.
+
+We will use the `GoogleSQL` dialect for the Global Service, and [go-sql-spanner](https://github.com/googleapis/go-sql-spanner) to connect to it, because:
+
+1. Using Go's standard library `database/sql` will allow us to swap implementations, which is needed to support self-managed.
+1. GoogleSQL [data types](https://cloud.google.com/spanner/docs/reference/standard-sql/data-types) are narrower, which leaves less room for mistakes. For example, we can't accidentally choose int32 because only int64 is supported.
+1. New features seem to be released on GoogleSQL first, for example, <https://cloud.google.com/spanner/docs/ml>. We don't need this feature specifically, but it shows that new features support GoogleSQL first.
+1. There is a clearer split in the code between using Google Spanner and using native PostgreSQL, and we won't hit edge cases.
+
+Citations:
+
+1. Google (n.d.). _PostgreSQL interface for Spanner._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/postgresql-interface>
+1. Google (n.d.). _Dialect parity between GoogleSQL and PostgreSQL._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/reference/dialect-differences>
+
+### Multi-Regional
+
+Running Multi-Regional read-write is one of the biggest selling points of Spanner.
+When provisioning an instance you can choose a single-region or multi-region configuration.
+After provisioning you can [move an instance](https://cloud.google.com/spanner/docs/move-instance) whilst it is running, but this is a manual process that requires assistance from GCP.
+
+We will provision a Multi-Regional Cloud Spanner instance because:
+
+1. It won't require a migration to Multi-Regional in the future.
+1. We have Multi-Regional on day 0, which cuts the scope of multi-region deployments at GitLab.
+
+This will, however, increase the cost considerably. Using public-facing numbers from GCP:
+
+1. [Regional](https://cloud.google.com/products/calculator?hl=en&dl=CiRlMjU0ZDQyMy05MmE5LTRhNjktYjUzYi1hZWE2MjQ4N2JkNDcQIhokOTlGQUM4RjUtNjdBRi00QTY1LTk5NDctNThCODRGM0ZFMERC): $1,716
+1. [Multi Regional](https://cloud.google.com/products/calculator?hl=en&dl=CiQzNjc2ODc5My05Y2JjLTQ4NDQtYjRhNi1iYzIzODMxYjRkYzYQIhokOTlGQUM4RjUtNjdBRi00QTY1LTk5NDctNThCODRGM0ZFMERC): $9,085
+
+Citations:
+
+1. Google (n.d.). _Regional and multi-region configurations._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/instance-configurations>
+1. Google (n.d.). _Replication._ Google Cloud. Retrieved April 1, 2024, from <https://cloud.google.com/spanner/docs/replication>
+
+### Performance
+
+We haven't run any benchmarks ourselves because we don't have a full schema designed yet.
+However, according to the [performance documentation](https://cloud.google.com/spanner/docs/performance), both the read and write throughput of a Spanner instance scale linearly as you add more compute capacity.
+
+### Alternatives
+
+1. PostgreSQL: Running a multi-regional deployment requires a lot of operational work.
+1. ClickHouse: It's an `OLAP` database, not an `OLTP` one.
+1. Elasticsearch: It's a search and analytics document store, not an `OLTP` database.
+
 ## FAQ

 1. Does Global Service implement all services for Cells 1.0?
@@ -192,7 +262,7 @@ The original [Cells 1.0](iterations/cells-1.0.md) described [Primary Cell API](i
 1. How we will push all existing claims from "First Cell" into Global Service?

    We would add `rake gitlab:cells:claims:create` task. Then we would configure First Cell
-    to use Global Service, and execute the rake task. That way First Cell would claim all new
+    to use Global Service, and execute the Rake task. That way First Cell would claim all new
    records via Global Service, and concurrently we would copy data over.

 1. How and where the Global Service will be deployed?