Alternatives To Postgres DB (Google Spanner vs AlloyDB)

The objective of this issue is to discuss alternatives to self-managing the database (infrastructure & platform aspects), i.e. the current Postgres stack.

This discussion was raised by @rhns in this doc. This discussion should provide inputs for the Data Store Platform comparison at: https://gitlab.com/gitlab-org/gitlab/-/tree/as/postgresql-architecture/doc/architecture/blueprints/database/dadabase_operation

DRI: @rhns

Note:

This issue is limited to the PostgreSQL database used in GitLab, and this document is not intended to serve as a technical architecture of any kind.

Narrative

Let us move away from managing the infrastructure and some platform-specific aspects of the database, and position ourselves entirely on the application aspects of the DB. This will allow us to focus more on areas such as resource fungibility, application scalability, and customer satisfaction (SLO refinement).

Objective

The objective of this document is to seek approval to move away from self-managing the database (infrastructure & platform aspects), i.e. the current Postgres stack.

Approval of this issue would warrant a project to identify, evaluate, and stress-test an alternative that meets the above objective.

Problem

Taking on the responsibility of self-managing the database instances poses several challenges, captured below:

  • Scalability: The responsibility of managing the stack (such as upgrades) multiplies if we spin up more dedicated instances that must run on separate infrastructure for isolation reasons, such as Cells.

  • Automation: An objective of 100% automated database management, inclusive of major upgrades, is difficult to achieve due to the nature of the platform and the dependencies it comes with.

  • Cost: Human effort, finances, and opportunity cost are all directly impacted. If we were not managing infrastructure and platform, we could focus and align our efforts on areas that directly impact customer satisfaction.

  • Overhead: Since we manage the databases, maintaining database uptime also falls on us, whereas we should be focusing our efforts on GitLab as an application and offloading the infrastructure and platform reliability of systems like databases to the cloud/service provider.

  • Logging & Monitoring: Because we manage the tech stack, we are also responsible for the logging & monitoring infrastructure and its configuration.

  • Toil & Supervision: We currently treat the components of our systems as pets, in the pets-vs-cattle analogy. Because they are pets, they need significant care, manual supervision, and attention.
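To make the scalability point above concrete, here is a minimal sketch of how self-managed upgrade toil multiplies with per-Cell dedicated instances. All numbers (cell count, upgrade cadence, effort per upgrade) are hypothetical assumptions, not measured GitLab figures:

```python
# Hypothetical illustration: self-managed upgrade toil multiplies linearly
# with the number of Cells, since each Cell runs a dedicated instance and
# every upgrade cycle must be repeated once per Cell.
def annual_upgrade_hours(cells: int, upgrades_per_year: int,
                         hours_per_upgrade: float) -> float:
    return cells * upgrades_per_year * hours_per_upgrade

# One shared instance vs. ten Cells, two upgrade cycles a year, 8h each:
assert annual_upgrade_hours(1, 2, 8) == 16
assert annual_upgrade_hours(10, 2, 8) == 160  # 10x the toil
```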

Proposed Solution

Since we already have a footprint in the cloud, let us identify a product that allows us to offload infrastructure and platform management, upkeep, patching, and upgrade procedures. This will allow us to focus on the database from an application perspective: customer outcomes & satisfaction, scale, schema refinements, projects targeting aggressive data RPO/RTO (including PITR), and ideally an overall cost reduction (finances, opportunity cost, and overhead).
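As a rough illustration of the RPO point above, the sketch below (all figures hypothetical) shows how the worst-case data-loss window is bounded by the transaction-log archive lag when PITR is available, versus the full backup interval when it is not:

```python
# Hypothetical RPO check. With PITR, worst-case data loss is bounded by the
# lag of the WAL/transaction-log archive; without it, by the backup interval.
def worst_case_rpo_minutes(backup_interval_min: int, pitr: bool,
                           wal_lag_min: int = 5) -> int:
    return wal_lag_min if pitr else backup_interval_min

assert worst_case_rpo_minutes(24 * 60, pitr=False) == 1440  # daily backups only
assert worst_case_rpo_minutes(24 * 60, pitr=True) == 5      # PITR closes the gap
```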

Options

Extending from the above section, we have several options, but before considering them we need to establish evaluation criteria.

Options Evaluation Criteria

  1. Low Management Overhead: Ideally, we shouldn't be responsible for activities such as upgrades and patching; if these are transparent to the product, even better.
  2. Control: Offloading management tasks shouldn't reduce our control over the performance and existing availability aspects of our application.
  3. Alerting & Observability: Picking a cloud-based SaaS/PaaS product shouldn't compromise our existing monitoring & alerting standards. The product must come with mature overall observability.
  4. Scalability: Should be easy to scale physically (horizontally or vertically) and logically. This complements our Cell-based approach: it should be easy to turn up new instances and integrate them into it.
  5. Performance: The selected product must allow performance evaluations to be performed and should at least be on par with our existing benchmarking standards.
  6. Cost: Shouldn't result in an increase; the objective is an overall cost reduction.
  7. Switch Over Cost: If we consider alternative options, we should ideally keep the switch-over cost (resulting from code changes and other work) low.

Based on the above evaluation criteria, below are some options that appear to be a good fit.

Spanner

Despite being multi-master, with multiple read/write replicas plus read-only replicas, Spanner is considered a strongly consistent DB (achieved by using strong reads when reading data).

With Spanner we can easily support use cases such as separate regional and multi-region clusters, and we could also support isolation of large customers in order to avoid shared fate.
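To illustrate the strong-read semantics mentioned above, here is a toy, self-contained model (this is not the Spanner API) of the difference between a strong read, which observes every previously committed write, and a stale read, which serves data as of an earlier timestamp:

```python
# Toy model of strong vs. stale reads. NOT the Spanner API; it only
# illustrates the read semantics: a strong read observes every write
# committed before the read, while a stale read may serve older data.
class VersionedStore:
    def __init__(self):
        self._versions = []  # list of (commit_ts, key, value)
        self._now = 0        # monotonically increasing commit timestamp

    def write(self, key, value):
        self._now += 1
        self._versions.append((self._now, key, value))

    def read(self, key, staleness=0):
        # staleness=0 models a strong read (latest commit timestamp);
        # staleness>0 models a stale read that may miss recent writes.
        read_ts = self._now - staleness
        for ts, k, v in reversed(self._versions):
            if k == key and ts <= read_ts:
                return v
        return None

store = VersionedStore()
store.write("user:1", "alice")
store.write("user:1", "alice-renamed")
assert store.read("user:1") == "alice-renamed"       # strong read
assert store.read("user:1", staleness=1) == "alice"  # stale read
```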

| Evaluation Criteria | Value | Commentary |
| --- | --- | --- |
| Management Overhead | None - Low | The cloud product manages upgrades/patches & reliability aspects, inclusive of replication & backup components. |
| Control | High | Control is achieved by schema design and choice of node topology. |
| Alerting & Observability | Mature | Mature observability capabilities. |
| Scalability | High | Highly scalable with multiple R/W replicas, both physically and logically. Can be easily integrated with an autoscaling tool (scale up and out) based on throughput or other custom requirements. |
| Performance | High - Unvalidated* | Documented by public claims and personal experience. |
| Cost | Uncomputed | A reasonable calculation is needed. |
| Switch Over Cost | High | Learning curve; some changes in the way we have been interacting with databases. |
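As a sketch of the autoscaling integration mentioned above, the decision logic such a tool might apply could look like the following. The metric, thresholds, and step sizes are illustrative assumptions, not the behavior of any specific autoscaler:

```python
# Hypothetical scale-out/scale-in decision based on CPU utilization,
# similar in spirit to throughput-based autoscaling tools for Spanner.
# Thresholds and doubling/halving steps are illustrative assumptions.
def recommend_nodes(current_nodes: int, cpu_utilization: float,
                    high: float = 0.65, low: float = 0.45,
                    min_nodes: int = 1, max_nodes: int = 100) -> int:
    if cpu_utilization > high:           # overloaded: add capacity
        target = current_nodes * 2
    elif cpu_utilization < low:          # underused: shed capacity
        target = current_nodes // 2
    else:                                # within band: no change
        target = current_nodes
    return max(min_nodes, min(target, max_nodes))

assert recommend_nodes(3, 0.80) == 6   # scale out under load
assert recommend_nodes(4, 0.20) == 2   # scale in when idle
assert recommend_nodes(3, 0.55) == 3   # steady state
```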

AlloyDB - Postgres

| Evaluation Criteria | Value | Commentary |
| --- | --- | --- |
| Management Overhead | Medium | We remain responsible for some operations; upgrades in particular stay with us. With 10 Cells, each with a dedicated AlloyDB, we handle the upgrade process 10x, considering that an upgrade in AlloyDB means spinning up a new cluster and then moving data. |
| Control | Medium | We will be able to control some placement aspects. |
| Alerting & Observability | Unsure | Since the tool is still new, its observability capabilities are unclear. We may have limited ability to tweak it or build our own tooling around it. |
| Scalability | Complex | Not easily scalable; manual actions and some updates are required to either scale up the primary or add more read-only replicas. Hence, this process cannot be integrated with an autoscaling tool to automatically scale up or turn up new replicas. |
| Performance | High - Unvalidated* | Documented by public claims and personal experience. |
| Cost | Uncomputed | A reasonable calculation is needed. |
| Switch Over Cost | Medium | Since this is Postgres, low effort is needed, as there won't be any application-level code changes. But we may have to make certain config-level changes to adjust to AlloyDB. |

Cloud SQL - Postgres

<To be added….>

Edited Oct 31, 2023 by Kennedy Wanyangu