Alternatives To Postgres DB (Google Spanner vs AlloyDB)
The objective of this issue is to discuss alternatives to Self Management of the Database (Infra & Platform aspects), i.e. the current Postgres stack.
This discussion was raised by @rhns in this doc. This discussion should provide inputs for the Data Store Platform comparison at: https://gitlab.com/gitlab-org/gitlab/-/tree/as/postgresql-architecture/doc/architecture/blueprints/database/dadabase_operation
DRI: @rhns
Note:
This issue is limited to the PostgreSQL database used in GitLab, and this document does not intend to serve as a Technical Architecture of any kind.
Narrative
Let us move away from managing the Infrastructure & some Platform-specific aspects of the Database and position ourselves entirely on the Application aspects of the DB.
This will allow us to focus more on areas such as Resource Fungibility, Application Scalability, and Customer Satisfaction -> SLO refinement.
Objective
The objective of this document is to seek approval for moving away from Self Management of the Database (Infra & Platform aspects), i.e. the current Postgres stack.
Approval of this issue would warrant a project to identify, evaluate, and stress-test an alternative that meets the above objective.
Problem
Taking on the responsibility of Self Managing the Database Instances poses some challenges, as captured below:
- Scalability: the responsibility of managing the stack (such as upgrades) gets multiplied if we were to spin up more dedicated instances that need to run on separate Infra (for any isolation reasons), such as Cells.
- Automation: setting an objective of 100% automated database management, inclusive of Major Upgrades, is difficult to achieve. This is due to the nature of the platform and the dependencies it comes with.
- Cost: Human Effort, Finances & Opportunity Cost are directly impacted. If we were not managing infrastructure and platform, we would be focusing and aligning our efforts more on areas/factors which directly impact customer satisfaction.
- Overhead: since we are managing the databases, maintaining database uptime also falls onto us, whereas we should only be focusing our efforts on GitLab as an application and offload the infrastructure and platform reliability aspects of systems like Databases to the Cloud/Service Provider.
- Logging & Monitoring: since we are managing the tech stack, we are also responsible for managing the infrastructure for logging & monitoring, as well as its configuration.
- Toil & Supervision: currently, the way we perceive the components of our systems falls into the category of Pets, in the analogy of Pets vs Herd. Since they are pets, they need a lot of attention, care, and some manual supervision.
Proposed Solution
Since we already have our footprint in the Cloud, let us identify a product which will allow us to offload all Infrastructure and Platform management, upkeep, patching, and upgrade procedures. This will allow us to focus more on the database from the application perspective, i.e. focus on customer outcomes & satisfaction, scale, and schema refinements, undertake projects to achieve aggressive Data RPO and RTO targets (incl. PiTR), and ideally reduce cost in all its aspects (actual finances, opportunity cost, and overhead).
Options
Extending from the above section, we have some options; but in order to consider them, we first need to establish the Options Evaluation Criteria.
Options Evaluation Criteria
- Low Management Overhead: ideally we shouldn't be responsible for activities such as Upgrades and Patching, and if these are transparent to the product then even better.
- Control: offloading management tasks shouldn't impact our control over dictating the performance and existing availability aspects of our application.
- Alerting & Observability: picking a cloud-based SaaS/PaaS product shouldn't result in a compromise of our existing Monitoring & Alerting standards. The product must come with mature overall observability capabilities.
- Scalability: should be easy to scale physically (horizontally or vertically) and logically (this complements our Cell-based approach, i.e. it should be easy to turn up and integrate with ease into our new Cell Based Approach).
- Performance: the selected product must allow performance evaluations to be performed and should at least be on par with our existing benchmarking standards.
- Cost: shouldn't result in an increase; instead, the objective should be to observe an overall reduction in cost.
- Switch Over Cost: if we consider alternate options, we should ideally keep this cost (resulting from code changes and others) low.
Based upon the above Evaluation Criteria, below are some of the options which appear to be good candidates.
Spanner
Despite being Multi-Master and allowing multiple R/W replicas plus Read-Only replicas, Spanner is considered a Strongly Consistent DB (this is achieved by using Strong Reads when reading the data).
With Spanner we can easily support use-cases such as having separate Regional or Multi-Region clusters, and we could also easily support isolation for Large Customers in order to avoid Shared Fate.
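To make the Strong Reads point concrete, below is a minimal sketch using the Python `google-cloud-spanner` client (the instance, database, and table names are hypothetical). Reads served through `database.snapshot()` are strong by default and observe all previously committed writes regardless of which replica serves them, while bounded-stale reads have to be requested explicitly:

```python
from datetime import timedelta

from google.cloud import spanner

# Hypothetical instance/database names, for illustration only.
client = spanner.Client()
instance = client.instance("gitlab-main")
database = instance.database("gitlab-db")

# Strong read (the default for a snapshot): guaranteed to observe every
# transaction committed before the read started, on any replica.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT id, path FROM projects LIMIT 10")
    for row in rows:
        print(row)

# Bounded-staleness read: trades freshness for latency and must be opted
# into explicitly, so strong consistency remains the default behaviour.
with database.snapshot(exact_staleness=timedelta(seconds=15)) as snapshot:
    rows = snapshot.execute_sql("SELECT COUNT(*) FROM projects")
    print(list(rows))
```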
| Evaluation Criteria | Value | Commentary |
|---|---|---|
| Management Overhead | None - Low | Cloud Product manages any upgrades/patches & reliability aspects. Inclusive of Replication & Backup components. |
| Control | High | Control is achieved by Schema Design and Choice of Node Topology. |
| Alerting & Observability | Mature | Mature observability capabilities. |
| Scalability | High | Highly Scalable with multiple R/W replicas; easy both physically and logically. Can be easily integrated with any Auto Scaling Tool (scale up and scale out) based on Throughput or any other custom requirement (a rough sketch follows the table). |
| Performance | High - Unvalidated* | This is documented by public claims and personal experience. |
| Cost | Uncomputed | A reasonable cost calculation still needs to be performed. |
| Switch Over Cost | High | Learning curve, some changes in the way we have been interacting with databases. |
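As a rough illustration of the auto-scaling integration mentioned in the table above, the sketch below resizes a Spanner instance by updating its node count through the Python client. The instance name and target size are hypothetical, and in practice a dedicated autoscaler (driven by CPU or throughput metrics) would decide when to invoke something like this:

```python
from google.cloud import spanner


def resize_spanner_instance(instance_id: str, node_count: int) -> None:
    """Scale a Spanner instance to the given node count.

    An external autoscaler would call this after comparing observed
    CPU/throughput metrics against a target utilisation.
    """
    client = spanner.Client()
    instance = client.instance(instance_id)
    instance.reload()                # fetch the current configuration
    if instance.node_count == node_count:
        return                       # already at the desired size
    instance.node_count = node_count
    operation = instance.update()    # long-running operation
    operation.result(timeout=300)    # block until the resize completes


# Hypothetical usage: scale out to 5 nodes.
resize_spanner_instance("gitlab-main", 5)
```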
AlloyDB - Postgres
| Evaluation Criteria | Value | Commentary |
|---|---|---|
| Management Overhead | Medium | We will still be responsible for some operations; for example, upgrades will still be on us. So if we have 10 Cells, each with a dedicated AlloyDB cluster, then we need to handle the upgrade process 10x, considering that an upgrade in AlloyDB means spinning up a new cluster and then moving data. |
| Control | Medium | We will be able to control some placement aspects. |
| Alerting & Observability | Unsure | Since the tool is still new, its observability capabilities are unclear. We may have limited control to tweak and create our own tooling around this. |
| Scalability | Complex | Not easily scalable; manual actions and some updates will be required to either scale up the Master or add more Read-Only replicas. Hence, this process cannot be integrated with any auto-scaling tool to automatically scale up or turn up new replicas. |
| Performance | High - Unvalidated* | This is documented by public claims and personal experience. |
| Cost | Uncomputed | A reasonable cost calculation still needs to be performed. |
| Switch Over Cost | Medium | Since this is Postgres, only a small amount of effort is needed, as there won't be any Application Level Code Changes; we may, however, have to make certain config-level changes to adjust to AlloyDB (see the sketch below the table). |
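To illustrate the low switch-over cost, here is a minimal sketch (connection details are hypothetical): AlloyDB speaks the standard PostgreSQL wire protocol, so an existing Postgres driver such as `psycopg2` connects unchanged and only the connection configuration differs:

```python
import psycopg2

# Hypothetical connection details; only the host and credentials change when
# pointing an existing Postgres application at an AlloyDB instance.
conn = psycopg2.connect(
    host="10.0.0.5",        # AlloyDB private IP or an Auth Proxy endpoint
    port=5432,
    dbname="gitlabhq_production",
    user="gitlab",
    password="change-me",
)

with conn, conn.cursor() as cur:
    # AlloyDB reports a standard PostgreSQL version string, so existing
    # application code and queries run without modification.
    cur.execute("SELECT version();")
    print(cur.fetchone())

conn.close()
```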
Cloud SQL - Postgres
<To be added….>