Proof of Concept: Shard the GitLab database using decomposition (#5883) · Epics · GitLab.org

Proof of Concept: Shard the GitLab database using decomposition

## Summary Decomposition (FKA Application level sharding, Vertical Sharding) is a potential solution to scale GitLab's database to 10 million monthly active users. This approach relies on moving a feature tables into a separate logical database. Each of these databases would be partitioned and replicated using a regular tooling available for postgres. This proof of concept is trying to evaluate the following hypothesis: > _Decomposition ~~Vertical application level sharding~~ is a viable solution for scaling our database to 10M monthly active users_ This POC will introduce many logical databases each of them containing a feature-owned database structure. Application will connect to all these feature-owned databases to read data from. ## Problem to solve The current database architecture uses a single database cluster. We know that some features require more computing resources. By moving these tables to a separately owned databases we allow to ensure that features do not impact other features of a GitLab. This makes it much easier to model the data structure of these features, for example implementing time decay or another partitioning pattern. ## Solution Implement a proof of concept for decomposition ~~vertical application level sharding~~. The proposed solution may have the following properties: * Separate databases with a single application that are not part of the same database cluster * Defined responsibilities of databases and their expected SLAs and impact on a system availability * Completely self-sufficient database clusters with several read-only replicas that utilize PostgreSQL streaming replication. Patroni is used to manage these clusters and, for example, provide automatic failovers. * Define a features that will own their own databases, example, move all tables that are `ci_*` into a separate owned database * Move away from foreign keys for a separately owned databases and use instead `has_many ... on_destroy: delete_all` approach to ensure that data are removed * Change model definitions to use a separate owned databases * Figure out a way to transparently migrate (online) from a logical database A to logical database B * Figure out if we can apply partitioning schema during data migration * Figure out how a unavailability of the single database affects the whole GitLab #### Design overview ![Decomposition](/uploads/81208224da95f0a7a535bc4c4dd8f85e/Decomposition.png) ## Proof of concept implementation * Implement basic application-level sharding * Define many logical databases, provide a way to write a data migrations for each of them * Declare which models use which database * Evaluate a way to use [Rails 6](https://gitlab.com/gitlab-org/gitlab/-/issues/296870) for supporting many databases to avoid a custom made solution * Evaluate a way to ensure that a cross-shard data access is forbidden (use [`disable_joins`](https://github.com/rails/rails/pull/41937) * Evaluate an impact on the foreign keys or what foreign keys would have to be removed * Evaluate a way to migrate offline/online data within a sharding criteria between shards * Evaluate a way to iteratively use new application shards for validating application performance and being bug-free when supporting many databases * Evaluate how the code would be implemented in a FOSS, vs EE to reduce the sharding abstraction layer, as it is assumed that EE will only receive this feature * Evaluate how the shard-application-by-default approach can be implemented for on-premise installations: ex.: can we use many logical databases for application sharding in a same way how we run GitLab.com? * Evaluate a way how the unavailability of the database affects resiliency of GitLab ## Possible logical databases * gitlab_production_ci: all `ci_*` tables * gitlab_production_packages: all `packages_*` tables * gitlab_production_users: all `users*` and associated tables * gitlab_production_groups: all `projects/groups` and hierarchy * gitlab_production_code_review: all `issues/merge_requests` and associated * gitlab_production_archive: all `web_hook_logs` and other related (a data that is expected to be infrequently accessed) ## Testing * Evaluate how the sharding criteria impacts different functions of GitLab and a way to efficiently aggregate the data. For example usage of many logical databases might lead to increase of SQL queries, which might negatively impact performance of some features. Some features might need to be rewritten. ## Summary of Impact As we investigate different areas of the database potentially targeted for decomposition we will summarize the metrics below. [Assumptions for calculating %](https://gitlab.com/gitlab-org/gitlab/-/issues/331523#note_581047057) | Element | DB size (GB) | DB size (%) | Reads/s | Reads/s (%) | Writes/s | Writes/s (%) | |----------------|--------------|-------------|-----------|-------------|----------|--------------| | Web hook logs | 2964.1 | 22.39% | 52.5 | 0.00% | 110.0 | 2.82% | | Merge Requests | 2673.7 | 20.20% | 126073.4 | 1.31% | 795.4 | 20.40% | | CI | 4725.0 | 35.69% | 1712843.8 | 17.87% | 1909.2 | 48.98% | | Rest | 2876.3 | 21.73% | 7748488.5 | 80.82% | 1083.6 | 27.80% | [Google Sheet for reference](https://docs.google.com/spreadsheets/d/1U2cdneKwNOs37AiDxpyxKXtB3X3kFhhxr1hXPJ1bDwY/edit#gid=644136024)

epic