Evaluation: Holistic approaches: Shard GitLab database using Application Sharding

Executive Summary

This issue proposes using application-level sharding to shard the GitLab database. The goal is to define a holistic, well-defined approach that lets us shard all existing data and all data added in the future.

This can imply evaluating:

extending the GitLab Rails application with support for many databases
introducing middleware in the database access layer to intelligently access additional databases
defining boundaries around data access and data affinity
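
As a rough illustration of the middleware and data-affinity points above, the sketch below routes access by top-level group and rejects access to a different shard while one is pinned. It is plain Ruby, not Rails code: `ShardRouter`, `CrossShardAccessError`, the `:main` default, and the group-to-shard directory are all hypothetical names invented for this example, and the sketch is not thread-safe.

```ruby
# Hypothetical sketch: none of these names are GitLab or Rails APIs.
class CrossShardAccessError < StandardError; end

class ShardRouter
  # directory: Hash mapping a top-level group ID to a logical shard name.
  def initialize(directory, default_shard: :main)
    @directory = directory
    @default_shard = default_shard
    @current = nil # shard the current unit of work is pinned to, if any
  end

  # Resolve the logical shard that owns a top-level group.
  def shard_for(top_level_group_id)
    @directory.fetch(top_level_group_id, @default_shard)
  end

  # Pin the block to the shard that owns the given top-level group.
  def with_group(top_level_group_id)
    previous = @current
    @current = shard_for(top_level_group_id)
    yield @current
  ensure
    @current = previous
  end

  # Intended to be called by the data-access layer before touching
  # group-owned data; raises if the data lives on a different shard
  # than the one currently pinned.
  def check_access!(top_level_group_id)
    target = shard_for(top_level_group_id)
    return target if @current.nil? || @current == target

    raise CrossShardAccessError, "pinned to #{@current}, attempted #{target}"
  end
end
```

For example, `router.with_group(1) { router.check_access!(2) }` would raise `CrossShardAccessError` when groups 1 and 2 are mapped to different shards, which is one possible way to make the data-access boundary explicit.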

Design validation goals

Use separate databases that are not part of the same cluster with a single application
Define the responsibilities of each database, its expected SLA, and its impact on system availability
Each physical database is treated as a complete, self-sufficient instance with Patroni, many primaries and replicas
Each physical database will be sharded by default into many logical databases
If logical shards grow, we move logical databases to separate physical databases to spread the load
Logical shards are used because they simplify management: a whole physical database (containing many shards) can be replicated and its backups automated
Logical shards are also desirable because they reduce the number of open database connections needed, as a single physical connection can be used to access many logical shards
Evaluate a way to ensure that cross-shard data access is forbidden
Evaluate using Rails 6 support for many databases to avoid a custom-made solution
The smallest shardable entity is a Project. However, for the first implementation we allow a much broader sharding criterion: top-level group or group
Evaluate a way to migrate data within a sharding criterion between shards, offline or online
Evaluate a way to iteratively adopt new application shards to validate that the application performs well and is bug-free when supporting many databases
After defining the sharding criteria, evaluate the impact on different functions of GitLab and a way to efficiently aggregate the data:
- ex.: if we shard by project, the projects of a single group can be on many shards: how would we aggregate group-level features?
- ex2.: if we shard by top-level group, likely only some functions would be broken, such as user groups, all user issues, all user merge requests, and the explore function: how would we aggregate instance-wide features?
Evaluate how the code would be split between FOSS and EE to reduce the sharding abstraction layer, as it is assumed that only EE will receive this feature
Evaluate how the shard-application-by-default approach can be implemented for on-premise installations: ex.: can we use many logical databases for application sharding in the same way we run GitLab.com?
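
The relationship between logical shards and physical databases described in the goals above can be sketched as a small placement table. This is an illustrative plain-Ruby model only: `ShardPlacement` and the shard and database names are hypothetical, and `connections_needed` simply encodes this issue's assumption that logical shards on the same physical database can share one connection.

```ruby
# Hypothetical sketch: models placement only, performs no database work.
class ShardPlacement
  def initialize
    @physical_for = {} # logical shard name => physical database name
  end

  # Place (or move) a logical shard onto a physical database.
  def place(logical, physical)
    @physical_for[logical] = physical
  end

  # Physical database currently hosting a logical shard.
  def physical_for(logical)
    @physical_for.fetch(logical)
  end

  # Physical database name => list of logical shards it hosts.
  def layout
    @physical_for
      .group_by { |_logical, physical| physical }
      .transform_values { |pairs| pairs.map(&:first) }
  end

  # Number of distinct physical databases behind a set of logical shards,
  # i.e. connections needed if one connection per physical database suffices.
  def connections_needed(logicals)
    logicals.map { |logical| physical_for(logical) }.uniq.size
  end
end

placement = ShardPlacement.new
placement.place(:shard_1, :pg_a)
placement.place(:shard_2, :pg_a)
# shard_2 has grown, so it is moved to its own physical database.
placement.place(:shard_2, :pg_b)
```

Keeping the logical-to-physical mapping as data, separate from the routing decision, means moving a shard changes only this table and not the sharding criteria.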

Demo Goals

Implement application-level sharding
Implement middleware in the database access layer to intelligently choose application shards based on accessed data
Define a main database and many logical shards that will contain a hierarchy of top-level group > group > project
Define the sharding criteria and sharding affinity (what type of data, and up to what level, is accessible within a single application shard)
Evaluate ways to discover cross-shard joins, where some queries might try to access sibling projects or groups that cannot be accessed
Evaluate the impact on foreign keys and which foreign keys would have to be removed
Evaluate a way to have consistent and unique IDs across all shards
Evaluate a way to copy data over in a manual/offline way
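
For the unique-ID goal above, one commonly used option (not one this issue has chosen) is to embed the shard number in the low bits of every ID, so that each shard can keep its own local sequence without colliding with the others. The sketch below assumes 10 bits for the shard number; `ShardedId` and `SHARD_BITS` are hypothetical names.

```ruby
# Hypothetical sketch of shard-embedded IDs; the bit width is an assumption.
module ShardedId
  SHARD_BITS = 10                    # assumption: at most 1024 logical shards
  SHARD_MASK = (1 << SHARD_BITS) - 1

  # Combine a shard number and a per-shard sequence value into one ID.
  def self.encode(shard_number, local_sequence)
    unless (0..SHARD_MASK).cover?(shard_number)
      raise ArgumentError, "shard number out of range"
    end
    raise ArgumentError, "sequence must be non-negative" if local_sequence.negative?

    (local_sequence << SHARD_BITS) | shard_number
  end

  # Split an ID back into [shard_number, local_sequence].
  def self.decode(id)
    [id & SHARD_MASK, id >> SHARD_BITS]
  end
end

id = ShardedId.encode(3, 1)       # shard 3, first row on that shard
shard, sequence = ShardedId.decode(id)
```

A side effect of this layout is that the owning shard can be read from the ID alone, which may also help when copying data between shards, although moving a row to another shard would then mean its ID no longer matches its location.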

Edited Apr 15, 2021 by Kamil Trzciński