Evaluation: Holistic approaches: Shard GitLab database using Application Sharding
Executive Summary
This issue is to use application-level sharding to shard GitLab database. This is meant to define a holistic approach that allows us to shard all current database and all added data in the future in a well-defined approach.
This can imply evaluating:
- extending GitLab Rails application with support for many databases
- introduce a middleware for database access layer to inteligently access additional database
- define a boundaries around data access and data affinity
Design validation goals
- Use separate databases with a single application that are not part of the same cluster
- Define responsibilities of databases and their expected SLAs and impact on a system availability
- Each physical database is treated as a complete self-sufficient instance with patroni, many primaries and replicas
- Each physical database will be sharded by default by using many logical databases
- If logical shards do grow we move logical databases to a separate physical databases to spread the load
- The logical shards are used as this simplifies management as the whole physical database (containing many shards) can be replicated and automated with backup functions
- The logical shards are also desired to reduce amount of needed open database connections, as a single physical connection can be used to access many logical shards
- Evaluate a way to ensure that a cross-shard data access is forbidden
- Evaluate a way to use Rails 6 for supporting many databases to avoid a custom made solution
- The smallest shardable entity is Project. However, for the focus on a first implementation we allow to define a much broader sharding critiera: top-level group or group
- Evaluate a way to migrate offline/online data within a sharding criteria between shards
- Evaluate a way to iteratively use new application shards for validating application performance and being bug-free when supporting many databases
- Evaluate after defining sharding criteria impact on a different functions of GitLab and a way to efficiently aggregate the data:
- ex.: if we shard by project, it means that for a single group project can be on many shards: how we would aggregate group level features?
- ex2.: if we shard by top-level group, likely only some functions would be broken, like: user groups, all user issues and all user merge requests, the explore function: how we would aggregate instance-wide level features?
- Evaluate how the code would be implemented in a FOSS, vs EE to reduce the sharding abstraction layer, as it is assumed that EE will only receive this feature
- Evaluate how the shard-application-by-default approach can be implemented for on-premise installations: ex.: can we use many logical databases for application sharding in a same way how we run GitLab.com?
Demo Goals
- Implement application-level sharding
- Implement middleware on database access layer to intelligently choose application shards based on accessed data
- Define a main database, and many logical shards that will contain a hierarchy of
top-level group > group > project
- Define a sharding critiera and sharding affinity (what type of data, and up-to what level) is accessible within a single application shard
- Evaluate a ways to discover a cross-shards joins where some queries might try to access sibling projects or groups which are impossible to be accessed
- Evaluate an impact on the foreign keys or what foreign keys would have to be removed
- Evaluate a way to have a consistent and unique IDs across all shards
- Evaluate a way to copy data over in a manual/offline way
Edited by Kamil Trzciński