# Proof of Concept: Shard the GitLab database using horizontal application-level sharding
## Summary

Application-level sharding is a potential solution for scaling GitLab's database to 10 million monthly active users. This approach relies on middleware, developed by GitLab, that routes queries to several different database shards. This proof of concept evaluates the following hypothesis:

> _Horizontal application-level sharding using top-level namespace as a sharding key is a viable solution for scaling our database to 10M monthly active users._

This POC will implement a minimal middleware and evaluate how it handles several different databases that contain data associated with a sharding key. The initial sharding key for this POC is **top-level namespaces**.

There are three main workstreams:

* **[Make the application shard-aware](https://gitlab.com/groups/gitlab-org/-/epics/5941)** - this covers concerns about general GitLab functionality. We need to implement a model that allows us to seamlessly support most of the application. For example, we still need to create issues and projects within a shard. What about comments?
* **[Manage logical database shards](https://gitlab.com/groups/gitlab-org/-/epics/5839)** - by delegating concerns about data location, access patterns, rebalancing, etc. to the application, we need to manage these ourselves. For example, how do we create IDs? How do we move things?
* **[Testing POC on production data](https://gitlab.com/groups/gitlab-org/-/epics/5840)** - this is mostly concerned with understanding how the first two workstreams behave on real data.

## Problem to solve

The current database architecture uses a single database cluster. Horizontal scaling is not possible, so GitLab is limited to efficiency improvements on the existing database (e.g. implementing time decay) or to vertically scaling the database (e.g. buying larger hardware). We need to find a way to scale horizontally to enable future growth.

## Solution

Implement a proof of concept for horizontal application-level sharding.
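As a rough illustration of the routing decision the middleware has to make, the sketch below maps a top-level namespace (the sharding key) to a logical shard. All names here (`ShardRouter`, `SHARDS`) are hypothetical and not actual GitLab code; this is only a minimal model of the idea, not the planned implementation.

```ruby
require 'zlib'

# Hypothetical sketch: deterministically route a sharding key
# (the top-level namespace path) to one of several logical shards.
class ShardRouter
  # Logical shard names; in a real POC these would map to database
  # configurations rather than plain strings.
  SHARDS = %w[shard_01 shard_02 shard_03 shard_04].freeze

  # The same namespace always resolves to the same shard.
  def self.shard_for(top_level_namespace)
    index = Zlib.crc32(top_level_namespace) % SHARDS.size
    SHARDS[index]
  end
end
```

Note that a hash-based mapping like this makes the "migrate data between shards" workstream harder, since moving one namespace changes nothing about the hash; a real implementation would more plausibly use a lookup table (namespace → shard) so individual namespaces can be rebalanced, which is exactly the kind of trade-off this POC is meant to surface.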
The proposed solution may have the following properties:

* Separate databases, accessed by a single application, that are not part of the same database cluster.
* Defined responsibilities for each database, along with its expected SLA and its impact on system availability.
* Completely self-sufficient database clusters with several read-only replicas that use PostgreSQL streaming replication. Patroni manages these clusters and, for example, provides automatic failovers.
* Database servers are sharded by default into many logical databases.
* If individual shards grow, we move their databases to a separate database server to spread the load.
* Logical shards simplify management, because a whole database server (containing many shards) can be replicated and backed up with existing automation.
* Logical shards also reduce the number of open database connections needed, as a single physical connection can access many logical shards.
* The smallest shardable entity is a **project**. For the first implementation we allow a much broader sharding criterion to be **defined**: the **top-level namespace**.

#### Design overview

![2021-04-21-DB-current-vs-app-sharding-v2](/uploads/7199d5e8faa0a4f81130f308ddb7c365/2021-04-21-DB-current-vs-app-sharding-v2.png)

## Proof of concept implementation

* Implement basic application-level sharding.
* Define a main database and many shards, each containing a hierarchy of `top-level namespace > group > project`.
* Implement a minimal middleware on top of the database access layer to intelligently choose application shards based on the data being accessed.
* Evaluate using [Rails 6](https://gitlab.com/gitlab-org/gitlab/-/issues/296870) support for multiple databases to avoid a custom-made solution.
* Evaluate a way to discover cross-shard joins, where some queries might try to access data that cannot be reached from a single shard.
* Evaluate a way to ensure that cross-shard data access is forbidden.
* Evaluate the impact on foreign keys, and which foreign keys would have to be removed.
* Evaluate a way to have consistent and unique IDs across all shards.
* Evaluate a way to migrate data within a sharding criterion between shards, both offline and online.
* Evaluate a way to iteratively roll out new application shards, to validate application performance and the absence of bugs when supporting many databases.
* Evaluate how the code would be split between FOSS and EE to reduce the sharding abstraction layer, as it is assumed that only EE will receive this feature.
* Evaluate how the shard-application-by-default approach can be implemented for on-premise installations: for example, can we use many logical databases for application sharding in the same way we run GitLab.com?

## Testing

* Evaluate how the sharding criterion impacts different functions of GitLab, and find a way to efficiently aggregate the data. For example, sharding by top-level namespace may break some functions, including user groups, all user issues and merge requests, the explore function, and other instance-wide features.

Confidence level 35% - based on feedback from team
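One conceivable way to approach the "ensure that cross-shard data access is forbidden" evaluation is to pin all database access inside a block to a single shard and raise if code attempts to switch shards within it. The sketch below is hypothetical (the `ShardGuard` class and its API are invented for illustration, not GitLab's design), but it shows the kind of guard rail the middleware could enforce:

```ruby
# Hypothetical sketch: forbid cross-shard access by pinning a shard for
# the duration of a block and raising on any attempt to switch inside it.
class ShardGuard
  CrossShardAccessError = Class.new(StandardError)

  # The shard currently pinned on this thread, if any.
  def self.current
    Thread.current[:current_shard]
  end

  # Pin all access inside the block to a single shard. Nested calls for
  # the same shard are allowed; a different shard raises.
  def self.use(shard)
    if current && current != shard
      raise CrossShardAccessError,
            "already pinned to #{current}, refusing to access #{shard}"
    end
    previous = current
    Thread.current[:current_shard] = shard
    begin
      yield
    ensure
      Thread.current[:current_shard] = previous
    end
  end
end
```

In use, `ShardGuard.use('shard_01') { ShardGuard.use('shard_02') { ... } }` would raise `CrossShardAccessError`, surfacing a cross-shard access at runtime; discovering such accesses statically (e.g. in queries with joins) is a separate problem that the POC also sets out to evaluate.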
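For the "consistent and unique IDs across all shards" evaluation, one commonly used pattern (not necessarily what GitLab will choose; alternatives include per-shard ID ranges or a central sequence service) is to compose IDs from a timestamp, a shard number, and a per-process sequence, so every shard can generate IDs independently and the owning shard is recoverable from the ID itself. A minimal sketch, with all names and bit widths chosen purely for illustration:

```ruby
# Hypothetical sketch of "snowflake"-style shard-unique ID generation:
# a 64-bit ID composed of a millisecond timestamp, a shard number, and
# a per-process sequence.
class ShardIdGenerator
  EPOCH = 1_600_000_000_000 # custom epoch in milliseconds (arbitrary)
  SHARD_BITS = 10           # up to 1024 shards
  SEQUENCE_BITS = 12        # up to 4096 IDs per millisecond per shard

  def initialize(shard_number)
    raise ArgumentError, 'shard number too large' if shard_number >= (1 << SHARD_BITS)
    @shard_number = shard_number
    @sequence = 0
  end

  def next_id
    timestamp = (Time.now.to_f * 1000).to_i - EPOCH
    @sequence = (@sequence + 1) % (1 << SEQUENCE_BITS)
    (timestamp << (SHARD_BITS + SEQUENCE_BITS)) |
      (@shard_number << SEQUENCE_BITS) |
      @sequence
  end

  # Recover the shard number embedded in an ID.
  def self.shard_of(id)
    (id >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
  end
end
```

An ID scheme like this would also interact with the migration workstream: if data moves between shards, the shard number baked into old IDs becomes historical rather than authoritative, which is one of the trade-offs the POC would need to evaluate.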