Explore CitusDB as a sharding solution
CitusDB is an extension to scale-out native Postgres.
Reliability-wise, CitusDB tolerates worker failures (transparently) and suggests to use a standard Postgres HA setup for the coordinator cluster.
License for open-source edition: GNU v3
One use-case for CitusDB is a multi-tenant application. This already led the thinking about exploring a "Tenancy Model" for GitLab and the discussion about partitioning (Read: "Migrating an existing app").
This is a proposal to explore CitusDB as a database sharding solution for GitLab.com.
In GitLab's case, most of the data resides in a namespace (projects, their issues, MRs, CI data, other data etc.) and around users. A top-level namespace is considered a candidate to think of as a "tenant" or the distribution key. There are cross-cutting concerns around user-related data that we might need to extract.
Goal
Understand feasibility, limitations, road blocks and design choices when migrating to CitusDB. The goal is to know
- what we'll have to do in order to migrate the application to CitusDB and
- whether or not CitusDB meets our expectations.
Approach
- Start with GDK and try to load the initial schema into CitusDB
- How do we shard
- How do we distribute the data
- How do we handle joins that don't fit our sharding strategy
- Set up a cluster for testing CitusDB
- Use data from staging for testing
Caveats
Worth to note that there are Postgres features that CitusDB currently does not support to execute across shards and we make use of.
| PG Feature | Usage in GitLab |
|---|---|
| Correlated subqueries | ? |
| Recursive CTEs | Yes |
| Table sample | Yes |
| SELECT … FOR UPDATE | Yes |
| Grouping sets | ? |
| Window functions that do not include the distribution column in PARTITION BY |
? |
Decision
Our decision to not continue with Citus exploration is outlined here: https://about.gitlab.com/handbook/engineering/development/enablement/database/doc/citus.html