Explore CitusDB as a sharding solution
CitusDB is an extension to scale-out native Postgres.
Reliability-wise, CitusDB tolerates worker failures (transparently) and suggests to use a standard Postgres HA setup for the coordinator cluster.
License for open-source edition: GNU v3
One use-case for CitusDB is a multi-tenant application. This already led the thinking about exploring a "Tenancy Model" for GitLab and the discussion about partitioning (Read: "Migrating an existing app").
This is a proposal to explore CitusDB as a database sharding solution for GitLab.com.
In GitLab's case, most of the data resides in a namespace (projects, their issues, MRs, CI data, other data etc.) and around users. A top-level namespace is considered a candidate to think of as a "tenant" or the distribution key. There are cross-cutting concerns around user-related data that we might need to extract.
Goal
Understand feasibility, limitations, road blocks and design choices when migrating to CitusDB. The goal is to know
- what we'll have to do in order to migrate the application to CitusDB and
- whether or not CitusDB meets our expectations.
Approach
- Start with GDK and try to load the initial schema into CitusDB
- How do we shard
- How do we distribute the data
- How do we handle joins that don't fit our sharding strategy
- Set up a cluster for testing CitusDB
- Use data from staging for testing
Caveats
Worth to note that there are Postgres features that CitusDB currently does not support to execute across shards and we make use of.
PG Feature | Usage in GitLab |
---|---|
Correlated subqueries | ? |
Recursive CTEs | Yes |
Table sample | Yes |
SELECT … FOR UPDATE | Yes |
Grouping sets | ? |
Window functions that do not include the distribution column in PARTITION BY |
? |
Decision
Our decision to not continue with Citus exploration is outlined here: https://about.gitlab.com/handbook/engineering/development/enablement/database/doc/citus.html