[db] Database Selection - ClickHouse
Problem to Solve
Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to consider alternative databases that overlap with Kuzu's querying capabilities and general performance.
The intent of this issue is to provide a broad analysis of building a graph query layer on top of ClickHouse and expose tools for agents to traverse GitLab's SDLC data with graph-like capabilities.
But... ClickHouse is not a graph database?
ClickHouse is not a graph database, but it is great at performing data analysis on large datasets. It is also already used by GitLab, which means it doesn't require purchasing a new license, setting up and maintaining the infrastructure for a new database, or hiring new SREs who are specialized in operating these databases in production.
All of these constraints make ClickHouse a strong contender if it is able to perform well on graph-like SQL queries.
Benchmarks methodology
Data generation
For both single-table and multi-table, we generated a giant dataset that was directly imported into ClickHouse, on which we ran our various tests. Here's the configuration we used to generate the graph.
Note: We called it gitlab-org because of the giant nature of the dataset and not because it reflects actual numbers.
```rust
pub fn gitlab_org() -> Self {
    Self {
        num_tenants: 3,
        groups_per_tenant: 5000,
        max_subgroup_depth: 8,
        projects_per_group: 18,
        users_per_tenant: 500000,
        issues_per_project: 120,
        mrs_per_project: 100,
        epics_per_group: 100,
    }
}
```
This configuration generates over 25 million nodes and 93 million relationships which span across multiple tenants (separated by `tenant_id` in the tables).
The actual scripts can be found here for single-table and here for multi-table.
Running the benchmark
Both the single-table and multi-table benchmarks were run against a set of queries, which have been edited to best fit each schema so the results are comparable. Each query was run in parallel 5 times to simulate how the database reacts to having to evaluate multiple queries on large datasets.
Here are the results...
Single-Table
This approach attempts to model entities using two generic and schema-flexible tables:
- `nodes`: A single table containing all entities (e.g., Users, Projects, Issues). A `node_type` column distinguishes between them, and a `properties` column (JSON) stores their specific attributes.
- `relationships`: A single table for all connections. A `type` column defines the relationship (e.g., `GROUP_HAS_PROJECT`, `MR_CLOSES_ISSUE`).
The tables each have a bloom filter on node_type and type and a projection on the reverse relationships to help with query efficiency.
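For concreteness, here is a minimal sketch of what this layout could look like in ClickHouse DDL. The table and column names, key ordering, and the plain-`String` JSON column are illustrative assumptions, not the actual benchmark schema:

```sql
-- Hypothetical single-table schema sketch: one table for all nodes,
-- one for all relationships. Bloom-filter indexes on the type columns
-- and a projection re-sorted by target support reverse traversals.
CREATE TABLE nodes
(
    tenant_id  UInt64,
    node_id    UInt64,
    node_type  LowCardinality(String),
    properties String,  -- JSON-encoded, interpreted at the application level
    INDEX idx_node_type node_type TYPE bloom_filter GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (tenant_id, node_type, node_id);

CREATE TABLE relationships
(
    tenant_id UInt64,
    source_id UInt64,
    target_id UInt64,
    type      LowCardinality(String),
    INDEX idx_type type TYPE bloom_filter GRANULARITY 1,
    -- Projection sorted by target_id to serve reverse-direction lookups
    PROJECTION p_reverse
    (
        SELECT * ORDER BY (tenant_id, target_id)
    )
)
ENGINE = MergeTree
ORDER BY (tenant_id, source_id);
```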
Results
Here are some results we obtained from running the benchmark query set:
The system demonstrates excellent performance for simple operations, handling basic node lookups and single-hop traversals in under 30 milliseconds. However, performance degrades dramatically on more complex queries, particularly multi-hop traversals involving two or more steps, where response times balloon to between 500 and over 1,000 milliseconds. This seems to align with the ClickHouse documentation's guidance that joins should be limited where millisecond-level performance is required.
This approach offers high flexibility, allowing new entity and relation types to be added without requiring schema changes. The drawbacks, however, are significant: multi-hop queries suffer from poor join performance due to the need for repeated scans and joins across massive tables. The tables also lack strong typing guarantees, since data integrity and type enforcement must be handled at the application level rather than being built into the database schema itself.
Multi-Table
This model uses a traditional relational or property graph design with a specific table for each entity and relationship type.
- Node Tables: Separate tables for each node type (e.g., `groups`, `projects`, `issues`) with strongly-typed columns for properties.
- Edge Tables: Separate tables for each relationship type (e.g., `group_has_project`, `project_has_issue`), linking the primary keys of the corresponding node tables.
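A minimal sketch of the multi-table layout, with illustrative (assumed) column choices rather than the actual benchmark DDL:

```sql
-- Hypothetical multi-table schema sketch: one strongly-typed table per
-- entity type and one narrow table per relationship type.
CREATE TABLE projects
(
    tenant_id  UInt64,
    project_id UInt64,
    name       String,
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (tenant_id, project_id);

-- Edge table linking the primary keys of groups and projects
CREATE TABLE group_has_project
(
    tenant_id  UInt64,
    group_id   UInt64,
    project_id UInt64
)
ENGINE = MergeTree
ORDER BY (tenant_id, group_id);
```

Because each edge table only contains one relationship type, joins scan far less data than the generic `relationships` table in the single-table design.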
Results
Here are some results we obtained from running the benchmark query set:
The system delivers well-balanced performance across a wide range of query types, with 45% of user-facing queries completing in under 100 milliseconds and 70% of analytics queries finishing within 500 milliseconds. It efficiently handles one, two, and three-hop traversals thanks to smaller and more specific tables that optimize join operations.
The approach offers significant advantages in performance and reliability: joins execute quickly on focused tables rather than generic ones, and the schema enforces strong data integrity by validating types and relationships at the database level.
The primary drawback is schema rigidity: adding a new entity or relationship type requires schema migrations and the creation of new tables, making the system less flexible when data models need to evolve.
Analyzing the results
Both architectural approaches present distinct trade-offs between flexibility and performance. The single-table design excels in adaptability, enabling rapid schema evolution without migrations, but falls well short of the multi-table design on multi-hop query performance. The multi-table approach delivered strong performance across both simple and complex queries, with most completing well within acceptable latency thresholds thanks to purpose-built tables.
While the multi-table design requires more upfront planning and schema migrations for changes, its superior query performance and built-in data integrity make it the more suitable choice for production workloads where reliability and speed are paramount. The decision ultimately hinges on whether we prioritize schema flexibility or query performance. In our case, we believe the multi-table approach would be a great candidate if we decide to go with ClickHouse as a database.
Potential areas of improvements
The Kuzu team has published a white paper on their database technology, and there are ideas from it we could investigate to build our graph engine on top of ClickHouse.
Schema improvements
- Columnar Node Tables: Store nodes as standard MergeTree tables with tenant ID and node ID as primary key and properties as columns.
- Bidirectional Edge Storage (Double Indexing): Create separate `_forward` and `_backward` tables for each edge type, sorted by source and destination respectively for fast traversal in both directions.
- Materialized Views for Sync: Use materialized views to automatically keep forward/backward edge tables synchronized from a single source.
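The double-indexing and materialized-view ideas could be combined as below. This is a sketch under assumed table names (`group_has_project_*`), not a finalized design:

```sql
-- Hypothetical sketch: forward table sorted by source, backward twin
-- sorted by destination, and a materialized view keeping them in sync.
CREATE TABLE group_has_project_forward
(
    tenant_id  UInt64,
    group_id   UInt64,
    project_id UInt64
)
ENGINE = MergeTree
ORDER BY (tenant_id, group_id);       -- fast outgoing traversal

CREATE TABLE group_has_project_backward
(
    tenant_id  UInt64,
    group_id   UInt64,
    project_id UInt64
)
ENGINE = MergeTree
ORDER BY (tenant_id, project_id);     -- fast incoming traversal

-- Writes go only to the forward table; the view mirrors every insert
-- into the backward table automatically.
CREATE MATERIALIZED VIEW group_has_project_backward_mv
TO group_has_project_backward
AS SELECT tenant_id, group_id, project_id
FROM group_has_project_forward;
```

Compared to the projection used in the single-table benchmark, explicit twin tables make the reverse sort order a first-class target that queries can reference directly.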
Cypher to SQL Conversion Patterns
- MATCH Patterns: Convert graph patterns to explicit joins between edge tables and node tables, using CTEs for readability and to enable query optimization.
- Bidirectional Traversal: Use `_forward` tables for outgoing edges `(a)-[]->(b)` and `_backward` tables for incoming edges `(a)<-[]-(b)`.
- WHERE Filters: Push selective filters into early CTEs to mimic sideways information passing. Compute filtered node sets first and then use them to filter edge scans.
- Multi-hop Paths: Chain CTEs for each hop (hop1, hop2, hop3) where each CTE joins with the appropriate edge table based on previous hop results.
- Variable-Length Paths: Use `WITH RECURSIVE` for variable-length patterns, or an explicit `UNION` of fixed hops for better performance on shallow depths (1-3 hops).
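Putting the MATCH, filter-pushdown, and multi-hop patterns together, a two-hop Cypher query such as `MATCH (g:Group {id: 42})-[:GROUP_HAS_PROJECT]->(p)-[:PROJECT_HAS_ISSUE]->(i)` could translate to chained CTEs roughly as follows (table and column names are assumptions for illustration):

```sql
-- Hypothetical Cypher-to-SQL translation: one CTE per hop, with the
-- selective anchor filter (group 42) pushed into the first CTE so each
-- later hop only scans edges reachable from the previous one.
WITH
    hop1 AS
    (
        SELECT project_id
        FROM group_has_project
        WHERE tenant_id = 1
          AND group_id = 42            -- selective filter applied first
    ),
    hop2 AS
    (
        SELECT issue_id
        FROM project_has_issue
        WHERE tenant_id = 1
          AND project_id IN (SELECT project_id FROM hop1)
    )
SELECT i.*
FROM issues AS i
WHERE i.tenant_id = 1
  AND i.issue_id IN (SELECT issue_id FROM hop2);
```

A three-hop pattern would add a `hop3` CTE in the same shape, which is exactly where the benchmark showed the multi-table design's smaller, focused edge tables paying off.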
Code indexing
note: This section has not been tested at scale yet.

