[db] Database Selection - ClickHouse

Problem to Solve

Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to evaluate alternative databases that match Kuzu's querying capabilities and general performance.

The intent of this issue is to provide a broad analysis of building a graph query layer on top of ClickHouse, exposing tools that let agents traverse GitLab's SDLC data with graph-like capabilities.

But... ClickHouse is not a graph database?

ClickHouse is not a graph database, but it excels at data analysis on large datasets. It is also already used by GitLab, which means it doesn't require purchasing a new license, setting up and maintaining infrastructure for a new database, or hiring new SREs specialized in operating these databases in production.

All of these constraints make ClickHouse a strong contender, provided it performs well on graph-like SQL queries.

Benchmarks

The results below can be found in the graphhouse project.

Single-Table Approach

This model uses a generic, schema-flexible design:

  • nodes table: A single table containing all entities (e.g., Users, Projects, Issues). A node_type column distinguishes between them, and a properties column (JSON) stores their specific attributes.
  • relationships table: A single table for all connections. A type column defines the relationship (e.g., GROUP_HAS_PROJECT, MR_CLOSES_ISSUE).
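A minimal DDL sketch of this model is shown below. The table names come from the description above; the column names, types, and `ORDER BY` keys are assumptions for illustration, not the benchmark's exact schema:

```sql
-- Illustrative single-table schema; column names and types are assumptions.
CREATE TABLE nodes
(
    id         UInt64,
    node_type  LowCardinality(String),  -- e.g. 'User', 'Project', 'Issue'
    properties JSON                     -- entity-specific attributes
)
ENGINE = MergeTree
ORDER BY (node_type, id);

CREATE TABLE relationships
(
    source_id UInt64,
    target_id UInt64,
    type      LowCardinality(String)   -- e.g. 'GROUP_HAS_PROJECT', 'MR_CLOSES_ISSUE'
)
ENGINE = MergeTree
ORDER BY (type, source_id);
```

Ordering the tables by `node_type` and `type` keeps rows of the same kind physically together, which helps the simple lookups and 1-hop traversals benchmarked below.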

Results

  • Performance:
    • Excellent for simple node lookups and 1-hop traversals (often < 30ms).
    • Poor for multi-hop (2+) traversals, with performance degrading significantly (500ms - 1000ms+). Queries require multiple self-joins on the same large tables, which is inefficient.
  • Pros:
    • High Flexibility: New entity and relation types can be added without schema changes.
  • Cons:
    • Poor Join Performance: Multi-hop queries are slow due to repeated scans and joins on the same massive tables.
    • Weaker Typing: Data integrity and typing are enforced at the application level, not by the database schema.
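To illustrate why multi-hop traversals degrade, here is a sketch of a hypothetical 2-hop query in this model (the relationship types and the `source_id` value are made up): every hop adds another self-join against the same large `relationships` table.

```sql
-- Hypothetical 2-hop traversal: group -> project -> issue.
-- Each hop self-joins the single relationships table once more.
SELECT n.id, n.node_type
FROM relationships AS r1
INNER JOIN relationships AS r2 ON r2.source_id = r1.target_id
INNER JOIN nodes AS n ON n.id = r2.target_id
WHERE r1.type = 'GROUP_HAS_PROJECT'
  AND r2.type = 'PROJECT_HAS_ISSUE'
  AND r1.source_id = 42;
```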

Multi-Table Approach

This model uses a traditional relational or property graph design with a specific table for each entity and relationship type.

  • Node Tables: Separate tables for each node type (e.g., groups, projects, issues), with strongly-typed columns for properties.
  • Edge Tables: Separate tables for each relationship type (e.g., group_has_project, project_has_issue), linking the primary keys of the corresponding node tables.
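A sketch of this model for one node type and one edge type follows; again, the column names, types, and `ORDER BY` keys are illustrative assumptions, not the benchmark's exact schema:

```sql
-- Illustrative multi-table schema; column names and types are assumptions.
CREATE TABLE projects
(
    id         UInt64,
    name       String,
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY id;

CREATE TABLE group_has_project
(
    group_id   UInt64,
    project_id UInt64
)
ENGINE = MergeTree
ORDER BY group_id;
```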

Results

  • Performance:
    • Well-balanced across a wide range of queries.
    • Fast for user-facing features (45% of queries < 100ms).
    • Good for interactive analytics (70% of queries < 500ms).
    • Efficiently handles 1, 2, and 3-hop traversals due to smaller, specific tables for joins.
    • Recursive path-finding queries are slower but perform acceptably for batch analysis.
  • Pros:
    • High Performance: Joins are performed on smaller, more specific tables, leading to faster execution for complex traversals.
    • Strong Data Integrity: The schema enforces data types and relationships.
  • Cons:
    • Schema Rigidity: Adding new entity or relationship types requires schema migrations (new tables).
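For comparison with the single-table model, the same hypothetical 2-hop traversal in this model joins small, type-specific edge tables instead of self-joining one massive table (table names follow the examples above; the `group_id` value is made up):

```sql
-- Hypothetical 2-hop traversal over type-specific edge tables:
-- group -> project -> issue.
SELECT i.id
FROM group_has_project AS gp
INNER JOIN project_has_issue AS pi ON pi.project_id = gp.project_id
INNER JOIN issues AS i ON i.id = pi.issue_id
WHERE gp.group_id = 42;
```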

Hybrid Approach

todo

Edited by Jean-Gabriel Doyon