# KG Database Selection - FalkorDB
Given the sudden deprecation of Kuzu, the GKG team will be accelerating its evaluation of alternative graph databases whose querying capabilities and general performance closely overlap with Kuzu's, while also looking to address some of Kuzu's downsides.
The intent of this issue is to provide a broad analysis of the features deemed necessary for the GKGaaS effort, an overview of how FalkorDB performs, and some notes on deployment/resource provisioning, alongside a broader checklist.
## Overview of FalkorDB
FalkorDB is a Redis module (analogous to a Postgres extension) that uses GraphBLAS to implement the graph traversal algorithms necessary for OpenCypher compatibility, while deferring to Redis for storage, transactionality, durability, and horizontal and vertical scaling.

Regarding exactly how FalkorDB integrates with Redis, please see https://docs.falkordb.com/getting-started/configuration.html. At a high level, FalkorDB's graph traversal and storage utilities are dynamically linked into Redis at runtime via the `MODULE LOAD` command (or the `loadmodule` configuration directive), and FalkorDB communicates with the Redis internals through Redis's C module API. The surface area of FalkorDB is therefore much smaller than that of entirely de-novo graph database efforts like Kuzu, and connecting to a FalkorDB instance is identical to connecting to a vanilla Redis cluster once the module is loaded. One benefit of extending Redis rather than adding a new stateful service is that GitLab already deploys Redis in production: https://docs.gitlab.com/charts/advanced/external-redis/.
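
To make that last point concrete, here is a minimal sketch (not taken from any existing service) of talking to a FalkorDB-enabled Redis instance over the plain Redis protocol with `redis-py`; the host, port, graph name `gkg`, and the example labels are placeholders:

```python
# Minimal sketch: FalkorDB is reached like any other Redis server; the module
# only adds the GRAPH.* command family on top of the standard protocol.
# Host, port, and the graph name "gkg" are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379)

# Confirm the module is present (MODULE LIST returns all loaded Redis modules).
print(r.execute_command("MODULE", "LIST"))

# Run an OpenCypher query against a graph keyed by name.
result = r.execute_command(
    "GRAPH.QUERY",
    "gkg",
    "MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) "
    "RETURN p.name, count(i) AS issue_count ORDER BY issue_count DESC LIMIT 10",
)
print(result)  # raw reply: [header, result rows, execution statistics]
```

Everything a vanilla Redis client already supports (connection strings, TLS, Sentinel/Cluster-aware clients) should apply unchanged; FalkorDB only adds the `GRAPH.*` command family on the server side.
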
## Checklist
- Property Graph Model: The database must be a native property graph database. Our data model relies on nodes, relationships (edges), and properties attached to both. Alternative models like RDF are not suitable.
- Cypher Query Language Support: The database must support the OpenCypher query language. Cypher is a standardized, SQL-like language for graphs that integrates well with Large Language Models (LLMs) and provides a common interface for our services.
  - Note: Like many of the Cypher-compatible databases in &31, FalkorDB does not implement the exact Cypher specification, but rather a best-effort approximation. See the full list of limitations here: https://docs.falkordb.com/cypher/known-limitations.html
- Mixed Workload Versatility: The database must perform efficiently under two distinct workload profiles:
  - Code Indexing: a high volume of nodes with smaller data payloads (e.g., code symbols).
  - SDLC Data: a lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).
- Support for Indexing at Scale: The database must be compatible with an indexing pipeline that performs lookups against an external data lake (like ClickHouse or PostgreSQL) to enrich graph data with necessary context (e.g., top-level namespace IDs) before ingestion (a sketch of this enrichment step follows the list).
- (Nice-to-have) Embeddable: ~~Enable embeddable distributions, like Kuzu, for the local code indexing Knowledge Graph edition.~~ Unfortunately, as described in the summary, FalkorDB is implemented as a Redis module, so running it is indistinguishable from running a single-node Redis server or an N-node Redis cluster. The Falkor team is experimenting with an embedded version of their database, but it is still experimental (https://docs.falkordb.com/operations/falkordblite.html) and thus not suitable for production at this time.
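
As a rough sketch of the enrichment flow referenced in the indexing-at-scale item above, assuming a PostgreSQL lookup and a placeholder graph named `gkg` (the DSN, batching, and property names are illustrative, not the actual pipeline):

```python
# Rough sketch of the enrichment step: look up top-level namespace context in
# an external store (PostgreSQL here) before writing nodes into the graph.
# The DSN, the use of namespaces.traversal_ids, and the graph name "gkg" are
# illustrative assumptions, not the actual pipeline.
import psycopg2
import redis

pg = psycopg2.connect("dbname=gitlabhq_production")  # placeholder DSN
r = redis.Redis(host="localhost", port=6379)

def enrich_and_ingest(namespace_ids):
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, traversal_ids FROM namespaces WHERE id = ANY(%s)",
            (list(namespace_ids),),
        )
        rows = cur.fetchall()

    for namespace_id, traversal_ids in rows:
        # MERGE keeps re-runs idempotent; values are inlined for brevity.
        cypher = (
            f"MERGE (n:Namespace {{id: {int(namespace_id)}}}) "
            f"SET n.traversal_ids = {[int(x) for x in traversal_ids]}"
        )
        r.execute_command("GRAPH.QUERY", "gkg", cypher)
```
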
## Performance & Scalability Requirements
- Optimized for Analytical (OLAP) Workloads: The Knowledge Graph is a read-heavy, analytical system. The database must be optimized for complex OLAP-style queries and does not require high-frequency, user-driven transactional (OLTP) write capabilities.
- High Availability & Horizontal Scalability: The solution must be architected for high availability with multi-node, multi-replica configurations to eliminate single points of failure and scale horizontally to handle the growth of GitLab.com.
- High Write Speed & Bulk Import Capability: The database must demonstrate high-performance write throughput and, critically, support efficient bulk importing of data from prepared files (e.g., Parquet). This is essential for our ETL pipeline, which processes data from a data lake before loading it into the graph (see the bulk-load sketch after this list).
- High Node Count Support: The database must be capable of storing and querying graphs containing an extremely high number of nodes and edges to accommodate enterprise-scale codebases and years of SDLC history.
- Efficient Filtering by Traversal IDs: The database must be able to efficiently execute queries that filter nodes and relationships based on a list of traversal IDs to support our authorization model.
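
Since FalkorDB ingests data through regular Cypher writes (or its external bulk-loader tooling) rather than a `COPY FROM`-style statement, a prepared Parquet file would likely be streamed in batches via `UNWIND`. A rough sketch under those assumptions (the file path, column names, batch size, and graph name are illustrative):

```python
# Rough sketch: stream a prepared Parquet file into FalkorDB in batches of
# UNWIND + MERGE statements. The file path, column names, batch size, and
# graph name ("gkg") are assumptions about the ETL output, not the real pipeline.
import pyarrow.parquet as pq
import redis

r = redis.Redis(host="localhost", port=6379)
BATCH_SIZE = 5_000

def cypher_value(v):
    """Format a Python value as a Cypher literal (subset sufficient here)."""
    if isinstance(v, str):
        return "'" + v.replace("\\", "\\\\").replace("'", "\\'") + "'"
    if isinstance(v, bool):
        return "true" if v else "false"
    if v is None:
        return "null"
    if isinstance(v, (int, float)):
        return repr(v)
    if isinstance(v, list):
        return "[" + ", ".join(cypher_value(x) for x in v) + "]"
    if isinstance(v, dict):
        return "{" + ", ".join(f"{k}: {cypher_value(x)}" for k, x in v.items()) + "}"
    raise TypeError(f"unsupported type: {type(v)!r}")

rows = pq.read_table("issues.parquet").to_pylist()  # e.g. columns: id, project_id, title

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    cypher = (
        f"UNWIND {cypher_value(batch)} AS row "
        "MERGE (i:GitLabIssue {id: row.id}) "
        "SET i.title = row.title "
        "MERGE (p:GitLabProject {id: row.project_id}) "
        "MERGE (i)-[:BELONGS_TO_PROJECT]->(p)"
    )
    r.execute_command("GRAPH.QUERY", "gkg", cypher)
```

A production loader would presumably prefer the client libraries' query-parameter support or dedicated bulk-load tooling over string interpolation; the small formatter above is only there to keep the sketch self-contained.
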
## Security & Authorization Requirements
- Robust Multi-tenancy & Data Isolation: The system must enforce strict data segregation between tenants (top-level namespaces). It must support creating logical or physical database separations per tenant to ensure customer data is never mixed, which is a critical security and compliance requirement.
- Query-Time Authorization via Dynamic Filtering: The query language (Cypher) must support the injection of dynamic filters (e.g., `WHERE` clauses) based on claims from a JWT (like user-specific traversal IDs), allowing the service to enforce fine-grained, row-level security at query time so that users can only access data they are authorized to see (see the sketch after this list).
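
To illustrate how the two requirements above could compose in the GKG Service, here is a hedged sketch that selects a per-tenant graph and injects traversal IDs from already-verified JWT claims into a `WHERE` clause; the claim names, the graph-per-namespace keying, and the property names are assumptions for illustration only:

```python
# Sketch of query-time authorization: select the tenant's graph and inject the
# caller's traversal IDs (taken from already-verified JWT claims) into a WHERE
# clause. Claim names, the graph-per-namespace keying, and property names are
# illustrative assumptions, not the GKG Service's actual contract.
import redis

r = redis.Redis(host="localhost", port=6379)

def issues_visible_to(claims: dict) -> list:
    # e.g. claims = {"root_namespace_id": 9970, "traversal_ids": [9970, 123, 456]}
    graph_name = f"gkg:{int(claims['root_namespace_id'])}"     # logical per-tenant graph
    traversal_ids = [int(x) for x in claims["traversal_ids"]]  # sanitize to ints

    cypher = (
        "MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) "
        f"WHERE p.namespace_id IN {traversal_ids} "
        "RETURN i.title, p.name LIMIT 50"
    )
    return r.execute_command("GRAPH.QUERY", graph_name, cypher)
```

In this sketch, keying one graph per top-level namespace gives coarse isolation at the Redis key level, while the injected `WHERE` clause enforces the finer-grained traversal-ID filter inside the tenant's graph.
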
## Operational & Deployment Requirements
- Deployment Model Flexibility: The solution must be deployable across all GitLab offerings, including GitLab.com, GitLab Dedicated, and self-managed instances. Managed cloud services (like Amazon Neptune) are viable for .com/Dedicated, but a self-hostable option is still necessary for self-managed customers.
- Integration with GitLab Deployment Tooling: The database must be deployable and manageable through our standard toolchain, including CNG (Cloud Native GitLab) and Helm charts.
- Low Operational Overhead: A managed solution or a database with low operational complexity is strongly preferred to minimize the on-call burden for our DBRE and Data Engineering teams. The chosen solution must be supportable by our internal teams.
- Monitoring & Logging Integration: The database must integrate with our existing observability stack for monitoring, logging, and alerting, or at least be observable by the GKG Service.
- FinOps: What will the associated costs be to run an instance at the `gitlab-org` scale?
## Benchmark Results - Dataset Configuration

Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size (MB) | Storage Efficiency (ratio) | Description |
---|---|---|---|---|---|
`DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Small dataset with minimal entities |
`DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Large dataset with moderate complexity |
`DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.025, long_description_size_bytes: 393216 }` | - | - | - | - | Huge dataset with large descriptions (384KB, 2.5% ratio) |
`DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 60000, num_issues_per_project: 200, num_mrs_per_project: 1000, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 524288 }` | - | - | - | - | Giant dataset with large descriptions (512KB, 5% ratio) |
## Benchmark Results - Individual Queries

Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
---|---|---|---|---|---|
Node Counts | - | - | - | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) | `MATCH (u:GitLabUser) RETURN count(u)` (and similar for each type) |
Issues Per Project (Top 10) | - | - | - | Aggregation with relationship traversal and ordering: `(Issue)-[:BELONGS_TO_PROJECT]->(Project)` | `MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) RETURN p.name, count(i) as issue_count ORDER BY issue_count DESC LIMIT 10` |
Most Active Users by Authored Issues | - | - | - | Aggregation with relationship traversal: `(Issue)-[:AUTHORED_BY]->(User)` with ordering and limit | `MATCH (i:GitLabIssue)-[:AUTHORED_BY]->(u:GitLabUser) RETURN u.username, count(i) as issues_authored ORDER BY issues_authored DESC LIMIT 10` |
Assigned Issues Count | - | - | - | Simple relationship count: `(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN count(i) as assigned_issues_count` |
Merge Requests Closing Issues | - | - | - | Relationship traversal count: `(MergeRequest)-[:CLOSES_ISSUE]->(Issue)` | `MATCH (mr:GitLabMergeRequest)-[:CLOSES_ISSUE]->(i:GitLabIssue) RETURN count(mr) as mrs_closing_issues` |
Issues in Milestones (Top 5) | - | - | - | Aggregation with relationship traversal: `(Issue)-[:IN_MILESTONE]->(Milestone)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_MILESTONE]->(m:GitLabMilestone) RETURN m.title, count(i) as issues_in_milestone ORDER BY issues_in_milestone DESC LIMIT 5` |
Epic to Issues Relationship (Top 5) | - | - | - | Aggregation with relationship traversal: `(Issue)-[:IN_EPIC]->(Epic)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_EPIC]->(e:GitLabEpic) RETURN e.title, count(i) as issues_in_epic ORDER BY issues_in_epic DESC LIMIT 5` |
Users with Most Projects | - | - | - | Aggregation with relationship traversal: `(User)-[:MEMBER_OF_PROJECT]->(Project)` with ordering and limit | `MATCH (u:GitLabUser)-[:MEMBER_OF_PROJECT]->(p:GitLabProject) RETURN u.username, count(p) as project_count ORDER BY project_count DESC LIMIT 5` |
Project → Issue → Assignee Chain | - | - | - | Multi-hop traversal: `(Project)<-[:BELONGS_TO_PROJECT]-(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (p:GitLabProject)<-[:BELONGS_TO_PROJECT]-(i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN p.name, i.title, u.username LIMIT 10` |
Relationship Counts by Type | - | - | - | Global relationship aggregation by type across entire graph | `MATCH ()-[r]->() RETURN type(r) as rel_type, count(r) as count` |