KG Database Selection - Neo4j
Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to evaluate alternative graph databases whose querying capabilities and general performance strongly overlap with Kuzu's, while also looking to ameliorate some of Kuzu's downsides.
The intent of this issue is to provide a broad analysis of the features deemed necessary to the GKGaaS effort, an overview of how Neo4j performs, and some notes on deployment/resource provisioning alongside a broader checklist.
Checklist
- ✅ Property Graph Model: The database must be a native property graph database. Our data model relies on nodes, relationships (edges), and properties attached to both. Alternative models like RDF are not suitable.
- ✅ Cypher Query Language Support: The database must support the OpenCypher query language. Cypher is a standardized, SQL-like language for graphs that integrates well with Large Language Models (LLMs) and provides a common interface for our services.
  Neo4j is on the leading edge of Cypher compatibility, supporting both Cypher 25 and Cypher 5, the latter being stable and backwards compatible with a variety of Neo4j versions. Cypher 5 is also more amenable to cross-database support and LLM-based query generation, whereas Cypher 25 includes more cutting-edge features that drastically reduce the execution times of some specific queries; see: https://neo4j.com/blog/developer/declarative-route-planning-with-cypher-25/.
- Mixed Workload Versatility: The database must perform efficiently under two distinct workload profiles:
  - ⚠️ Code Indexing: High volume of nodes with smaller data payloads (e.g., code symbols).
  - ⚠️ SDLC Data: Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).
- ⚠️ Support for Indexing at Scale: The database must be compatible with an indexing pipeline that performs lookups against an external data lake (like ClickHouse or PostgreSQL) to enrich graph data with necessary context (e.g., top-level namespace IDs) before ingestion. As discussed further down the checklist, indexing performance depends entirely on how many columns our nodes have. If they are "slim" and nearly all columns that would be used to satisfy `WHERE` predicates are located in the data lake, indexing at scale becomes much more manageable. In any case, indexing performance seems satisfactory and tunable, especially with the introduction of `CALL {} IN TRANSACTIONS`, which makes batching for bulk ingestion more reliable (see the sketch after this checklist).
- ❌ (Nice-to-have) Embeddable: ~~Enable embeddable distributions, like Kuzu, for the local code indexing Knowledge Graph edition.~~ Unfortunately, Neo4j is not embeddable, but given that the local version of GKG already incorporates a daemon-like service, it would not be too much of a stretch to spin up a local Neo4j instance as a "sidecar" database to our long-lived daemon.
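To make the batching mechanism referenced above concrete, here is a minimal sketch of parameter-driven ingestion via `CALL {} IN TRANSACTIONS`. The `GitLabIssue` label, its properties, and the `$rows` parameter are illustrative assumptions rather than the finalized GKG schema.

```cypher
// Minimal sketch: batched writes via CALL {} IN TRANSACTIONS (Neo4j 5+).
// Label, properties, and the $rows parameter are illustrative, not the actual GKG schema.
// Must be executed as an implicit (auto-commit) transaction.
UNWIND $rows AS row
CALL {
  WITH row
  MERGE (i:GitLabIssue {id: row.id})
  SET i.title = row.title,
      i.description = row.description
} IN TRANSACTIONS OF 10000 ROWS
```

Each inner batch commits independently, which is what makes large bulk ingestion more predictable than a single monolithic transaction.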
Performance & Scalability Requirements
- ✅ Optimized for Analytical (OLAP) Workloads: The Knowledge Graph is a read-heavy, analytical system. The database must be optimized for complex OLAP-style queries and does not require high-frequency, user-driven transactional (OLTP) write capabilities. While Neo4j is not traditionally seen as an OLAP-style engine, the Neo4j team has invested much effort into speeding up queries, including migrating to a block-based storage system, better support for concurrent batch writes, and a parallel runtime. You can read more about it here: https://neo4j.com/blog/developer/neo4j-v5-lts-evolution/. It is also worth noting that the aggregate queries we'd be running will mostly be limited to `COUNT`, `AVG`, and `MEDIAN` (see the aggregate sketch after this list). Neo4j also explicitly supports spinning up analytical clusters, e.g., read replicas tuned for analytical workloads: https://neo4j.com/docs/operations-manual/current/clustering/setup/analytics-cluster/.
- ✅ High Availability & Horizontal Scalability: The solution must be architected for high availability with multi-node, multi-replica configurations to eliminate single points of failure and scale horizontally to handle the growth of GitLab.com. Neo4j clusters more readily support HA and horizontal scalability as of version 5 with Autonomous Clustering, including single-writer failovers; read more here: https://neo4j.com/blog/developer/scalable-fault-tolerant-graph-databases-neo4j-autonomous-clustering/
- ⚠️ High Write Speed & Bulk Import Capability: The database must demonstrate high-performance write throughput and, critically, support efficient bulk importing of data from prepared files (e.g., Parquet). This is essential for our ETL pipeline, which processes data from a data lake before loading it into the graph. Neo4j has a reasonably high ingestion rate for bulk inserts, especially with the new `CALL {} IN TRANSACTIONS` functionality, which improves concurrent write performance, but there are a few caveats. Like FalkorDB, Neo4j lacks support for loading directly from Parquet and currently only supports `LOAD CSV`. This forces GKG's writer infrastructure to make one of two choices: either write directly to CSV (consuming large amounts of disk space with no columnar compression), or first write to Parquet, then batch-convert to CSV and ingest each batch serially (or with a limited amount of parallelism), which increases ingestion latency (see the `LOAD CSV` sketch after this list).
- ✅ High Node Count Support: The database must be capable of storing and querying graphs containing an extremely high number of nodes and edges to accommodate enterprise-scale codebases and years of SDLC history.
- ✅ Efficient Filtering by Traversal IDs: The database must be able to efficiently execute queries that filter nodes and relationships based on a list of traversal IDs to support our authorization model.
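As referenced in the OLAP item above, a hedged sketch of the kind of `COUNT`/`AVG` aggregate we would run, using the per-query `CYPHER runtime=parallel` option available in recent Neo4j 5 Enterprise releases. The labels follow the benchmark schema below.

```cypher
// Sketch of an analytical aggregate using the parallel runtime (Neo4j 5 Enterprise).
// Labels match the benchmark schema used later in this issue.
CYPHER runtime=parallel
MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject)
WITH p, count(i) AS issue_count
RETURN count(p) AS projects,
       avg(issue_count) AS avg_issues_per_project,
       max(issue_count) AS max_issues_per_project
```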
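And a sketch of the CSV-only bulk path discussed in the write-speed item above: Parquet batches would first be rewritten as CSV, then each file ingested with `LOAD CSV`. The file name, labels, and columns are placeholders, not the actual GKG layout.

```cypher
// Sketch of file-based bulk ingestion with LOAD CSV + CALL {} IN TRANSACTIONS.
// File name, labels, and columns are placeholders; run as an implicit transaction.
LOAD CSV WITH HEADERS FROM 'file:///issue_belongs_to_project_batch_000.csv' AS row
CALL {
  WITH row
  MATCH (i:GitLabIssue {id: row.issue_id})
  MATCH (p:GitLabProject {id: row.project_id})
  MERGE (i)-[:BELONGS_TO_PROJECT]->(p)
} IN TRANSACTIONS OF 10000 ROWS
```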
Security & Authorization Requirements
- ✅ Robust Multi-tenancy & Data Isolation: The system must enforce strict data segregation between tenants (top-level namespaces). It must support creating logical or physical database separations per tenant to ensure customer data is never mixed, which is a critical security and compliance requirement. Neo4j supports multiple, independent tenants, each of which can be created with the `CREATE DATABASE` command on a Neo4j instance or cluster (see the sketch after this list). The default limit is ~100 databases per instance, but it can be raised arbitrarily and, unlike Spanner, there is no hard limit.
- ✅ Query-Time Authorization via Dynamic Filtering: The query language (Cypher) must support the injection of dynamic filters (e.g., `WHERE` clauses) based on claims from a JWT (like user-specific traversal IDs). This allows the service to enforce fine-grained, row-level security at query time, ensuring users can only access data they are authorized to see (see the parameterized filter sketch after this list).
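A minimal sketch of the per-tenant isolation mentioned above, using one logical database per top-level namespace. The database name is an illustrative placeholder; multi-database management runs against the `system` database and requires Neo4j Enterprise.

```cypher
// Sketch: one logical database per tenant (run against the system database).
// The database name is an illustrative placeholder.
CREATE DATABASE tenant9970 IF NOT EXISTS;
SHOW DATABASES YIELD name, currentStatus;
```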
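And a sketch of query-time authorization via parameter injection: traversal IDs extracted from the JWT are bound as a query parameter and enforced in a `WHERE` clause. The `top_level_namespace_id` property is an assumed placeholder, not the finalized GKG schema.

```cypher
// Sketch: row-level filtering driven by JWT claims passed as query parameters.
// $allowed_traversal_ids comes from the caller's token; the property name is assumed.
MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject)
WHERE p.top_level_namespace_id IN $allowed_traversal_ids
RETURN p.name AS project, i.title AS issue
LIMIT 50
```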
Operational & Deployment Requirements
- ✅ Deployment Model Flexibility: The solution must be deployable across all GitLab offerings, including GitLab.com, GitLab Dedicated, and self-managed instances. Managed cloud services (like Amazon Neptune) are viable for .com/Dedicated, but a self-hostable option is still necessary for self-managed customers. Neo4j supports managed deployments via their cloud offering (for .com and Dedicated), as well as self-hosting via a Helm chart (https://neo4j.com/docs/operations-manual/current/kubernetes/quickstart-cluster/), including support for analytical clusters hosted on k8s (https://neo4j.com/docs/operations-manual/current/kubernetes/quickstart-analytics-cluster/).
- ✅ Integration with GitLab Deployment Tooling: The database must be deployable and manageable through our standard toolchain, including CNG (Cloud Native GitLab) and Helm charts.
- ✅ Low Operational Overhead: A managed solution or a database with low operational complexity is strongly preferred to minimize the on-call burden for our DBRE and Data Engineering teams. The chosen solution must be supportable by our internal teams. Given Neo4j's native support for k8s, operational overhead is not likely to be a major bottleneck from a CNG perspective.
- ✅ Monitoring & Logging Integration: The database must integrate with our existing observability stack for monitoring, logging, and alerting, or at least be observable by the GKG Service. Neo4j supports Prometheus/Grafana (https://neo4j.com/developer/kb/how-to-monitor-neo4j-with-prometheus/) as well as OTel out of the box.
- ✅ FinOps: what will the associated costs be to run an instance at the `gitlab-org` scale? For `DatasetConfig::giant`, which creates an SDLC graph of about 8 million nodes and 11 million edges with a total graph footprint of ~21 GB, the graph was indexed and queried with a single core and a maximum heap size of 16 GB. On a c2d-highmem-2 GCP instance, this works out to roughly $105/month. This, of course, would not necessarily scale linearly, as enabling the Neo4j parallel runtime, the block storage format, etc. would give ample room to "grow into our instances".
Benchmark Results - COPY FROM
| Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size | Storage Efficiency (ratio) | Description |
|---|---|---|---|---|---|
| `DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | 1.713 | 645.58 | - | - | Small dataset with minimal entities |
| `DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | 5.739 | 37427.86 | - | - | Large dataset with moderate complexity |
| `DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.025, long_description_size_bytes: 393216 }` | 68.05 | 25195.56 | - | - | Huge dataset with large descriptions (384KB, 2.5% ratio) |
| `DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 60000, num_issues_per_project: 200, num_mrs_per_project: 1000, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 524288 }` | 433.61 | 41158.65 | ~21 GB | - | Giant dataset with large descriptions (512KB, 5% ratio) |
Benchmark Results - Individual Queries (with `DatasetConfig::giant()`)
| Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
|---|---|---|---|---|---|
| Node Counts | 25.79 | 61.04 | 38.78 | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) | `MATCH (u:GitLabUser) RETURN count(u)` (and similar for each type) |
| Issues Per Project (Top 10) | 75.77 | 186.09 | 13.20 | Aggregation with relationship traversal and ordering: `(Issue)-[:BELONGS_TO_PROJECT]->(Project)` | `MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) RETURN p.name, count(i) as issue_count ORDER BY issue_count DESC LIMIT 10` |
| Most Active Users by Authored Issues | 29.85 | 71.39 | 33.51 | Aggregation with relationship traversal: `(Issue)-[:AUTHORED_BY]->(User)` with ordering and limit | `MATCH (i:GitLabIssue)-[:AUTHORED_BY]->(u:GitLabUser) RETURN u.username, count(i) as issues_authored ORDER BY issues_authored DESC LIMIT 10` |
| Assigned Issues Count | 28.95 | 67.94 | 34.54 | Simple relationship count: `(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN count(i) as assigned_issues_count` |
| Merge Requests Closing Issues | 13.43 | 30.92 | 74.48 | Relationship traversal count: `(MergeRequest)-[:CLOSES_ISSUE]->(Issue)` | `MATCH (mr:GitLabMergeRequest)-[:CLOSES_ISSUE]->(i:GitLabIssue) RETURN count(mr) as mrs_closing_issues` |
| Issues in Milestones (Top 5) | 15.17 | 33.08 | 65.92 | Aggregation with relationship traversal: `(Issue)-[:IN_MILESTONE]->(Milestone)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_MILESTONE]->(m:GitLabMilestone) RETURN m.title, count(i) as issues_in_milestone ORDER BY issues_in_milestone DESC LIMIT 5` |
| Epic to Issues Relationship (Top 5) | 17.33 | 36.83 | 57.70 | Aggregation with relationship traversal: `(Issue)-[:IN_EPIC]->(Epic)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_EPIC]->(e:GitLabEpic) RETURN e.title, count(i) as issues_in_epic ORDER BY issues_in_epic DESC LIMIT 5` |
| Users with Most Projects | 26.32 | 59.15 | 38.00 | Aggregation with relationship traversal: `(User)-[:MEMBER_OF_PROJECT]->(Project)` with ordering and limit | `MATCH (u:GitLabUser)-[:MEMBER_OF_PROJECT]->(p:GitLabProject) RETURN u.username, count(p) as project_count ORDER BY project_count DESC LIMIT 5` |
| Project → Issue → Assignee Chain | 30.17 | 68.73 | 33.14 | Multi-hop traversal: `(Project)<-[:BELONGS_TO_PROJECT]-(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (p:GitLabProject)<-[:BELONGS_TO_PROJECT]-(i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN p.name, i.title, u.username LIMIT 10` |
| Relationship Counts by Type | 20.94 | 49.29 | 47.77 | Global relationship aggregation by type across entire graph | `MATCH ()-[r]->() RETURN type(r) as rel_type, count(r) as count` |
Notes:
- Node Counts average is calculated across all 7 node type count queries (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones)
- Throughput is calculated as 1000 / average latency (queries per second)
- Some queries experienced failures (MRs closing issues: 13/20 failures, Project → Issue → Assignee chain: 20/20 failures, Relationship counts: 20/20 failures), likely due to timeouts or resource constraints.