[deployed] Knowledge Graph Database Selection
### Problem to Solve

The GitLab Knowledge Graph (GKG) is a platform for delivering next-generation AI-powered features and selected analytics features, including Code Indexing and comprehensive SDLC (Software Development Life Cycle) analysis. Our initial architecture was built upon [KùzuDB](https://github.com/kuzudb/kuzu), an embedded graph database chosen for its performance. However, [**KùzuDB has been archived**](https://www.theregister.com/2025/10/14/kuzudb_abandoned/), rendering it an unviable foundation for a production system and creating the need to select a new graph database.

This database is the "corner piece" of the GKG architecture. Its selection is a decision that will dictate our system's scalability, performance, operational complexity, and ability to support both GitLab.com and self-managed customers. We must choose a robust graph database that can handle the scale and complexity of our data.

### Proposed Solution

We will conduct a formal evaluation, benchmarking, and selection process to identify the optimal graph database for the GitLab Knowledge Graph. The final selection will be validated through proof-of-concept implementations and performance testing that simulate production workloads.

### Important Note

The GitLab Knowledge Graph will follow the [Omnibus Adjacent Architecture](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/selfmanaged_segmentation/#omnibus-adjacent-kubernetes-oak), meaning that we do not have to design for a strict Omnibus environment. We can assume that this service will be _cloud native only_ - though we should still take into consideration the standard complexity and tradeoffs of new components and new databases.
---

### Candidates

The following databases are candidates for evaluation:

* [Neo4j](https://neo4j.com/)
* [Apache AGE](https://age.apache.org/)
* [FalkorDB](https://www.falkordb.com/)
* [Memgraph](https://memgraph.com/)
* [Neptune](https://aws.amazon.com/neptune/)
* [NebulaGraph](https://github.com/vesoft-inc/nebula)
* [Spanner Graph](https://cloud.google.com/spanner/docs/graph/overview) (if we get time). Note: it doesn't support [Open]Cypher.

### Detailed Requirements

The selected database must meet the following criteria, sourced from our design discussions and architectural needs:

#### **Functional Requirements**

* **Property Graph Model:** The database **must** be a native property graph database. Our data model relies on nodes, relationships (edges), and properties attached to both. Alternative models like RDF are not suitable.
* **Cypher Query Language Support:** The database **must** support the OpenCypher query language (or [GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/), the evolution of OpenCypher). Cypher is a standardized, SQL-like language for graphs that integrates well with Large Language Models (LLMs) and provides a common interface for our services.
* **Mixed Workload Versatility:** The database must perform efficiently under two distinct workload profiles:
  1. **Code Indexing:** High volume of nodes with smaller data payloads (e.g., code symbols).
  2. **SDLC Data:** Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).
* **Support for Indexing at Scale:** The database must be compatible with an indexing pipeline that performs lookups against an external data lake (like ClickHouse or PostgreSQL) to enrich graph data with necessary context (e.g., top-level namespace IDs) before ingestion.
* **(Nice-to-have) Embeddable:** Offers an embeddable distribution, like Kuzu, for the local code indexing Knowledge Graph edition.

#### **Performance & Scalability Requirements**

* **Optimized for Analytical (OLAP) Workloads:** The Knowledge Graph is a read-heavy, analytical system. The database must be optimized for complex OLAP-style queries and does not require high-frequency, user-driven transactional (OLTP) write capabilities.
* **High Availability & Horizontal Scalability:** The solution must be architected for high availability with multi-node, multi-replica configurations to eliminate single points of failure and scale horizontally to handle the growth of GitLab.com.
* **High Write Speed & Bulk Import Capability:** The database must demonstrate high-performance write throughput and, critically, support efficient **bulk importing** of data from prepared files (e.g., Parquet). This is essential for our ETL pipeline, which processes data from a data lake before loading it into the graph.
* **High Node Count Support:** The database must be capable of storing and querying graphs containing an extremely high number of nodes and edges to accommodate enterprise-scale codebases and years of SDLC history.
* **Efficient Filtering by Traversal IDs:** The database must be able to efficiently execute queries that filter nodes and relationships based on a list of traversal IDs to support our authorization model.

#### **Security & Authorization Requirements**

* **Robust Multi-tenancy & Data Isolation:** The system must enforce strict data segregation between tenants (top-level namespaces). It must support creating logical or physical database separations per tenant to ensure customer data is never mixed, which is a critical security and compliance requirement.
* **Query-Time Authorization via Dynamic Filtering:** The query language (Cypher) must support the injection of dynamic filters (e.g., `WHERE` clauses) based on claims from a JWT (such as user-specific traversal IDs), allowing the service to enforce fine-grained, row-level security at query time. This ensures users can only access data they are authorized to see.

#### **Operational & Deployment Requirements**

* **Deployment Model Flexibility:** The solution must be deployable across all GitLab offerings, including GitLab.com, GitLab Dedicated, and self-managed instances. Managed cloud services (like Amazon Neptune) are viable for .com/Dedicated, but a self-hostable option is still necessary for self-managed customers.
* **Integration with GitLab Deployment Tooling:** The database must be deployable and manageable through our standard toolchain, including **CNG (Cloud Native GitLab)** and Helm charts.
* **Low Operational Overhead:** A managed solution or a database with low operational complexity is strongly preferred to minimize the on-call burden for our DBRE and Data Engineering teams. The chosen solution must be supportable by our internal teams.
* **Monitoring & Logging Integration:** The database must integrate with our existing observability stack for monitoring, logging, and alerting - or at least be observable by the GKG Service.
* **FinOps:** What will the associated costs be to run an instance at the `gitlab-org` scale?

#### **Legal & Commercial Requirements**

* **Acceptable License:** The database license **should** be compatible with GitLab's distribution model for both our open-source Community Edition (CE) and commercial Enterprise Edition (EE) for self-managed customers.
If the license does not follow [this guide](https://handbook.gitlab.com/handbook/legal/product/#using-open-source-software), it will require legal review.

---

### Methodology

We will adopt a standardized benchmarking methodology for each candidate database. This approach is modeled directly on how we evaluated Kuzu performance. See https://gitlab.com/gitlab-org/rust/knowledge-graph/-/merge_requests/292+.

The evaluation will proceed in the following phases for each candidate:

#### **1. Synthetic Data Generation & Export**

We will generate a large-scale, synthetic dataset that realistically mimics the structure and scale of a large GitLab instance. This dataset will be the common input for all database tests.

* **Configurable Scale:** Using a `DatasetConfig`, we will generate a graph with tens of millions of nodes and hundreds of millions of relationships, including users, groups, projects, issues, merge requests, epics, and milestones.
* **Realistic Data Properties:** The generation will include varied data, such as a configurable ratio of issues and merge requests with large description fields, to test performance with large text properties.
* **Streaming to Parquet:** To handle the massive data volume without exceeding memory limits, the generated data will be streamed directly to a set of Parquet or CSV files (whichever is most applicable). These files will later be used to test each database's bulk import functionality.

#### **2. Database Ingestion and Storage Benchmark**

This phase will measure the efficiency of each database at ingesting the prepared Parquet data and storing it (either in memory or on disk).

* **Bulk Import:** We will use each database's native bulk import functionality (e.g., `COPY FROM`) to load the graph from the Parquet files.
* **Metrics to Measure:**
  * **Bulk Import Time:** The total time taken to load the entire dataset.
  * **Import Throughput:** The number of records (nodes and relationships) ingested per second.
  * **Final Database Size:** The total on-disk or in-memory size of the database after ingestion.
  * **Storage Efficiency:** The ratio of the final database size to the raw Parquet data size, indicating the database's compression and storage effectiveness.

#### **3. Query Performance Benchmark**

Once populated, each database will be tested through various access patterns relevant to our use cases.

* **Single-Threaded Queries:** A series of individual queries will be executed to establish a baseline for latency on common operations, including:
  * Simple node counts (`MATCH (n) RETURN count(n)`).
  * Aggregations with ordering and limits (e.g., "top 10 most active users").
  * Relationship traversals and joins (e.g., "count of merge requests that close issues").
* **Concurrent Query Benchmark:** To simulate real-world application load, we will execute a suite of complex queries in parallel across multiple threads. This will measure performance under pressure.
* **Metrics to Measure:**
  * **Average Latency (ms):** The mean execution time for each query type.
  * **p95 Latency (ms):** The 95th percentile latency, which is critical for understanding worst-case user experience.
* **Specific Query Patterns to be Tested:**
  * **Multi-hop Traversals:** Queries that traverse multiple relationships (e.g., `Project -> Issue -> Assignee`).
  * **Variable-Length Path Queries:** More expensive queries that find nodes within a variable number of hops (e.g., `(issue)-[*1..2]-(neighbor)`).
  * **Impact of Large Properties:** We will explicitly compare the performance of queries that return full node objects (including large `description` fields) versus queries that return only specific, smaller properties (e.g., `title`, `id`).

#### **4. Analysis and Comparison**

The quantitative results from all phases will be compiled into a comprehensive report in an issue per database.
Each candidate database will be scored against the above metrics (ingestion speed, storage size, query latency) and evaluated against our full list of functional, operational, and legal requirements to make a final, informed decision.
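As a minimal sketch of the query-time authorization pattern described in the requirements: a dynamic `WHERE` filter built from JWT claims, with traversal IDs passed as a query parameter rather than interpolated into the Cypher string (avoiding injection). The claim name `traversal_ids` and the property names are illustrative assumptions, not the final GKG schema.

```python
# Sketch only: builds a Cypher query with an authorization filter injected
# from (already verified) JWT claims. Property and claim names are
# hypothetical, not the final GKG schema.

def build_authorized_query(base_match: str, claims: dict) -> tuple[str, dict]:
    """Append a traversal-ID filter so users only see data they can access.

    The IDs travel as a bound query parameter ($traversal_ids), which keeps
    user-controlled values out of the query text itself.
    """
    traversal_ids = claims.get("traversal_ids", [])
    query = (
        f"{base_match} "
        "WHERE n.traversal_id IN $traversal_ids "
        "RETURN n.id, n.title"
    )
    return query, {"traversal_ids": traversal_ids}


# Example: in practice the claims would come from a verified JWT.
query, params = build_authorized_query(
    "MATCH (n:Issue)",
    {"traversal_ids": [42, 97]},
)
print(query)   # MATCH (n:Issue) WHERE n.traversal_id IN $traversal_ids RETURN n.id, n.title
print(params)  # {'traversal_ids': [42, 97]}
```

A database driver would then execute `query` with `params`, so an empty claim list yields an empty result set rather than an unfiltered graph scan.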
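The benchmark metrics above (import throughput, storage efficiency, average and p95 latency under concurrency) could be computed with a harness along these lines. The helper names and the stubbed query runner are illustrative assumptions; the real harness lives in the knowledge-graph benchmark suite referenced earlier.

```python
# Sketch of the benchmark metric calculations; names are illustrative.
import math
import time
from concurrent.futures import ThreadPoolExecutor


def p95(latencies_ms: list) -> float:
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]


def import_throughput(records: int, elapsed_s: float) -> float:
    """Records (nodes + relationships) ingested per second."""
    return records / elapsed_s


def storage_efficiency(db_size_bytes: float, raw_parquet_bytes: float) -> float:
    """Ratio of final database size to raw Parquet size (lower is better)."""
    return db_size_bytes / raw_parquet_bytes


def run_concurrent(run_query, queries, workers: int = 8) -> list:
    """Execute a suite of queries in parallel, returning per-query latency in ms."""
    def timed(query) -> float:
        start = time.perf_counter()
        run_query(query)  # e.g., a driver call against the candidate database
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed, queries))


# Example with a stubbed query runner; a real run would use a database driver.
latencies = run_concurrent(lambda q: time.sleep(0.001),
                           ["MATCH (n) RETURN count(n)"] * 20)
print(f"avg={sum(latencies) / len(latencies):.2f}ms p95={p95(latencies):.2f}ms")
```

Reporting both the mean and the nearest-rank p95 keeps the comparison honest: a database with a good average but a long latency tail would be penalized on the metric users actually feel.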