[db] KG Database Selection - Google Spanner

Problem to Solve

Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to evaluate alternative graph databases that strongly overlap with Kuzu's querying capabilities and general performance, while also looking to ameliorate some of Kuzu's downsides.

The intent of this issue is to provide a broad analysis of the features deemed necessary for the GKGaaS effort, an overview of how Spanner Graph performs against them, and some notes on deployment/resource provisioning for the database.

Spanner https://cloud.google.com/spanner?hl=en

Analysis

⚠️ Property Graph Model: The database must be a native property graph database. Our data model relies on nodes, relationships (edges), and properties attached to both. Alternative models like RDF are not suitable.

  • Spanner Graph appears to be built on top of the existing Spanner relational database offering: it is a feature that integrates graph capabilities into Spanner. It requires defining nodes and edges as tables, then declaring the graph over those existing tables using their primary and foreign keys.
  • If we decide to go with a "slim" graph, we can get away with two simple tables (Node { id, traversal } and Edge { from, to }) and assign them different labels, but this will get messy if we change our mind. A minimal sketch of this schema is shown below.
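As a sketch of the "slim" approach in GoogleSQL DDL (table, column, and graph names are assumptions, not a final design):

```sql
-- Backing relational tables for the "slim" graph (names are assumptions).
CREATE TABLE Node (
  id INT64 NOT NULL,
  traversal STRING(MAX),
) PRIMARY KEY (id);

CREATE TABLE Edge (
  id INT64 NOT NULL,     -- source node id
  to_id INT64 NOT NULL,  -- destination node id
) PRIMARY KEY (id, to_id);

-- The graph layer is declared over the existing tables and their keys.
CREATE PROPERTY GRAPH KnowledgeGraph
  NODE TABLES (Node)
  EDGE TABLES (
    Edge
      SOURCE KEY (id) REFERENCES Node (id)
      DESTINATION KEY (to_id) REFERENCES Node (id)
  );
```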

⚠️ Cypher Query Language Support: The database must support the openCypher query language. Cypher is a standardized, SQL-like language for graphs that integrates well with Large Language Models (LLMs) and provides a common interface for our services.
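Note that Spanner Graph's native query language is GQL (the ISO graph query standard), whose MATCH/RETURN syntax closely mirrors openCypher but is not openCypher itself. A minimal sketch against the hypothetical KnowledgeGraph graph sketched above:

```sql
-- GQL query against the graph declared above; the syntax is Cypher-like,
-- but the GRAPH prefix and GoogleSQL function library differ from openCypher.
GRAPH KnowledgeGraph
MATCH (n:Node)-[e:Edge]->(m:Node)
RETURN n.id AS source, m.id AS destination
LIMIT 10
```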

Mixed Workload Versatility: The database must perform efficiently under two distinct workload profiles:

  • Code Indexing: High volume of nodes with smaller data payloads (e.g., code symbols).
  • SDLC Data: Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).

Support for Indexing at Scale: The database must be compatible with an indexing pipeline that performs lookups against an external data lake (like ClickHouse or PostgreSQL) to enrich graph data with necessary context (e.g., top-level namespace IDs) before ingestion.

  • Spanner supports importing from PostgreSQL via Datastream, and from Avro and CSV files via the Import API. For ClickHouse, we can export Avro files and load them with the Import API, which handles massive datasets with proper parallelization, error handling, and throughput optimization (see the sketch below).
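As a sketch of the ClickHouse side (database, table, and column names are assumptions), Avro is a built-in ClickHouse output format:

```sql
-- Hypothetical client-side export of enriched node data from ClickHouse
-- to an Avro file that the Spanner Import API can then ingest.
SELECT id, traversal
FROM gkg.nodes
INTO OUTFILE '/tmp/nodes.avro'
FORMAT Avro;
```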

(Nice-to-have) Embeddable: Enable embeddable distributions, like Kuzu, for the local code indexing Knowledge Graph edition.

  • The database is not embeddable. Furthermore, there isn't any supported way to run it locally in a Docker image. There is an emulator https://docs.cloud.google.com/spanner/docs/emulator for local development, but it is not specified whether it supports graph features, and it is not meant to be used as a local database.

Performance & Scalability Requirements

⚠️ Optimized for Analytical (OLAP) Workloads: The Knowledge Graph is a read-heavy, analytical system. The database must be optimized for complex OLAP-style queries and does not require high-frequency, user-driven transactional (OLTP) write capabilities.

  • Spanner is fundamentally an OLTP system designed for high-volume transactional writes. In August of this year, Google introduced a columnar engine that speeds up analytical queries. At the scale of multiple millions of nodes and edges, I think the columnar engine can easily do the job.
  • Spanner does not appear to offer the same algorithm library that other graph databases do. This is limiting if we want to run PageRank, shortest path, etc.

High Availability & Horizontal Scalability: The solution must be architected for high availability with multi-node, multi-replica configurations to eliminate single points of failure and scale horizontally to handle the growth of GitLab.com.

  • Spanner is built for global distribution, with replication across zones and simple horizontal scaling, and provides a 99.999% SLA. It handles all of these aspects automatically, which results in low maintenance costs on our side.

High Write Speed & Bulk Import Capability: The database must demonstrate high-performance write throughput and, critically, support efficient bulk importing of data from prepared files (e.g., Parquet). This is essential for our ETL pipeline, which processes data from a data lake before loading it into the graph.

  • If we're going to bulk import using files, we should use Avro files, which are the row-based counterpart of Parquet files.

High Node Count Support: The database must be capable of storing and querying graphs containing an extremely high number of nodes and edges to accommodate enterprise-scale codebases and years of SDLC history.

  • Spanner is built for scale. It can, without a doubt, store billions of nodes and edges if need be.

Efficient Filtering by Traversal IDs: The database must be able to efficiently execute queries that filter nodes and relationships based on a list of traversal IDs to support our authorization model.

  • Here we should probably use the same approach as the monolith uses with PostgreSQL to store this information (see the sketch below).
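As a sketch, assuming each node stores its namespace hierarchy as a '/'-delimited traversal path in the traversal column from the schema above (the column name and encoding are assumptions):

```sql
-- Filter nodes to a single top-level namespace subtree by traversal prefix.
GRAPH KnowledgeGraph
MATCH (n:Node)
WHERE STARTS_WITH(n.traversal, @authorized_prefix)
RETURN n.id
```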

Security & Authorization Requirements

Robust Multi-tenancy & Data Isolation: The system must enforce strict data segregation between tenants (top-level namespaces). It must support creating logical or physical database separations per tenant to ensure customer data is never mixed, which is a critical security and compliance requirement.

  • Spanner limits an instance to 100 databases https://cloud.google.com/spanner/quotas#database-limits, which means that if we want to roll this out to every top-level namespace, we'd need (number of top-level namespaces / 100) instances.
  • Spanner creates a logical separation between databases, ensuring that customer data is never mixed.

Query-Time Authorization via Dynamic Filtering: The query language (Cypher) must support the injection of dynamic filters (e.g., WHERE clauses) based on claims from a JWT (like user-specific traversal IDs). This allows the service to enforce fine-grained, row-level security at query time. This ensures users can only access data they are authorized to see.
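As a sketch of the mechanism (the parameter name and claim shape are assumptions), the traversal IDs extracted from the JWT would be bound as query parameters rather than interpolated into the query text:

```sql
-- @traversal_ids is bound by the GKG service from the caller's JWT claims;
-- parameter binding keeps user input out of the query text entirely.
GRAPH KnowledgeGraph
MATCH (n:Node)
WHERE n.traversal IN UNNEST(@traversal_ids)
RETURN n.id
```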

Operational & Deployment Requirements

Deployment Model Flexibility: The solution must be deployable across all GitLab offerings, including GitLab.com, GitLab Dedicated, and self-managed instances.

  • There is no flexibility in the deployment model. The databases are managed by Google, in their infrastructure.
  • For self-managed customers who don't want to use GCP, we would need to offer the ability to start the Knowledge Graph with another database.

⚠️ Integration with GitLab Deployment Tooling: The database must be deployable and manageable through our standard toolchain, including CNG (Cloud Native GitLab) and Helm charts.

  • Google Spanner cannot be deployed using Helm charts. It would be best to provision it using an infrastructure-as-code tool such as Terraform.

⚠️ Low Operational Overhead: A managed solution or a database with low operational complexity is strongly preferred to minimize the on-call burden for our DBRE and Data Engineering teams. The chosen solution must be supportable by our internal teams.

  • The operational burden would mostly be offloaded to GCP. On our side, we would need to ensure correct routing to the database instances and resource scaling as we add more namespaces.

⚠️ Monitoring & Logging Integration: The database must integrate with our existing observability stack for monitoring, logging, and alerting - or at least be observable by the GKG Service.

  • Using Spanner would likely require us to keep the server-side monitoring in GCP. On our side, we can collect some metrics about latency, connection pools, client-side errors, etc., in our stack. The database instances' health metrics would originate in Google Cloud Monitoring and would require a custom solution if we want to pull them into our stack (though see the sketch below for one built-in option).
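One partial option: Spanner exposes built-in statistics tables that are queryable from the client side, so the GKG Service could scrape query-level metrics without going through Google Cloud Monitoring. A sketch:

```sql
-- Spanner's built-in statistics tables are readable like any other table;
-- here, the slowest queries over the last one-minute interval.
SELECT text, execution_count, avg_latency_seconds
FROM SPANNER_SYS.QUERY_STATS_TOP_MINUTE
ORDER BY avg_latency_seconds DESC
LIMIT 10;
```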

FinOps - what will be the associated costs of running an instance at the gitlab-org scale?

Other characteristics

  • Spanner requires us to define a strict schema for each table and join table, which is then used to define a graph layer on top of the relational database. This means we need to handle schema migrations when we want to update the database (see the sketch below).
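As a sketch of what such a migration could look like (the added column is a hypothetical example), adding a property means altering the backing table and then re-declaring the graph so the new column is exposed:

```sql
-- Add a new property to the backing node table...
ALTER TABLE Node ADD COLUMN language STRING(64);

-- ...then re-declare the property graph to pick up the new column.
CREATE OR REPLACE PROPERTY GRAPH KnowledgeGraph
  NODE TABLES (Node)
  EDGE TABLES (
    Edge
      SOURCE KEY (id) REFERENCES Node (id)
      DESTINATION KEY (to_id) REFERENCES Node (id)
  );
```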

Benchmark Results - COPY FROM

| Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size | Uncompressed Size | Description |
|---|---|---|---|---|---|
| `DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | | | | | Small dataset with minimal entities |
| `DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | | | | | Large dataset with moderate complexity |
| `DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 46080 }` | | | | | Huge dataset with large descriptions (46 KB, 5% ratio) |
| `DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 10000, num_issues_per_project: 200, num_mrs_per_project: 500, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 46080 }` | | | | | Giant dataset with large descriptions (46 KB, 5% ratio) |

Giant dataset Benchmark Results - Individual Queries

| Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
|---|---|---|---|---|---|
| Node Counts | | | | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) (5 000 000 entities) | `MATCH (mr:GitLabMergeRequest) RETURN count(mr)` |
| Issues Per Project (Top 10) | | | | Aggregation with relationship traversal and ordering: `(Issue)-[:BELONGS_TO_PROJECT]->(Project)` | `MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) RETURN p.name, count(i) as issue_count ORDER BY issue_count DESC LIMIT 10` |
| Most Active Users by Authored Issues | | | | Aggregation with relationship traversal: `(Issue)-[:AUTHORED_BY]->(User)` with ordering and limit | `MATCH (i:GitLabIssue)-[:AUTHORED_BY]->(u:GitLabUser) RETURN u.username, count(i) as issues_authored ORDER BY issues_authored DESC LIMIT 10` |
| Assigned Issues Count | | | | Simple relationship count: `(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN count(i) as assigned_issues_count` |
| Merge Requests Closing Issues | | | | Relationship traversal count: `(MergeRequest)-[:CLOSES_ISSUE]->(Issue)` | `MATCH (mr:GitLabMergeRequest)-[:CLOSES_ISSUE]->(i:GitLabIssue) RETURN count(mr) as mrs_closing_issues` |
| Issues in Milestones (Top 5) | | | | Aggregation with relationship traversal: `(Issue)-[:IN_MILESTONE]->(Milestone)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_MILESTONE]->(m:GitLabMilestone) RETURN m.title, count(i) as issues_in_milestone ORDER BY issues_in_milestone DESC LIMIT 5` |
| Epic to Issues Relationship (Top 5) | | | | Aggregation with relationship traversal: `(Issue)-[:IN_EPIC]->(Epic)` with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_EPIC]->(e:GitLabEpic) RETURN e.title, count(i) as issues_in_epic ORDER BY issues_in_epic DESC LIMIT 5` |
| Users with Most Projects | | | | Aggregation with relationship traversal: `(User)-[:MEMBER_OF_PROJECT]->(Project)` with ordering and limit | `MATCH (u:GitLabUser)-[:MEMBER_OF_PROJECT]->(p:GitLabProject) RETURN u.username, count(p) as project_count ORDER BY project_count DESC LIMIT 5` |
| Project → Issue → Assignee Chain | | | | Multi-hop traversal: `(Project)<-[:BELONGS_TO_PROJECT]-(Issue)-[:ASSIGNED_TO]->(User)` | `MATCH (p:GitLabProject)<-[:BELONGS_TO_PROJECT]-(i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN p.name, i.title, u.username LIMIT 10` |
| Relationship Counts by Type | | | | Global relationship aggregation by type across entire graph | `MATCH ()-[r]->() RETURN type(r) as rel_type, count(r) as count` |