[db] KG Database Selection - Google Spanner
Problem to Solve
Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to evaluate alternative graph databases that strongly overlap with Kuzu's querying capabilities and general performance, while also looking to ameliorate some of Kuzu's downsides.
The intent of this issue is to provide a broad analysis of the features deemed necessary for the GKGaaS effort, an overview of how Spanner Graph performs, and some notes on deployment and resource provisioning for the database.
Spanner https://cloud.google.com/spanner?hl=en
Analysis
- Spanner Graph appears to have been built on top of the existing Spanner relational database offering. Spanner Graph is a feature which integrates graph capabilities with Spanner. It requires defining nodes and edges as tables and then specifying the graph on top of the existing tables, primary keys, and foreign keys (see the schema and query sketch after this list).
- If we decide to go with a "slim" graph, we can get away with two simple tables (Node { id, traversal } and Edge { from, to }) and assign them different labels, but this will get messy if we change our minds later.
- Spanner Graph offers a querying language that is similar to openCypher, with a few differences which are listed here: https://cloud.google.com/spanner/docs/graph/opencypher-reference.
- ✅ Code Indexing: High volume of nodes with smaller data payloads (e.g., code symbols).
- ✅ SDLC Data: Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).
- Spanner supports importing from PostgreSQL via Datastream, and from Avro and CSV files via the Import API. For ClickHouse, we can export to Avro files and use the Import API, which handles massive datasets with proper parallelization, error handling, and throughput optimization.
- The database is not embeddable. Furthermore, there isn't any supported way to run Spanner itself locally in a Docker image. There is an emulator (https://docs.cloud.google.com/spanner/docs/emulator) for local development, but it is not specified whether it supports graph features, and it is not meant to be used as a local database.
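To make the table-backed graph model concrete, below is a minimal sketch of what the "slim" layout could look like through the Python client. The DDL shapes follow the Spanner Graph documentation, but the project, instance, database, table, and label names are hypothetical placeholders rather than a settled design:

```python
from google.cloud import spanner

# Hypothetical project/instance/database names, for illustration only.
client = spanner.Client(project="gkg-sandbox")
database = client.instance("gkg-instance").database("gkg-db")

# "Slim" layout: one node table, one edge table, and a property graph on top.
ddl = [
    """
    CREATE TABLE Node (
      id        STRING(64) NOT NULL,
      traversal STRING(MAX)
    ) PRIMARY KEY (id)
    """,
    """
    CREATE TABLE Edge (
      from_id STRING(64) NOT NULL,
      to_id   STRING(64) NOT NULL,
      CONSTRAINT FK_Edge_From FOREIGN KEY (from_id) REFERENCES Node (id),
      CONSTRAINT FK_Edge_To   FOREIGN KEY (to_id)   REFERENCES Node (id)
    ) PRIMARY KEY (from_id, to_id)
    """,
    """
    CREATE PROPERTY GRAPH KnowledgeGraph
      NODE TABLES (Node)
      EDGE TABLES (
        Edge
          SOURCE KEY (from_id) REFERENCES Node (id)
          DESTINATION KEY (to_id) REFERENCES Node (id)
          LABEL LINKS
      )
    """,
]
database.update_ddl(ddl).result()  # schema changes run as a long-running operation

# Query the graph with Spanner's GQL syntax (openCypher-like, but not identical).
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        """
        GRAPH KnowledgeGraph
        MATCH (a:Node)-[:LINKS]->(b:Node)
        RETURN a.id AS from_id, b.id AS to_id
        LIMIT 10
        """
    )
    for from_id, to_id in rows:
        print(from_id, to_id)
```

If we instead keep one table per entity type, the property graph definition simply lists more node and edge tables; the query side stays the same.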
Performance & Scalability Requirements
- Spanner is fundamentally an OLTP system designed for high-volume transactional writes. In August of this year, Google introduced a columnar engine which speeds up analytical queries. At the scale of multiple millions of nodes and edges, I think the columnar engine can easily do the job.
- Spanner does not appear to offer the same algorithm library that other graph databases do. This is limiting if we want to run PageRank, shortest path, etc.
- Spanner is built for global distribution, with replication across zones and simple horizontal scaling, providing a 99.999% SLA. It handles all of those aspects automatically, which results in low maintenance costs on our side.
- If we're going to bulk import using files, we should use Avro files, which are the row-based counterpart of Parquet files (see the export sketch after this list).
- Spanner is built for scale. It can, without a doubt, store billions of nodes and edges if need be.
- Here we should probably use the same approach as the monolith uses for PostgreSQL to store the information.
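As a rough illustration of the Avro route mentioned above, the sketch below streams rows out of ClickHouse and writes an Avro file that could then be handed to Spanner's Import API. The `clickhouse_connect`/`fastavro` choice, the source table name, and the record schema are assumptions for the sketch, not a decided pipeline:

```python
import clickhouse_connect           # assumed ClickHouse driver for this sketch
from fastavro import parse_schema, writer

# Illustrative record shape only; the real export would mirror our node/edge model.
NODE_SCHEMA = parse_schema({
    "type": "record",
    "name": "Node",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "traversal", "type": ["null", "string"], "default": None},
    ],
})

def export_nodes_to_avro(avro_path: str = "nodes.avro") -> None:
    client = clickhouse_connect.get_client(host="localhost")
    # Hypothetical source table; in practice this would be chunked and parallelized.
    result = client.query("SELECT id, traversal FROM nodes")
    records = ({"id": row[0], "traversal": row[1]} for row in result.result_rows)
    with open(avro_path, "wb") as out:
        writer(out, NODE_SCHEMA, records)
```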
Security & Authorization Requirements
- Spanner limits each instance to 100 databases (https://cloud.google.com/spanner/quotas#database-limits), which means that if we want to roll this out to every top-level namespace, we'd need (number of top-level namespaces / 100) instances.
- Spanner creates logical separation between databases, ensuring that customer data is never mixed.
- The GKG service can apply query-time filters (e.g., WHERE clauses) based on claims from a JWT (like user-specific traversal IDs), as sketched below. This allows the service to enforce fine-grained, row-level security at query time and ensures users can only access data they are authorized to see.
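A minimal sketch of that query-time filtering, assuming the JWT carries a `traversal_ids` claim and that we filter on the slim model's `traversal` column; the claim name, key handling, and exact GQL predicate are placeholders rather than a finished authorization design:

```python
import jwt  # PyJWT, assumed here for decoding the caller's token
from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

client = spanner.Client(project="gkg-sandbox")
database = client.instance("gkg-instance").database("gkg-db")

def nodes_visible_to(token: str, signing_key: str) -> list:
    # Assumption: the token's `traversal_ids` claim lists the namespace paths
    # the caller is allowed to read.
    claims = jwt.decode(token, signing_key, algorithms=["RS256"])
    allowed = claims["traversal_ids"]

    query = """
        GRAPH KnowledgeGraph
        MATCH (n:Node)
        WHERE n.traversal IN UNNEST(@allowed)
        RETURN n.id AS id
        LIMIT 100
    """
    with database.snapshot() as snapshot:
        return list(
            snapshot.execute_sql(
                query,
                params={"allowed": allowed},
                param_types={"allowed": param_types.Array(param_types.STRING)},
            )
        )
```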
Operational & Deployment Requirements
- There is no flexibility in the deployment model. The databases are managed by Google in their infrastructure.
- For self-managed customers who don't want to use GCP, we would need to offer the ability to start the Knowledge Graph with another database.
- Google Spanner cannot be deployed using Helm charts. It would be best to provision it using an infrastructure-as-code tool such as Terraform.
- The operational burden would mostly be offloaded to GCP. On our side, we would need to ensure correct routing to the database instances and resource scaling as we add more namespaces.
- Using Spanner would likely require us to keep the server-side monitoring in GCP. On our side, we can collect some metrics about latency, connection pools, client-side errors, etc., in our stack. The database instances' health metrics would natively live in Google Cloud Monitoring and would require a custom solution if we want to pull them into our stack.
- ❓ Open question: how does this approach hold up at gitlab-org scale?
❗ Other characteristics
- Spanner requires us to define a strict schema for each table and join table, which can then be used to define a graph layer on top of the relational database. This means we need to handle migrations when we want to update the schema (see the sketch below).
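For example, adding a property to the slim node table would be an online DDL change followed by refreshing the graph definition. A sketch under the same hypothetical names as above, assuming `CREATE OR REPLACE PROPERTY GRAPH` is available as the Spanner Graph docs suggest:

```python
from google.cloud import spanner

database = spanner.Client(project="gkg-sandbox").instance("gkg-instance").database("gkg-db")

# Illustrative migration: add a column, then refresh the graph definition on top
# of the altered table. Object names are hypothetical.
migration = [
    "ALTER TABLE Node ADD COLUMN updated_at TIMESTAMP",
    """
    CREATE OR REPLACE PROPERTY GRAPH KnowledgeGraph
      NODE TABLES (Node)
      EDGE TABLES (
        Edge
          SOURCE KEY (from_id) REFERENCES Node (id)
          DESTINATION KEY (to_id) REFERENCES Node (id)
          LABEL LINKS
      )
    """,
]
database.update_ddl(migration).result()  # DDL is applied as a long-running operation
```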
Benchmark Results - COPY FROM
| Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size | Uncompressed size | Description |
|---|---|---|---|---|---|
| DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 } | | | | | Small dataset with minimal entities. |
| DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 } | | | | | Large dataset with moderate complexity |
| DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 46080 } | | | | | Huge dataset with large descriptions (46KB, 5% ratio) |
| DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 10000, num_issues_per_project: 200, num_mrs_per_project: 500, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 46080 } | | | | | Giant dataset with large descriptions (46KB, 5% ratio) |
Giant dataset Benchmark Results - Individual Queries
| Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
|---|---|---|---|---|---|
| Node Counts | | | | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) | MATCH (mr:GitLabMergeRequest) RETURN count(mr) (5 000 000 entities) |
| Issues Per Project (Top 10) | | | | Aggregation with relationship traversal and ordering: (Issue)-[:BELONGS_TO_PROJECT]->(Project) | MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) RETURN p.name, count(i) as issue_count ORDER BY issue_count DESC LIMIT 10 |
| Most Active Users by Authored Issues | | | | Aggregation with relationship traversal: (Issue)-[:AUTHORED_BY]->(User) with ordering and limit | MATCH (i:GitLabIssue)-[:AUTHORED_BY]->(u:GitLabUser) RETURN u.username, count(i) as issues_authored ORDER BY issues_authored DESC LIMIT 10 |
| Assigned Issues Count | | | | Simple relationship count: (Issue)-[:ASSIGNED_TO]->(User) | MATCH (i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN count(i) as assigned_issues_count |
| Merge Requests Closing Issues | | | | Relationship traversal count: (MergeRequest)-[:CLOSES_ISSUE]->(Issue) | MATCH (mr:GitLabMergeRequest)-[:CLOSES_ISSUE]->(i:GitLabIssue) RETURN count(mr) as mrs_closing_issues |
| Issues in Milestones (Top 5) | | | | Aggregation with relationship traversal: (Issue)-[:IN_MILESTONE]->(Milestone) with ordering and limit | MATCH (i:GitLabIssue)-[:IN_MILESTONE]->(m:GitLabMilestone) RETURN m.title, count(i) as issues_in_milestone ORDER BY issues_in_milestone DESC LIMIT 5 |
| Epic to Issues Relationship (Top 5) | | | | Aggregation with relationship traversal: (Issue)-[:IN_EPIC]->(Epic) with ordering and limit | MATCH (i:GitLabIssue)-[:IN_EPIC]->(e:GitLabEpic) RETURN e.title, count(i) as issues_in_epic ORDER BY issues_in_epic DESC LIMIT 5 |
| Users with Most Projects | | | | Aggregation with relationship traversal: (User)-[:MEMBER_OF_PROJECT]->(Project) with ordering and limit | MATCH (u:GitLabUser)-[:MEMBER_OF_PROJECT]->(p:GitLabProject) RETURN u.username, count(p) as project_count ORDER BY project_count DESC LIMIT 5 |
| Project → Issue → Assignee Chain | | | | Multi-hop traversal: (Project)<-[:BELONGS_TO_PROJECT]-(Issue)-[:ASSIGNED_TO]->(User) | MATCH (p:GitLabProject)<-[:BELONGS_TO_PROJECT]-(i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN p.name, i.title, u.username LIMIT 10 |
| Relationship Counts by Type | | | | Global relationship aggregation by type across entire graph | MATCH ()-[r]->() RETURN type(r) as rel_type, count(r) as count |