[db] KG Database Selection - Memgraph
## Problem to Solve
Given the sudden deprecation of Kuzu, the GKG team will be accelerating its efforts to evaluate alternative graph databases that strongly overlap with Kuzu's querying capabilities and general performance, while also looking to ameliorate some of Kuzu's downsides.
The intent of this issue is to provide a broad analysis of the features deemed necessary for the GKGaaS effort, an overview of how Memgraph performs, and some notes on deployment/resource provisioning for the DB.
Memgraph https://memgraph.com/
## Analysis
- As described in the documentation, Memgraph uses its own implementation of openCypher, named Memgraph Cypher. They aim to keep their query language as close as possible to openCypher, with some caveats: the differences are most notable in index creation, constraint creation, algorithms, functions, and some syntax elements.
- ⚪ Code Indexing: High volume of nodes with smaller data payloads (e.g., code symbols).
- ⚪ SDLC Data: Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).
- Both Memgraph and ClickHouse support Apache Arrow, which could be used to query the relevant data against the data lake.
- Memgraph does not offer an embeddable database, and their roadmap makes no mention of it eventually being supported. Their C++ code is open source, so we could in theory create Rust bindings, but that would mean taking on the maintenance burden of wrapping a large, actively evolving C++ codebase and investing resources in a solution that diverges from the vendor's intended use case.
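As a concrete illustration of the Memgraph Cypher divergence noted above, index-creation DDL differs between Memgraph and Neo4j-style openCypher. The sketch below is illustrative only: the `Issue` label, `iid` property, and the helper function are assumptions, and both statements should be checked against the current vendor docs.

```python
# Index-creation syntax: Memgraph Cypher vs. Neo4j-style openCypher.
# Label `Issue` and property `iid` are illustrative, not from a real schema.
MEMGRAPH_INDEX = "CREATE INDEX ON :Issue(iid);"
NEO4J_STYLE_INDEX = "CREATE INDEX FOR (i:Issue) ON (i.iid);"

def index_ddl(engine: str) -> str:
    """Return the index statement for the given engine (hypothetical helper)."""
    return MEMGRAPH_INDEX if engine == "memgraph" else NEO4J_STYLE_INDEX
```

This kind of divergence is why any query layer we build should keep DDL generation behind a small abstraction rather than hard-coding one dialect.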
## Performance & Scalability Requirements
- Memgraph supports an analytical storage mode, which speeds up import and data analysis at the cost of having no ACID guarantees; this is acceptable in our query-heavy context.
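A minimal sketch of how a bulk import could be wrapped in the analytical storage mode. The `STORAGE MODE` statements follow the Memgraph documentation; the helper function is hypothetical, and whether to switch back to transactional mode afterwards depends on the workload.

```python
# Wrap a bulk import so it runs without ACID overhead, then restore the
# default mode. Statement ordering is the whole point of the helper.
ANALYTICAL = "STORAGE MODE IN_MEMORY_ANALYTICAL;"
TRANSACTIONAL = "STORAGE MODE IN_MEMORY_TRANSACTIONAL;"

def import_plan(statements: list[str]) -> list[str]:
    """Surround import statements with storage-mode switches (hypothetical helper)."""
    return [ANALYTICAL, *statements, TRANSACTIONAL]
```

Each statement in the resulting plan would then be sent over Bolt in order.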
## Security & Authorization Requirements
- Memgraph supports multi-tenancy in their Enterprise Edition package, which allows the creation of multiple isolated databases on a single instance. All the databases of a single instance share the same underlying resources, meaning there is no CPU or RAM isolation.
- Query filters (e.g., `WHERE` clauses) can be injected based on claims from a JWT (like user-specific traversal IDs). This allows the service to enforce fine-grained, row-level security at query time, ensuring users can only access data they are authorized to see.
- This is supported because Memgraph Cypher is based on openCypher, which allows querying on specific properties. Index support appears to be fairly flexible, so we should be able to maintain efficiency even with these filters.
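The claim-based filtering described above can be sketched as follows. Assumptions: the JWT has already been verified upstream and carries a `traversal_ids` claim, and the `Issue` label and `namespace_id` property are hypothetical schema names.

```python
# Build a parameterized Cypher query scoped to the caller's namespaces.
def scoped_issue_query(claims: dict) -> tuple[str, dict]:
    """Inject a row-level security filter derived from JWT claims."""
    query = (
        "MATCH (i:Issue) "
        "WHERE i.namespace_id IN $traversal_ids "  # filter from the JWT claim
        "RETURN i.title"
    )
    # Pass claim values as query parameters; never string-interpolate them
    # into the query text, to avoid Cypher injection.
    return query, {"traversal_ids": claims["traversal_ids"]}
```

An index on the filtered property (here `namespace_id`) is what keeps this pattern efficient at scale.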
## Operational & Deployment Requirements
- Memgraph can be deployed in a highly available, self-managed cloud environment, or we can leverage their cloud solution, which handles the infrastructure for us.
- Memgraph adoption by self-managed customers will incur costs if they want the Enterprise Edition, which offers metrics, monitoring, audit logs, and multi-tenancy.
- Memgraph offers their own Helm charts https://memgraph.github.io/helm-charts/.
- Real-time logs: Memgraph supports log streaming through WebSocket. Logs are also available as files on each Memgraph instance at `/var/log/memgraph`, so a service such as vector.dev would allow us to forward the logs directly to Elasticsearch so they can be found in Kibana.
- Audit logs: Memgraph supports audit logs that contain information about queries run by users (`<timestamp>,<address>,<username>,<query>,<params>`). This is only available as part of the Enterprise Edition.
- Metrics: Memgraph offers an endpoint exposing many system metrics, which can be used for observability when using the Enterprise Edition. For our observability stack (Grafana + Grafana Mimir), we will need to set up their prometheus-exporter, which needs to know all the available instances.
## Benchmark Results - Dataset Configuration

| Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size (MB) | Storage Efficiency (ratio) | Description |
|---|---|---|---|---|---|
| `DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Small dataset with minimal entities |
| `DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Large dataset with moderate complexity |
| `DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.025, long_description_size_bytes: 393216 }` | - | - | - | - | Huge dataset with large descriptions (384KB, 2.5% ratio) |
| `DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 60000, num_issues_per_project: 200, num_mrs_per_project: 1000, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 524288 }` | - | - | - | - | Giant dataset with large descriptions (512KB, 5% ratio) |
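To get a feel for the scale each configuration implies, the node counts can be derived directly from the `DatasetConfig` fields. The dataclass below is a hypothetical Python restatement of the Rust-style structs in the table (relationships and long-description payloads are not counted).

```python
from dataclasses import dataclass

@dataclass
class DatasetConfig:
    """Mirror of the benchmark's DatasetConfig fields (node-generating only)."""
    num_users: int
    num_groups: int
    num_projects: int
    num_issues_per_project: int
    num_mrs_per_project: int
    num_epics_per_group: int
    num_milestones_per_project: int

    def total_nodes(self) -> int:
        """Sum of all entity nodes implied by the configuration."""
        return (
            self.num_users
            + self.num_groups
            + self.num_projects
            + self.num_projects * self.num_issues_per_project
            + self.num_projects * self.num_mrs_per_project
            + self.num_groups * self.num_epics_per_group
            + self.num_projects * self.num_milestones_per_project
        )

small = DatasetConfig(50, 5, 20, 10, 5, 2, 2)
print(small.total_nodes())  # 425 nodes for the small configuration
```

The same arithmetic applied to the giant configuration puts the node count in the tens of millions, which is the regime where storage mode and import throughput matter most.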
## Benchmark Results - Individual Queries

| Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
|---|---|---|---|---|---|
| Node Counts | - | - | - | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) | |
| Issues Per Project (Top 10) | - | - | - | Aggregation with relationship traversal and ordering | |
| Most Active Users by Authored Issues | - | - | - | Aggregation with relationship traversal | |
| Assigned Issues Count | - | - | - | Simple relationship count | |
| Merge Requests Closing Issues | - | - | - | Relationship traversal count | |
| Issues in Milestones (Top 5) | - | - | - | Aggregation with relationship traversal | |
| Epic to Issues Relationship (Top 5) | - | - | - | Aggregation with relationship traversal | |
| Users with Most Projects | - | - | - | Aggregation with relationship traversal | |
| Project → Issue → Assignee Chain | - | - | - | Multi-hop traversal | |
| Relationship Counts by Type | - | - | - | Global relationship aggregation by type across entire graph | |
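The Cypher Query column above is not yet populated; as a sketch of what one entry could look like, here is a hypothetical query for the "Issues Per Project (Top 10)" row. The `Project`/`Issue` labels match the dataset entities, but the `BELONGS_TO` relationship type and `name` property are assumptions, since the benchmark schema is not shown here.

```python
# Candidate Cypher for "Issues Per Project (Top 10)": aggregation with
# relationship traversal and ordering, capped at 10 rows.
ISSUES_PER_PROJECT_TOP_10 = (
    "MATCH (p:Project)<-[:BELONGS_TO]-(i:Issue) "
    "RETURN p.name AS project, count(i) AS issue_count "
    "ORDER BY issue_count DESC LIMIT 10"
)
```

Once the benchmark harness is finalized, the real queries should replace sketches like this and be pasted into the table verbatim.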