
[db] KG Database Selection - Memgraph

Problem to Solve

Given the sudden deprecation of Kuzu, the GKG team will accelerate its efforts to evaluate alternative graph databases that strongly overlap with Kuzu's querying capabilities and general performance, while also aiming to address some of Kuzu's downsides.

The intent of this issue is to provide a broad analysis of the features deemed necessary to the GKGaaS effort, an overview of how Memgraph performs, and some notes on deployment/resource provisioning for the DB.

Memgraph https://memgraph.com/

Analysis

Property Graph Model: The database must be a native property graph database. Our data model relies on nodes, relationships (edges), and properties attached to both. Alternative models like RDF are not suitable.

Cypher Query Language Support: The database must support the openCypher query language. Cypher is a standardized, SQL-like language for graphs that integrates well with Large Language Models (LLMs) and provides a common interface for our services.

  • As described in the documentation, Memgraph uses its own implementation of openCypher, named Memgraph Cypher. They aim to keep the query language as close to openCypher as possible, but with some caveats. These differences are most notable in index creation, constraint creation, algorithms, functions, and some syntax elements (see the sketch below).
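
For reference, a short sketch of how index and constraint creation look in Memgraph Cypher, based on Memgraph's documented syntax; the labels and properties below are illustrative rather than our final schema:

```cypher
// Memgraph-style label and label+property indexes.
CREATE INDEX ON :GitLabIssue;
CREATE INDEX ON :GitLabIssue(project_id);

// Memgraph-style uniqueness and existence constraints.
CREATE CONSTRAINT ON (u:GitLabUser) ASSERT u.username IS UNIQUE;
CREATE CONSTRAINT ON (i:GitLabIssue) ASSERT EXISTS (i.id);
```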

Mixed Workload Versatility: The database must perform efficiently under two distinct workload profiles:

  • Code Indexing: High volume of nodes with smaller data payloads (e.g., code symbols).
  • SDLC Data: Lower volume of nodes with larger data payloads (e.g., issue titles, descriptions, and other long-form text).

Support for Indexing at Scale: The database must be compatible with an indexing pipeline that performs lookups against an external data lake (like ClickHouse or PostgreSQL) to enrich graph data with necessary context (e.g., top-level namespace IDs) before ingestion.

  • Both Memgraph and ClickHouse support Apache Arrow, which could be used to exchange the relevant data with the data lake efficiently.

(Nice-to-have) Embeddable: Offer an embeddable distribution, as Kuzu did, for the local code indexing Knowledge Graph edition.

  • Memgraph does not offer an embeddable database, and their roadmap makes no mention of one being planned. Their C++ code is open source, which means we could in theory create Rust bindings, although I'm not sure we want to take on the maintenance burden of wrapping a large, actively evolving C++ codebase and invest resources in a solution that diverges from the vendor's intended use case.

Performance & Scalability Requirements

Optimized for Analytical (OLAP) Workloads: The Knowledge Graph is a read-heavy, analytical system. The database must be optimized for complex OLAP-style queries and does not require high-frequency, user-driven transactional (OLTP) write capabilities.

  • Memgraph supports an analytical storage mode, which speeds up import and data analysis at the cost of dropping ACID guarantees; this trade-off is acceptable in our query-heavy context (see the sketch below).
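
For illustration, switching between storage modes is a query-level command in Memgraph; a minimal sketch based on their documented storage modes (exact defaults should be confirmed for the version we deploy):

```cypher
// Switch the instance to analytical mode: faster imports and analytics, no ACID guarantees.
STORAGE MODE IN_MEMORY_ANALYTICAL;

// Revert to the default transactional mode when ACID guarantees are required.
STORAGE MODE IN_MEMORY_TRANSACTIONAL;

// Inspect which storage mode the instance is currently running.
SHOW STORAGE INFO;
```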

High Availability & Horizontal Scalability: The solution must be architected for high availability with multi-node, multi-replica configurations to eliminate single points of failure and scale horizontally to handle the growth of GitLab.com.

High Write Speed & Bulk Import Capability: The database must demonstrate high-performance write throughput and, critically, support efficient bulk importing of data from prepared files (e.g., Parquet). This is essential for our ETL pipeline, which processes data from a data lake before loading it into the graph.

High Node Count Support: The database must be capable of storing and querying graphs containing an extremely high number of nodes and edges to accommodate enterprise-scale codebases and years of SDLC history.

Efficient Filtering by Traversal IDs: The database must be able to efficiently execute queries that filter nodes and relationships based on a list of traversal IDs to support our authorization model.

Security & Authorization Requirements

Robust Multi-tenancy & Data Isolation: The system must enforce strict data segregation between tenants (top-level namespaces). It must support creating logical or physical database separations per tenant to ensure customer data is never mixed, which is a critical security and compliance requirement.

  • Memgraph supports multi-tenancy in its Enterprise Edition, which allows the creation of multiple isolated databases on a single instance. All databases on a single instance share the same underlying resources, meaning there is no CPU or RAM isolation (see the sketch below).
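
An illustrative sketch of per-tenant isolation with Memgraph's Enterprise multi-tenancy commands (the database name is a placeholder; exact privileges and syntax should be verified against the Enterprise documentation):

```cypher
// Create an isolated logical database per top-level namespace (Enterprise Edition).
CREATE DATABASE tenant_gitlab_org;

// Point the current session at that tenant's database.
USE DATABASE tenant_gitlab_org;

// List all databases available on this instance.
SHOW DATABASES;
```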

Query-Time Authorization via Dynamic Filtering: The query language (Cypher) must support the injection of dynamic filters (e.g., WHERE clauses) based on claims from a JWT (like user-specific traversal IDs). This allows the service to enforce fine-grained, row-level security at query time, ensuring users can only access data they are authorized to see.

  • This is supported because Memgraph Cypher is based on openCypher, which allows filtering on specific properties. Index support appears to be quite flexible, so we should be able to maintain query efficiency even with these filters (see the sketch below).
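
A minimal sketch of query-time authorization, assuming the service binds the JWT's traversal IDs as a query parameter and that nodes carry a traversal_id property (both are assumptions; names are illustrative):

```cypher
// Index the property used by the authorization filter so it stays cheap at scale.
CREATE INDEX ON :GitLabIssue(traversal_id);

// $allowed_traversal_ids is bound by the service from the caller's JWT claims.
MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject)
WHERE i.traversal_id IN $allowed_traversal_ids
RETURN p.name, i.title
LIMIT 50;
```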

Operational & Deployment Requirements

⚠️ Deployment Model Flexibility: The solution must be deployable across all GitLab offerings, including GitLab.com, GitLab Dedicated, and self-managed instances.

  • Memgraph can be deployed in a highly available, self-managed cloud environment, or we can leverage their managed cloud offering, which handles the infrastructure for us.
  • Memgraph adoption for self-managed customers will incur costs if they want the Enterprise Edition, which offers metrics, monitoring, audit logs, and multi-tenancy.

Integration with GitLab Deployment Tooling: The database must be deployable and manageable through our standard toolchain, including CNG (Cloud Native GitLab) and Helm charts.

Low Operational Overhead: A managed solution or a database with low operational complexity is strongly preferred to minimize the on-call burden for our DBRE and Data Engineering teams. The chosen solution must be supportable by our internal teams.

Monitoring & Logging Integration: The database must integrate with our existing observability stack for monitoring, logging, and alerting - or at least be observable by the GKG Service.

  • Real-time logs: Memgraph supports log streaming over WebSocket. Logs are also available as files on each Memgraph instance at /var/log/memgraph, so a service such as vector.dev could forward them directly to Elasticsearch, where they would be searchable in Kibana.
  • Audit logs: Memgraph supports audit logs that record the queries run by users (<timestamp>,<address>,<username>,<query>,<params>). This is only available in the Enterprise Edition.
  • Metrics: Memgraph offers an endpoint exposing a large set of system metrics, which can be used for observability when using the Enterprise Edition. For our observability stack (Grafana + Grafana Mimir), we will need to set up their prometheus-exporter, which needs to know about all the available instances.

FinOps - what will the associated costs be to run an instance at gitlab-org scale?

Benchmark Results - COPY FROM

| Dataset Configuration | Bulk Import Time (s) | Import Throughput (records/s) | Final Database Size (MB) | Storage Efficiency (ratio) | Description |
|---|---|---|---|---|---|
| `DatasetConfig { num_users: 50, num_groups: 5, num_projects: 20, num_issues_per_project: 10, num_mrs_per_project: 5, num_epics_per_group: 2, num_milestones_per_project: 2, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Small dataset with minimal entities |
| `DatasetConfig { num_users: 1000, num_groups: 50, num_projects: 500, num_issues_per_project: 100, num_mrs_per_project: 50, num_epics_per_group: 20, num_milestones_per_project: 10, long_description_ratio: 0.0, long_description_size_bytes: 0 }` | - | - | - | - | Large dataset with moderate complexity |
| `DatasetConfig { num_users: 10000, num_groups: 200, num_projects: 6000, num_issues_per_project: 20, num_mrs_per_project: 100, num_epics_per_group: 10, num_milestones_per_project: 4, long_description_ratio: 0.025, long_description_size_bytes: 393216 }` | - | - | - | - | Huge dataset with large descriptions (384KB, 2.5% ratio) |
| `DatasetConfig { num_users: 100000, num_groups: 2000, num_projects: 60000, num_issues_per_project: 200, num_mrs_per_project: 1000, num_epics_per_group: 100, num_milestones_per_project: 4, long_description_ratio: 0.05, long_description_size_bytes: 524288 }` | - | - | - | - | Giant dataset with large descriptions (512KB, 5% ratio) |

Benchmark Results - Individual Queries

| Query | Average Latency (ms) | p95 Latency (ms) | Throughput (queries/s) | Description | Cypher Query |
|---|---|---|---|---|---|
| Node Counts | - | - | - | Count nodes by type (Users, Groups, Projects, Issues, MergeRequests, Epics, Milestones) | `MATCH (u:GitLabUser) RETURN count(u)` (and similar for each type) |
| Issues Per Project (Top 10) | - | - | - | Aggregation with relationship traversal and ordering: (Issue)-[:BELONGS_TO_PROJECT]->(Project) | `MATCH (i:GitLabIssue)-[:BELONGS_TO_PROJECT]->(p:GitLabProject) RETURN p.name, count(i) as issue_count ORDER BY issue_count DESC LIMIT 10` |
| Most Active Users by Authored Issues | - | - | - | Aggregation with relationship traversal: (Issue)-[:AUTHORED_BY]->(User) with ordering and limit | `MATCH (i:GitLabIssue)-[:AUTHORED_BY]->(u:GitLabUser) RETURN u.username, count(i) as issues_authored ORDER BY issues_authored DESC LIMIT 10` |
| Assigned Issues Count | - | - | - | Simple relationship count: (Issue)-[:ASSIGNED_TO]->(User) | `MATCH (i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN count(i) as assigned_issues_count` |
| Merge Requests Closing Issues | - | - | - | Relationship traversal count: (MergeRequest)-[:CLOSES_ISSUE]->(Issue) | `MATCH (mr:GitLabMergeRequest)-[:CLOSES_ISSUE]->(i:GitLabIssue) RETURN count(mr) as mrs_closing_issues` |
| Issues in Milestones (Top 5) | - | - | - | Aggregation with relationship traversal: (Issue)-[:IN_MILESTONE]->(Milestone) with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_MILESTONE]->(m:GitLabMilestone) RETURN m.title, count(i) as issues_in_milestone ORDER BY issues_in_milestone DESC LIMIT 5` |
| Epic to Issues Relationship (Top 5) | - | - | - | Aggregation with relationship traversal: (Issue)-[:IN_EPIC]->(Epic) with ordering and limit | `MATCH (i:GitLabIssue)-[:IN_EPIC]->(e:GitLabEpic) RETURN e.title, count(i) as issues_in_epic ORDER BY issues_in_epic DESC LIMIT 5` |
| Users with Most Projects | - | - | - | Aggregation with relationship traversal: (User)-[:MEMBER_OF_PROJECT]->(Project) with ordering and limit | `MATCH (u:GitLabUser)-[:MEMBER_OF_PROJECT]->(p:GitLabProject) RETURN u.username, count(p) as project_count ORDER BY project_count DESC LIMIT 5` |
| Project → Issue → Assignee Chain | - | - | - | Multi-hop traversal: (Project)<-[:BELONGS_TO_PROJECT]-(Issue)-[:ASSIGNED_TO]->(User) | `MATCH (p:GitLabProject)<-[:BELONGS_TO_PROJECT]-(i:GitLabIssue)-[:ASSIGNED_TO]->(u:GitLabUser) RETURN p.name, i.title, u.username LIMIT 10` |
| Relationship Counts by Type | - | - | - | Global relationship aggregation by type across entire graph | `MATCH ()-[r]->() RETURN type(r) as rel_type, count(r) as count` |