docs: move GKG design documents to knowledge-graph repository (41dd8da4)

content/handbook/engineering/architecture/design-documents/gitlab_knowledge_graph/_index.md

+37 −439

File changed.

Preview size limit exceeded, changes collapsed.

content/handbook/engineering/architecture/design-documents/gitlab_knowledge_graph/data_model.md

deleted 100644 → 0

+0 −147

		---
		title: Knowledge Graph Data Model
		description: An overview of the nodes and relationships in the GitLab Knowledge Graph, focusing on the Namespace and Code graphs.
		---

		## Overview

		The GitLab Knowledge Graph is composed of two primary sub-graphs that share a common schema foundation: the Namespace (SDLC) Graph and the Code Graph. This document details the data model for each, covering the nodes and relationships that constitute them.

		The data model is designed to be intuitive and to mirror the mental model that developers and users have of the GitLab platform. By representing entities as nodes and their interactions as relationships, we can perform complex queries that would be difficult or inefficient with a traditional relational database.

		The data model follows a [Property Graph](https://neo4j.com/blog/knowledge-graph/rdf-vs-property-graphs-knowledge-graphs/) approach over an RDF approach, as GitLab data has strongly defined relationships between entities.

		> We can enable custom node and relationship expansion in the future by following the Property Graph approach and building the correct schema management capabilities.

		## Data Storage Location

		The Knowledge Graph data will be stored in tables that are separate from the existing tables used for analytics.

		- For GitLab.com, the Knowledge Graph tables will be stored in a dedicated ClickHouse instance.
		- For GitLab Dedicated or self-managed instances, the Knowledge Graph tables can be stored in a separate database or instance. This will be left to the discretion of the instance owner.

		## Concepts to Know

		- Unified Schema: Both the Code Graph and the SDLC Graph are built on the same foundational schema provided by `crates/database`. This allows for linking between the two graphs (e.g., a `Project` node from the SDLC graph can be linked to a `File` node from the Code Graph).
		- Entity as Node: Every entity in the GitLab ecosystem (e.g., Project, Issue, File, Function Definition) is represented as a node.
		- Interaction as Edge: Relationships between these entities (e.g., a User `COMMENTS_ON` an Issue, a `File` `CONTAINS` a `Definition`) are represented as directed edges.
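As a minimal sketch of the "entity as node, interaction as edge" idea, the snippet below models nodes and directed edges as plain Python dataclasses. The `Node` and `Edge` classes, the `describe` helper, and all sample property values are hypothetical and exist only for illustration; the actual schema lives in `crates/database`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, e.g. a User, an Issue, or a File."""
    label: str                                       # node type, such as "Issue"
    properties: dict = field(default_factory=dict)   # key properties of the entity


@dataclass
class Edge:
    """A directed relationship from one node to another."""
    kind: str      # relationship type, such as "COMMENTS_ON"
    source: Node
    target: Node


def describe(edge: Edge) -> str:
    """Render an edge in a compact arrow notation (illustrative helper)."""
    return f"{edge.source.label} -[{edge.kind}]-> {edge.target.label}"


# Sample data is made up for the example.
user = Node("User", {"id": 1, "username": "alice"})
issue = Node("Issue", {"id": 10, "iid": 3, "title": "Fix login", "state": "opened"})
source_file = Node("File", {"relative_path": "src/auth.rs", "language": "rust"})
definition = Node("Definition", {"fully_qualified_name": "auth::login"})

edges = [
    Edge("COMMENTS_ON", user, issue),
    Edge("CONTAINS", source_file, definition),
]

for edge in edges:
    print(describe(edge))
```

Because both graphs share one node and edge shape, an edge can connect an SDLC node to a Code node without any special casing.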

		---

		## The Namespace Graph Data Model

		The Namespace Graph represents the Software Development Life Cycle (SDLC) entities and their interactions within GitLab. It models how users, projects, issues, merge requests, and CI/CD components relate to one another.

		### Example Node Types

		| Node Type | Description | Key Properties |
		| --- | --- | --- |
		| `Namespace` | Represents a GitLab group or user namespace. | `id`, `name`, `full_path`, `type` (Group or User) |
		| `Project` | Represents a GitLab project/repository. | `id`, `name`, `full_path`, `namespace_id` |
		| `Issue` | Represents a GitLab issue. | `id`, `iid`, `title`, `state`, `project_id`, `author_id` |
		| `MergeRequest` | Represents a GitLab merge request. | `id`, `iid`, `title`, `state`, `source_branch`, `target_branch`, `project_id` |
		| `Pipeline` | Represents a CI/CD pipeline. | `id`, `status`, `source`, `project_id`, `user_id` |
		| `Vulnerability` | Represents a security vulnerability finding. | `id`, `title`, `severity`, `state`, `project_id` |
		| `User` | Represents a GitLab user. | `id`, `username`, `name` |
		| `Note` | Represents a comment on an issue, merge request, or epic. | `id`, `body`, `author_id`, `project_id` |
		| `Epic` | Represents a GitLab epic. | `id`, `iid`, `title`, `state`, `group_id`, `author_id` |
		| `Branch` | Represents a Git branch. | `name`, `project_id` |

		### Relationship Visualization

		```mermaid
		graph TD
		Namespace -- CONTAINS --> Project
		Project -- HAS_ISSUE --> Issue
		Project -- HAS_MERGE_REQUEST --> MergeRequest
		Project -- HAS_PIPELINE --> Pipeline
		Project -- HAS_VULNERABILITY --> Vulnerability
		Project -- HAS_BRANCH --> Branch

		User -- CREATED --> Issue
		User -- CREATED --> MergeRequest
		User -- CREATED --> Epic
		User -- COMMENTS_ON --> Issue
		User -- COMMENTS_ON --> MergeRequest
		User -- COMMENTS_ON --> Epic
		Note -- IS_COMMENT_ON --> Issue
		Note -- IS_COMMENT_ON --> MergeRequest
		Note -- IS_COMMENT_ON --> Epic

		MergeRequest -- MERGES_TO --> Branch
		MergeRequest -- RELATED_TO --> Issue
		Pipeline -- TRIGGERED_FOR --> MergeRequest
		Pipeline -- TRIGGERED_FOR --> Branch
		```

		### Relationship Types

		| Relationship | From Node | To Node | Description |
		| --- | --- | --- | --- |
		| `CONTAINS` | `Namespace` | `Project` | A namespace contains a project. |
		| `HAS_ISSUE` | `Project` | `Issue` | A project has an issue. |
		| `HAS_MERGE_REQUEST` | `Project` | `MergeRequest` | A project has a merge request. |
		| `HAS_PIPELINE` | `Project` | `Pipeline` | A project has a CI/CD pipeline. |
		| `HAS_VULNERABILITY` | `Project` | `Vulnerability` | A project has a vulnerability finding. |
		| `HAS_BRANCH` | `Project` | `Branch` | A project has a branch. |
		| `CREATED` | `User` | `Issue`, `MergeRequest`... | A user created an entity. |
		| `COMMENTS_ON` | `User` | `Issue`, `MergeRequest`... | A user commented on an entity (via a `Note`). |
		| `IS_COMMENT_ON` | `Note` | `Issue`, `MergeRequest`... | A note is a comment on a specific entity. |
		| `MERGES_TO` | `MergeRequest` | `Branch` | A merge request targets a specific branch for merging. |
		| `RELATED_TO` | `MergeRequest` | `Issue` | A merge request is related to or closes an issue. |
		| `TRIGGERED_FOR` | `Pipeline` | `MergeRequest`, `Branch` | A pipeline was triggered for a merge request or a branch push. |
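Because every relationship has a fixed set of source and target node types, an indexer could reject malformed edges before writing them. The sketch below encodes the Namespace Graph relationships as a lookup table; the `ALLOWED_EDGES` name and the `is_valid_edge` helper are assumptions made for this example, and the target types for `CREATED`, `COMMENTS_ON`, and `IS_COMMENT_ON` are limited to those shown in the relationship diagram.

```python
# Relationship type -> set of permitted (from_node_type, to_node_type) pairs.
# Hypothetical structure; it mirrors the Namespace Graph diagram, not real code.
ALLOWED_EDGES = {
    "CONTAINS": {("Namespace", "Project")},
    "HAS_ISSUE": {("Project", "Issue")},
    "HAS_MERGE_REQUEST": {("Project", "MergeRequest")},
    "HAS_PIPELINE": {("Project", "Pipeline")},
    "HAS_VULNERABILITY": {("Project", "Vulnerability")},
    "HAS_BRANCH": {("Project", "Branch")},
    "CREATED": {("User", "Issue"), ("User", "MergeRequest"), ("User", "Epic")},
    "COMMENTS_ON": {("User", "Issue"), ("User", "MergeRequest"), ("User", "Epic")},
    "IS_COMMENT_ON": {("Note", "Issue"), ("Note", "MergeRequest"), ("Note", "Epic")},
    "MERGES_TO": {("MergeRequest", "Branch")},
    "RELATED_TO": {("MergeRequest", "Issue")},
    "TRIGGERED_FOR": {("Pipeline", "MergeRequest"), ("Pipeline", "Branch")},
}


def is_valid_edge(kind: str, from_type: str, to_type: str) -> bool:
    """Return True if the relationship may connect the two node types."""
    return (from_type, to_type) in ALLOWED_EDGES.get(kind, set())


print(is_valid_edge("MERGES_TO", "MergeRequest", "Branch"))  # allowed pair
print(is_valid_edge("MERGES_TO", "Issue", "Branch"))         # wrong source type
```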

		---

		## The Code Graph Data Model

		The Code Graph represents the structure and relationships within the source code of a repository. It models the file system hierarchy, code definitions, and the call graph.

		### Node Types

		| Node Type | Description | Key Properties |
		| --- | --- | --- |
		| `Directory` | Represents a directory within a repository. | `relative_path`, `absolute_path`, `repository_name` |
		| `File` | Represents a file within a repository. | `relative_path`, `absolute_path`, `language`, `repository_name` |
		| `Definition` | Represents a code definition (e.g., class, function, method, module). | `fully_qualified_name`, `display_name`, `definition_type`, `file_path` |
		| `ImportedSymbol` | Represents an imported symbol or module within a file. | `symbol_name`, `source_module`, `file_path` |

		### Relationship Visualization

		```mermaid
		graph TD
		Directory -- DIR_CONTAINS_DIR --> Directory
		Directory -- DIR_CONTAINS_FILE --> File
		File -- FILE_DEFINES --> Definition
		File -- FILE_IMPORTS --> ImportedSymbol
		Definition -- DEFINITION_TO_DEFINITION --> Definition
		Definition -- DEFINES_IMPORTED_SYMBOL --> ImportedSymbol
		ImportedSymbol -- IMPORTED_SYMBOL_TO_DEFINITION --> Definition
		```

		### Relationship Types

		| Relationship | From Node | To Node | Description |
		| --- | --- | --- | --- |
		| `DIR_CONTAINS_DIR` | `Directory` | `Directory` | A directory contains another directory. |
		| `DIR_CONTAINS_FILE` | `Directory` | `File` | A directory contains a file. |
		| `FILE_DEFINES` | `File` | `Definition` | A file contains a code definition. |
		| `FILE_IMPORTS` | `File` | `ImportedSymbol` | A file imports a symbol. |
		| `DEFINITION_TO_DEFINITION` | `Definition` | `Definition` | Represents a dependency between two definitions, such as a call graph edge (a function calls another function) or inheritance (a class inherits from another). |
		| `DEFINES_IMPORTED_SYMBOL` | `Definition` | `ImportedSymbol` | A definition (e.g., an exported function) is the source of an imported symbol. |
		| `IMPORTED_SYMBOL_TO_DEFINITION` | `ImportedSymbol` | `Definition` | An imported symbol resolves to a specific definition. |
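To illustrate what the `DEFINITION_TO_DEFINITION` edges make possible, the sketch below walks them breadth-first to find every definition reachable from a starting point. The edge tuples, the definition names, and the `reachable_definitions` function are made up for this example; a real traversal would run as a query against the graph tables.

```python
from collections import deque

# (source definition, relationship, target definition); sample data only.
EDGES = [
    ("app::main", "DEFINITION_TO_DEFINITION", "app::run"),
    ("app::run", "DEFINITION_TO_DEFINITION", "auth::login"),
    ("auth::login", "DEFINITION_TO_DEFINITION", "auth::hash_password"),
    ("app::run", "DEFINITION_TO_DEFINITION", "app::main"),  # cycle back to main
]


def reachable_definitions(start, edges):
    """Return all definitions reachable from `start`, excluding `start` itself."""
    # Build an adjacency list keyed by source definition.
    adjacency = {}
    for source, kind, target in edges:
        if kind == "DEFINITION_TO_DEFINITION":
            adjacency.setdefault(source, []).append(target)

    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, []):
            # The `seen` set keeps cycles from looping forever.
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


print(sorted(reachable_definitions("app::run", EDGES)))
```

The adjacency list built here is the in-memory analogue of an edge table indexed by source node.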

		---

		## Cross-Graph Relationships

		Much of the Knowledge Graph's power comes from linking the SDLC and Code graphs. For the first iteration, we intend to keep the two graphs separate to keep the complexity of the engineering effort manageable. In future iterations, we will be able to link them to create a unified graph.

		Here is an example of how we can link the two graphs together:

		| Relationship | From Node | To Node | Description |
		| --- | --- | --- | --- |
		| `HAS_FILE` | `Project` | `File` | A project (from the Namespace Graph) contains a file (from the Code Graph). |
		| `HAS_DIRECTORY` | `Project` | `Directory` | A project (from the Namespace Graph) contains a directory (from the Code Graph). |

		These links allow for deep queries, like "Find all merge requests (`SDLC`) that touch files (`Code`) containing a specific function definition (`Code`)."
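A cross-graph query of this kind can be sketched as a two-hop traversal. The example below uses only relationships listed in this document (`HAS_FILE` and `FILE_DEFINES`) to find the projects that contain a given definition; extending it to merge requests would need an edge from merge requests to files, which this data model does not define yet. All identifiers and the `projects_defining` helper are hypothetical.

```python
# (from node, relationship, to node); identifiers are invented for the example.
EDGES = [
    ("project:1", "HAS_FILE", "file:a.rb"),
    ("project:1", "HAS_FILE", "file:b.rb"),
    ("project:2", "HAS_FILE", "file:c.rb"),
    ("file:a.rb", "FILE_DEFINES", "Auth.login"),
    ("file:c.rb", "FILE_DEFINES", "Auth.login"),
    ("file:b.rb", "FILE_DEFINES", "Billing.charge"),
]


def projects_defining(fully_qualified_name, edges):
    """Return the sorted projects that have a file defining the given name."""
    # Hop 1 (Code Graph): files that define the requested definition.
    files = {
        source
        for source, kind, target in edges
        if kind == "FILE_DEFINES" and target == fully_qualified_name
    }
    # Hop 2 (cross-graph link): projects that contain any of those files.
    projects = {
        source
        for source, kind, target in edges
        if kind == "HAS_FILE" and target in files
    }
    return sorted(projects)


print(projects_defining("Auth.login", EDGES))
```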

content/handbook/engineering/architecture/design-documents/gitlab_knowledge_graph/decisions/001_dedicated_processes.md

deleted 100644 → 0

+0 −84

		---
		title: "ADR-001: FFI vs Dedicated Process Integration"
		owning-stage: "~devops::create"
		toc_hide: true
		---

		## Context

		The initial implementation used a Foreign Function Interface (FFI) to integrate the
		Rust-based Knowledge Graph indexing functionality into the Go-based `gitlab-zoekt-indexer`
		service. This approach was chosen to simplify deployment within Omnibus and reduce
		operational complexity. Knowledge Graph querying was planned to be exposed
		directly from the `gitlab-zoekt` service (not through FFI).

		However, after committing to the [segmentation strategy](../../selfmanaged_segmentation/_index.md),
		the Omnibus constraints no longer apply to the Knowledge Graph service.

		## Problem

		The FFI approach presented several challenges:

		1. Querying Complexity: Knowledge Graph querying now requires additional
		query building and result type mapping, so FFI would also be needed
		for query pre-processing and post-processing.
		2. Separation of Concerns: The Knowledge Graph's increasing complexity for both
		indexing and querying will make it difficult to use from zoekt-indexer.
		3. Independent Upgrades: The FFI model requires two upgrade maintenance points
		for any API change: the Knowledge Graph repo and the Zoekt indexer repo.
		4. Observability: FFI makes it difficult to monitor and debug the Knowledge Graph
		components independently.
		5. Safety Concerns: Unsafe FFI code poses potential security and stability risks.
		6. Bindings Library Dependency Issues: Because the Knowledge Graph static
		library is quite large, distributing it as part of a Go module caused issues.
		We also hit an issue with the go-kuzu module, which uses a dynamically
		linked library that is then missing from the zoekt-indexer binary
		([issue 100](https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/issues/100));
		fixing this would require building a static library for go-kuzu.

		## Decision

		Move away from FFI-based integration and adopt a dedicated process model where:

		- Knowledge Graph functionality is encapsulated within its own long-running processes in a separate stateful set
		- Services are deployed as "sidecar" containers within the same Kubernetes pod as the Zoekt container
		- The existing `gitlab-zoekt-indexer` service acts as a proxy that handles
		authentication, processes incoming requests, fetches the repository using Gitaly,
		and forwards requests to the Knowledge Graph Service
		- Knowledge Graph indexing and querying requests are forwarded to the dedicated GKG
		service

		## Consequences

		### Benefits

		- Fault Isolation: KG crashes don't affect the Go service
		- Independent build lifecycle: Building a new version of KG won't require building and releasing a new `gitlab-zoekt` binary
		- Better Observability: Dedicated processes expose their own health, metrics, and logs
		- Conventional API semantics: HTTP/gRPC provides schema validation and versioning in a way that is common in the GitLab architecture, which may make it easier to spot and resolve compatibility issues
		- Testability: Black-box testing directly against KG services

		### Concerns

		- Unlike Zoekt code search, this feature will not be available in Omnibus and
		will require advanced components or use of CNG. Dependence on these components
		may delay the dedicated release as we work on tooling as part of
		https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/selfmanaged_segmentation/
		- Usage of a separate GKG service is a blocker for including Knowledge Graph in
		Omnibus in the future if needed
		- More complex deployment: instead of deploying a single Zoekt service, two
		separate services will be deployed
		- Additional complexity in inter-service communication
		- Need to build and maintain separate container images
		- Different programming language - GKG is written in Rust

		## Implementation

		This decision was made based on discussion in [issue #168](https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/168).

		## References

		- [Knowledge Graph Server Epic](https://gitlab.com/groups/gitlab-org/-/epics/17518)
		- [FFI vs Dedicated Process Discussion](https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/168)
		- [Segmentation Strategy](../../selfmanaged_segmentation/_index.md)
		- [Knowledge Graph First iteration](https://gitlab.com/groups/gitlab-org/-/epics/17514) - a top-level epic for all Knowledge Graph work

content/handbook/engineering/architecture/design-documents/gitlab_knowledge_graph/indexing/_index.md

deleted 100644 → 0

+0 −142

		---
		title: GitLab Knowledge Graph Indexing Architecture
		status: ongoing
		description: How the Knowledge Graph indexing service is designed to index SDLC metadata and code.
		creation-date: "2025-10-12"
		authors: ["@michaelangelo", "@michaelusa", "@jgdoyon1", "@bohdanpk", "@dgruzd", "@OmarQunsulGitlab"]
		coaches: [ "@ahegyi", "@shekharpatnaik", "@andrewn" ]
		dris: [ "@michaelangelo", "@mnohr" ]
		owning-stage: "~devops::create"
		participating-stages: ["~devops::analytics"]
		---

		## Overview

		The Knowledge Graph indexing architecture transforms GitLab's SDLC metadata and code repositories into a queryable property graph. The indexing service operates as a distributed ETL (Extract, Transform, Load) pipeline that leverages the Data Insights Platform to process both SDLC events and code changes.

		This document outlines the general architecture, shared patterns, and components across both indexing domains. For detailed implementation specifics, see:

		- [Code Indexing Architecture](code_indexing.md)
		- [SDLC Indexing Architecture](sdlc_indexing.md)

		## Architecture Goals

		The indexing architecture achieves the following:

		- Process GitLab SDLC metadata and code repositories into a unified property graph model
		- Scale horizontally across multiple indexer workers without overlapping work
		- Minimize operational load on production PostgreSQL databases
		- Share infrastructure and patterns between code and SDLC indexing to reduce operational complexity
		- Support incremental indexing for efficient updates

		## Shared Architecture Components

		Both code and SDLC indexing leverage the same foundational infrastructure from the Data Insights Platform and share patterns for distributed coordination, data storage, and query access.

		```mermaid
		flowchart TD
		%% === Shared Infrastructure ===
		subgraph SOURCES["GitLab Sources"]
		PG["PostgreSQL (SDLC)"]
		Gitaly["Gitaly (Code)"]
		end

		subgraph DIP["Data Insights Platform (Shared)"]
		SYP["Siphon (CDC)"]
		JS["NATS JetStream"]
		KV["NATS KV (Locks & State)"]
		CH_RAW["ClickHouse (Raw Data Lake)"]
		end

		subgraph INDEXERS["gkg-indexer workers"]
		direction TB
		SDLC_IDX["SDLC Indexer"]
		CODE_IDX["Code Indexer"]
		end

		subgraph STORAGE["Shared Graph Storage"]
		CH_GRAPH["ClickHouse (Graph Tables)"]
		end

		subgraph QUERY["gkg-webserver"]
		WEB["Web Server"]
		QE["Graph Query Engine"]
		end

		%% === Data Flow ===
		PG -- Logical Replication --> SYP
		SYP -- CDC Events --> JS
		JS -- Event Streams --> SDLC_IDX
		JS -- Push Events --> CODE_IDX
		CODE_IDX -- Git RPC --> Gitaly

		SDLC_IDX -- Queries --> CH_RAW
		SDLC_IDX -- Writes --> CH_GRAPH
		CODE_IDX -- Writes --> CH_GRAPH

		SDLC_IDX -.-> KV
		CODE_IDX -.-> KV

		WEB -- Queries --> CH_GRAPH
		WEB -- Status --> KV
		WEB --> QE

		%% === Styling ===
		classDef source fill:#f5f5f5,stroke:#9ca3af,color:#111
		classDef platform fill:#fff7ed,stroke:#fb923c,color:#7c2d12
		classDef indexer fill:#eef2ff,stroke:#4338ca,color:#111
		classDef storage fill:#ecfeff,stroke:#06b6d4,color:#0e7490
		classDef query fill:#f0fdf4,stroke:#16a34a,color:#065f46

		class PG,Gitaly source
		class SYP,JS,KV,CH_RAW platform
		class SDLC_IDX,CODE_IDX indexer
		class CH_GRAPH storage
		class WEB,QE query
		```

		### 1. Siphon (Change Data Capture)

		Role: Streams data from PostgreSQL to NATS without impacting production database performance.

		Shared Use:

		- SDLC indexing: Receives events for issues, merge requests, pipelines, projects, namespaces, and other SDLC entities
		- Code indexing: Receives `push_event_payloads` to trigger repository indexing

		Siphon uses PostgreSQL's logical replication to capture changes from the write-ahead log (WAL), publishing them as protobuf messages to NATS JetStream. This decouples the Knowledge Graph from the production database.

		### 2. NATS JetStream and NATS KV (Event Broker and Distributed Coordination)

		Role: Provides durable event streaming and distributed coordination.

		Shared Use:

		- Delivers and distributes needed CDC events (like `events` and `push_event_payloads`) to indexing workers via NATS JetStream subjects.
		- Distributes workload across multiple indexer replicas
		- Provides NATS KV for distributed locking and state management for both code and SDLC indexing.

		Both indexing pipelines subscribe to relevant NATS subjects and use the same NATS deployment for event distribution and coordination.
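The distributed locking role can be sketched with a create-if-absent operation: a worker claims a project only if no other worker holds its lock key. The `InMemoryKV` class below is an in-process stand-in written for this example, not the NATS client API, and the key format and `try_index` helper are assumptions. A production lock would also need an expiry so that a crashed worker cannot hold a key forever.

```python
import threading


class InMemoryKV:
    """In-process stand-in for a key-value store with atomic create-if-absent."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def create(self, key, value):
        """Store `value` only if `key` is absent. Return True on success."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key, owner):
        """Remove `key` only if it is held by `owner`. Return True on success."""
        with self._mutex:
            if self._data.get(key) == owner:
                del self._data[key]
                return True
            return False


def try_index(kv, worker_id, project_id, indexed):
    """Index a project only if this worker wins the lock for it."""
    lock_key = f"lock.project.{project_id}"  # hypothetical key format
    if not kv.create(lock_key, worker_id):
        return False  # another worker is already indexing this project
    try:
        indexed.append((worker_id, project_id))  # placeholder for real indexing
        return True
    finally:
        kv.delete(lock_key, worker_id)  # always release the lock


kv = InMemoryKV()
done = []
print(try_index(kv, "worker-a", 42, done))
```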

		### 3. ClickHouse (Data Lake and Graph Storage)

		Role: Acts as both the raw data lake and the final graph storage layer.

		Shared Use:

		- Raw Data Lake: SDLC indexers query raw CDC data from ClickHouse to build graph transformations
		- Graph Storage: Both indexers write to property graph tables (nodes and edges) in ClickHouse
		- Query Backend: The web server queries the same ClickHouse tables for both code and SDLC graphs

		We will leverage ClickHouse's columnar storage and merge tree engines to provide bulk inserts, background merging, and adjacency list optimizations for graph traversals.

		### 4. Shared Schema and Data Model

		Role: Property graph schema that both indexers write to.

		Shared Tables:

		- Node tables for entities (e.g., `projects`, `files`, `definitions`, `issues`, `merge_requests`)
		- Edge tables for relationships (e.g., `project_has_file`, `mr_closes_issue`, `definition_calls_definition`)

		We will use the same schema defined in `crates/database` for both code and SDLC indexing. This allows for linking between the two graphs (e.g., a `Project` node from the SDLC graph can be linked to a `File` node from the Code Graph).

content/handbook/engineering/architecture/design-documents/gitlab_knowledge_graph/indexing/code_indexing.md

deleted 100644 → 0

+0 −353

File deleted.

Preview size limit exceeded, changes collapsed.