Commit 41dd8da4 authored by Michael Angelo Rivera

docs: move GKG design documents to knowledge-graph repository

parent ca41956d
+37 −439

File changed.

Preview size limit exceeded, changes collapsed.

+0 −147
---
title: Knowledge Graph Data Model
description: An overview of the nodes and relationships in the GitLab Knowledge Graph, focusing on the Namespace and Code graphs.
---

## Overview

The GitLab Knowledge Graph is composed of two primary sub-graphs that share a common schema foundation: the **Namespace (SDLC) Graph** and the **Code Graph**. This document details the data model for each, covering the nodes and relationships that constitute them.

The data model is designed to be intuitive and to mirror the mental model that developers and users have of the GitLab platform. By representing entities as nodes and their interactions as relationships, we can perform complex queries that would be difficult or inefficient with a traditional relational database.

The data model follows a [Property Graph](https://neo4j.com/blog/knowledge-graph/rdf-vs-property-graphs-knowledge-graphs/) approach rather than an RDF approach, as GitLab data has strongly defined relationships between entities.

> We can enable custom node and relationship expansion in the future by following the Property Graph approach and building the correct schema management capabilities.

## Data Storage Location

The Knowledge Graph data will be stored in tables that are separate from the existing tables used for analytics.

- For GitLab.com, the Knowledge Graph tables will be stored in a dedicated ClickHouse instance.
- For dedicated or self-managed instances, the Knowledge Graph tables can be stored in a separate database or instance. This will be left to the discretion of the instance owner.

## Concepts to Know

- **Unified Schema**: Both the Code Graph and the SDLC Graph are built on the same foundational schema provided by `crates/database`. This allows for linking between the two graphs (e.g., a `Project` node from the SDLC graph can be linked to a `File` node from the Code Graph).
- **Entity as Node**: Every entity in the GitLab ecosystem (e.g., Project, Issue, File, Function Definition) is represented as a node.
- **Interaction as Edge**: Relationships between these entities (e.g., a User `COMMENTS_ON` an Issue, a `File` `CONTAINS` a `Definition`) are represented as directed edges.
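
As a minimal sketch of these concepts, assuming a simple in-memory store (the actual schema lives in `crates/database` and is not shown in this document), entities can be modeled as labeled nodes and interactions as directed, typed edges. The `Node`, `Edge`, and `PropertyGraph` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # entity type, e.g. "Project" or "File"
    key: str    # identifier unique within the label


@dataclass(frozen=True)
class Edge:
    rel: str    # relationship type, e.g. "HAS_FILE"
    src: Node
    dst: Node


@dataclass
class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""
    properties: dict = field(default_factory=dict)  # Node -> property dict
    edges: set = field(default_factory=set)

    def add_node(self, node, **props):
        self.properties.setdefault(node, {}).update(props)

    def add_edge(self, rel, src, dst):
        # Edges are directed: (src)-[rel]->(dst).
        self.edges.add(Edge(rel, src, dst))

    def neighbors(self, src, rel):
        # Keys of all nodes reachable from `src` over one `rel` edge.
        return sorted(e.dst.key for e in self.edges if e.src == src and e.rel == rel)


graph = PropertyGraph()
project = Node("Project", "gitlab-org/gitlab")
readme = Node("File", "README.md")
graph.add_node(project, name="gitlab")
graph.add_node(readme, language="markdown")
graph.add_edge("HAS_FILE", project, readme)
```

Because edges are directed, `graph.neighbors(project, "HAS_FILE")` finds the file, while the same call starting from `readme` finds nothing.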

---

## The Namespace Graph Data Model

The Namespace Graph represents the Software Development Life Cycle (SDLC) entities and their interactions within GitLab. It models how users, projects, issues, merge requests, and CI/CD components relate to one another.

### Example Node Types

| Node Type             | Description                                                                                             | Key Properties                                                              |
| --------------------- | ------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `Namespace`           | Represents a GitLab group or user namespace.                                                            | `id`, `name`, `full_path`, `type` (Group or User)                             |
| `Project`             | Represents a GitLab project/repository.                                                                 | `id`, `name`, `full_path`, `namespace_id`                                   |
| `Issue`               | Represents a GitLab issue.                                                                              | `id`, `iid`, `title`, `state`, `project_id`, `author_id`                      |
| `MergeRequest`        | Represents a GitLab merge request.                                                                      | `id`, `iid`, `title`, `state`, `source_branch`, `target_branch`, `project_id` |
| `Pipeline`            | Represents a CI/CD pipeline.                                                                            | `id`, `status`, `source`, `project_id`, `user_id`                             |
| `Vulnerability`       | Represents a security vulnerability finding.                                                            | `id`, `title`, `severity`, `state`, `project_id`                              |
| `User`                | Represents a GitLab user.                                                                               | `id`, `username`, `name`                                                    |
| `Note`                | Represents a comment on an issue, merge request, or epic.                                               | `id`, `body`, `author_id`, `project_id`                                     |
| `Epic`                | Represents a GitLab epic.                                                                               | `id`, `iid`, `title`, `state`, `group_id`, `author_id`                        |
| `Branch`              | Represents a Git branch.                                                                                | `name`, `project_id`                                                        |

### Relationship Visualization

```mermaid
graph TD
    Namespace -- CONTAINS --> Project
    Project -- HAS_ISSUE --> Issue
    Project -- HAS_MERGE_REQUEST --> MergeRequest
    Project -- HAS_PIPELINE --> Pipeline
    Project -- HAS_VULNERABILITY --> Vulnerability
    Project -- HAS_BRANCH --> Branch

    User -- CREATED --> Issue
    User -- CREATED --> MergeRequest
    User -- CREATED --> Epic
    User -- COMMENTS_ON --> Issue
    User -- COMMENTS_ON --> MergeRequest
    User -- COMMENTS_ON --> Epic
    Note -- IS_COMMENT_ON --> Issue
    Note -- IS_COMMENT_ON --> MergeRequest
    Note -- IS_COMMENT_ON --> Epic

    MergeRequest -- MERGES_TO --> Branch
    MergeRequest -- RELATED_TO --> Issue
    Pipeline -- TRIGGERED_FOR --> MergeRequest
    Pipeline -- TRIGGERED_FOR --> Branch
```

### Relationship Types

| Relationship                        | From Node      | To Node        | Description                                                                                             |
| ----------------------------------- | -------------- | -------------- | ------------------------------------------------------------------------------------------------------- |
| `CONTAINS`                          | `Namespace`    | `Project`      | A namespace contains a project.                                                                         |
| `HAS_ISSUE`                         | `Project`      | `Issue`        | A project has an issue.                                                                                 |
| `HAS_MERGE_REQUEST`                 | `Project`      | `MergeRequest` | A project has a merge request.                                                                          |
| `HAS_PIPELINE`                      | `Project`      | `Pipeline`     | A project has a CI/CD pipeline.                                                                         |
| `HAS_VULNERABILITY`                 | `Project`      | `Vulnerability`| A project has a vulnerability finding.                                                                  |
| `HAS_BRANCH`                        | `Project`      | `Branch`       | A project has a branch.                                                                                 |
| `CREATED`                           | `User`         | `Issue`, `MergeRequest`, `Epic` | A user created an entity.                                                                               |
| `COMMENTS_ON`                       | `User`         | `Issue`, `MergeRequest`, `Epic` | A user commented on an entity (via a `Note`).                                                           |
| `IS_COMMENT_ON`                     | `Note`         | `Issue`, `MergeRequest`, `Epic` | A note is a comment on a specific entity.                                                               |
| `MERGES_TO`                         | `MergeRequest` | `Branch`       | A merge request targets a specific branch for merging.                                                  |
| `RELATED_TO`                        | `MergeRequest` | `Issue`        | A merge request is related to or closes an issue.                                                       |
| `TRIGGERED_FOR`                     | `Pipeline`     | `MergeRequest`, `Branch` | A pipeline was triggered for a merge request or a branch push.                                          |
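
One way to make these relationship types enforceable is to treat them as a lookup of allowed source and target node types. The sketch below is an assumption about how such a check could look, not the project's schema code. The `NAMESPACE_SCHEMA` and `is_valid_edge` names are hypothetical, and the target types for `CREATED`, `COMMENTS_ON`, and `IS_COMMENT_ON` are taken from the diagram.

```python
# (relationship, source node type) -> allowed target node types.
NAMESPACE_SCHEMA = {
    ("CONTAINS", "Namespace"): {"Project"},
    ("HAS_ISSUE", "Project"): {"Issue"},
    ("HAS_MERGE_REQUEST", "Project"): {"MergeRequest"},
    ("HAS_PIPELINE", "Project"): {"Pipeline"},
    ("HAS_VULNERABILITY", "Project"): {"Vulnerability"},
    ("HAS_BRANCH", "Project"): {"Branch"},
    ("CREATED", "User"): {"Issue", "MergeRequest", "Epic"},
    ("COMMENTS_ON", "User"): {"Issue", "MergeRequest", "Epic"},
    ("IS_COMMENT_ON", "Note"): {"Issue", "MergeRequest", "Epic"},
    ("MERGES_TO", "MergeRequest"): {"Branch"},
    ("RELATED_TO", "MergeRequest"): {"Issue"},
    ("TRIGGERED_FOR", "Pipeline"): {"MergeRequest", "Branch"},
}


def is_valid_edge(rel, from_type, to_type):
    """Return True when the edge direction and node types match the schema."""
    return to_type in NAMESPACE_SCHEMA.get((rel, from_type), set())
```

Keying on the source type makes the direction explicit: `is_valid_edge("CONTAINS", "Namespace", "Project")` passes, while the reversed edge does not.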

---

## The Code Graph Data Model

The Code Graph represents the structure and relationships within the source code of a repository. It models the file system hierarchy, code definitions, and the call graph.

### Node Types

| Node Type             | Description                                                                                             | Key Properties                                                              |
| --------------------- | ------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `Directory`           | Represents a directory within a repository.                                                             | `relative_path`, `absolute_path`, `repository_name`                         |
| `File`                | Represents a file within a repository.                                                                  | `relative_path`, `absolute_path`, `language`, `repository_name`             |
| `Definition`          | Represents a code definition (e.g., class, function, method, module).                                   | `fully_qualified_name`, `display_name`, `definition_type`, `file_path`      |
| `ImportedSymbol`      | Represents an imported symbol or module within a file.                                                  | `symbol_name`, `source_module`, `file_path`                                 |

### Relationship Visualization

```mermaid
graph TD
    Directory -- DIR_CONTAINS_DIR --> Directory
    Directory -- DIR_CONTAINS_FILE --> File
    File -- FILE_DEFINES --> Definition
    File -- FILE_IMPORTS --> ImportedSymbol
    Definition -- DEFINITION_TO_DEFINITION --> Definition
    Definition -- DEFINES_IMPORTED_SYMBOL --> ImportedSymbol
    ImportedSymbol -- IMPORTED_SYMBOL_TO_DEFINITION --> Definition
```

### Relationship Types

| Relationship                        | From Node      | To Node        | Description                                                                                             |
| ----------------------------------- | -------------- | -------------- | ------------------------------------------------------------------------------------------------------- |
| `DIR_CONTAINS_DIR`                  | `Directory`    | `Directory`    | A directory contains another directory.                                                                 |
| `DIR_CONTAINS_FILE`                 | `Directory`    | `File`         | A directory contains a file.                                                                            |
| `FILE_DEFINES`                      | `File`         | `Definition`   | A file contains a code definition.                                                                      |
| `FILE_IMPORTS`                      | `File`         | `ImportedSymbol`| A file imports a symbol.                                                                                |
| `DEFINITION_TO_DEFINITION`          | `Definition`   | `Definition`   | Represents a relationship between two definitions, such as a call graph edge (a function calls another function) or inheritance (a class inherits from another). |
| `DEFINES_IMPORTED_SYMBOL`           | `Definition`   | `ImportedSymbol`| A definition (e.g., an exported function) is the source of an imported symbol.                          |
| `IMPORTED_SYMBOL_TO_DEFINITION`     | `ImportedSymbol`| `Definition`   | An imported symbol resolves to a specific definition.                                                   |
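
To illustrate how these edges compose, the following sketch resolves the definitions a file depends on by following `FILE_IMPORTS` and then `IMPORTED_SYMBOL_TO_DEFINITION`. The edge list, file paths, and `resolved_imports` helper are made-up sample data and names, assuming edges are available as simple `(relationship, source, target)` tuples.

```python
# Sample Code Graph edges as (relationship, source, target) tuples.
edges = [
    ("DIR_CONTAINS_FILE", "src", "src/app.py"),
    ("DIR_CONTAINS_FILE", "src", "src/util.py"),
    ("FILE_DEFINES", "src/util.py", "util.slugify"),
    ("FILE_IMPORTS", "src/app.py", "app.py::slugify"),
    ("IMPORTED_SYMBOL_TO_DEFINITION", "app.py::slugify", "util.slugify"),
]

# Index outgoing edges by (relationship, source) for constant-time lookups.
outgoing = {}
for rel, src, dst in edges:
    outgoing.setdefault((rel, src), []).append(dst)


def resolved_imports(file_path):
    """Follow FILE_IMPORTS, then IMPORTED_SYMBOL_TO_DEFINITION, from one file."""
    definitions = []
    for symbol in outgoing.get(("FILE_IMPORTS", file_path), []):
        definitions.extend(outgoing.get(("IMPORTED_SYMBOL_TO_DEFINITION", symbol), []))
    return definitions
```

A file without imports, such as `src/util.py` in this sample, resolves to an empty list.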

---

## Cross-Graph Relationships

The power of the Knowledge Graph comes from its ability to link the SDLC and Code graphs. In future iterations, we will be able to link the two graphs together to create a unified graph. For the first iteration, we intend to keep the two graphs separate to keep the complexity of the engineering effort manageable.

Here is an example of how we can link the two graphs together:

| Relationship                        | From Node      | To Node        | Description                                                                                             |
| ----------------------------------- | -------------- | -------------- | ------------------------------------------------------------------------------------------------------- |
| `HAS_FILE`                          | `Project`      | `File`         | A project (from the Namespace Graph) contains a file (from the Code Graph).                             |
| `HAS_DIRECTORY`                     | `Project`      | `Directory`    | A project (from the Namespace Graph) contains a directory (from the Code Graph).                        |

These links allow for deep queries, like "Find all merge requests (`SDLC`) that touch files (`Code`) containing a specific function definition (`Code`)."
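
As a hedged sketch of a cross-graph query, the example below combines `HAS_FILE` with `FILE_DEFINES` to find the projects whose files define a given symbol. It is narrower than the merge request query quoted above, because a relationship from merge requests to files is not defined in this document. The sample rows and the `projects_defining` helper are hypothetical.

```python
# HAS_FILE edges: (project, file). These cross from the Namespace Graph to the Code Graph.
has_file = [
    ("gitlab-org/gitlab", "app/models/user.rb"),
    ("gitlab-org/gitaly", "internal/git/repo.go"),
]

# FILE_DEFINES edges: (file, definition). These live entirely in the Code Graph.
file_defines = [
    ("app/models/user.rb", "User#admin?"),
    ("internal/git/repo.go", "Repo.Path"),
]


def projects_defining(definition_name):
    """Walk Definition <- File <- Project and return matching project paths."""
    files = {path for path, name in file_defines if name == definition_name}
    return sorted({project for project, path in has_file if path in files})
```

The traversal starts from the definition and walks backwards, which keeps the intermediate set of files small when the definition name is selective.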
+0 −84
---
title: "ADR-001: FFI vs Dedicated Process Integration"
owning-stage: "~devops::create"
toc_hide: true
---

## Context

The initial implementation used Foreign Function Interface (FFI) to integrate the
Rust-based Knowledge Graph indexing functionality within the Go-based `gitlab-zoekt-indexer`
service. This approach was chosen to simplify deployment within Omnibus and reduce
operational complexity. Querying the Knowledge Graph was planned to be exposed
directly from the `gitlab-zoekt` service (not through FFI).

However, after committing to the [segmentation strategy](../../selfmanaged_segmentation/_index.md),
the Omnibus constraints no longer apply to the Knowledge Graph service.

## Problem

The FFI approach presented several challenges:

1. **Querying Complexity**: Knowledge Graph querying now requires additional
   query building and result type mapping, which would also require FFI usage
   for query pre-processing and post-processing.
2. **Separation of Concerns**: The Knowledge Graph's increasing complexity for both
   indexing and querying will make it difficult to use from zoekt-indexer.
3. **Independent Upgrades**: The FFI model requires two upgrade maintenance points
   for any API change: both the Knowledge Graph repo and the Zoekt indexer repo.
4. **Observability**: FFI makes it difficult to monitor and debug the Knowledge Graph
   components independently.
5. **Safety Concerns**: Unsafe FFI code poses potential security and stability risks.
6. **Bindings Library Dependency Issues**: Because the Knowledge Graph static
   library is quite big, there were issues with distributing it as part of a Go
   module. We also hit an issue with the go-kuzu module, which uses a dynamically
   linked library that is then missing from the zoekt-indexer binary
   ([issue 100](https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/issues/100)).
   Resolving it would require building a static library for go-kuzu.

## Decision

Move away from FFI-based integration and adopt a dedicated process model where:

- Knowledge Graph functionality is encapsulated within its own long-running processes in a separate stateful set
- Services are deployed as "sidecar" containers within the same Kubernetes pod as the Zoekt container
- The existing `gitlab-zoekt-indexer` service acts as a proxy that handles
  authentication, processes incoming requests, fetches the repository using Gitaly,
  and forwards requests to the Knowledge Graph Service
- Knowledge Graph indexing and querying requests are forwarded to the dedicated GKG
  service
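
The proxy flow described above can be sketched as a single handler with injected collaborators. This is an illustrative assumption and not the `gitlab-zoekt-indexer` implementation: the `handle_index_request` function, the request fields, and the stubbed collaborators are all hypothetical, and the real service is written in Go.

```python
class Unauthorized(Exception):
    """Raised when the incoming request fails authentication."""


def handle_index_request(request, authenticate, fetch_repository, forward_to_gkg):
    """Hypothetical proxy flow: authenticate, fetch the repository, forward to GKG."""
    if not authenticate(request["token"]):
        raise Unauthorized("invalid token")
    # Stands in for fetching the repository through Gitaly.
    repo_path = fetch_repository(request["project_id"])
    # Stands in for an HTTP/gRPC call to the dedicated GKG service.
    return forward_to_gkg({"project_id": request["project_id"], "repo_path": repo_path})


# Stub collaborators so the flow can be exercised without any network access.
forwarded = []


def fake_gkg(payload):
    forwarded.append(payload)
    return {"status": "accepted"}


response = handle_index_request(
    {"token": "secret", "project_id": 42},
    authenticate=lambda token: token == "secret",
    fetch_repository=lambda project_id: f"/tmp/repos/{project_id}",
    forward_to_gkg=fake_gkg,
)
```

Passing the collaborators as arguments mirrors the black-box testability listed among the benefits: the GKG side can be replaced by a stub without changing the proxy logic.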

## Consequences

### Benefits

- Fault Isolation: KG crashes don't affect the Go service
- Independent build lifecycle: Building a new version of KG won't require building and releasing a new `gitlab-zoekt` binary
- Better Observability: Dedicated processes expose their own health, metrics, and logs
- Conventional API semantics: HTTP/gRPC provides schema validation and versioning in a way that is common in the GitLab architecture, which may make it easier to spot and resolve compatibility issues
- Testability: Black-box testing directly against KG services

### Concerns

- Unlike Zoekt code search, this feature will not be available in Omnibus and
  will require advanced components or use of CNG. Dependence on these components
  may delay the dedicated release as we work on tooling as part of
  https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/selfmanaged_segmentation/
- Usage of a separate GKG service is a blocker for including Knowledge Graph in
  Omnibus in the future, if needed
- More complex deployment: instead of deploying a single Zoekt service, two
  separate services will be deployed
- Additional complexity in inter-service communication
- Need to build and maintain separate container images
- Different programming language: GKG is written in Rust

## Implementation

This decision was made based on discussion in [issue #168](https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/168).

## References

- [Knowledge Graph Server Epic](https://gitlab.com/groups/gitlab-org/-/epics/17518)
- [FFI vs Dedicated Process Discussion](https://gitlab.com/gitlab-org/rust/knowledge-graph/-/issues/168)
- [Segmentation Strategy](../../selfmanaged_segmentation/_index.md)
- [Knowledge Graph First iteration](https://gitlab.com/groups/gitlab-org/-/epics/17514) - a top-level epic for all Knowledge Graph work
+0 −142
---
title: GitLab Knowledge Graph Indexing Architecture
status: ongoing
description: How the Knowledge Graph indexing service is designed to index SDLC metadata and code.
creation-date: "2025-10-12"
authors: ["@michaelangelo", "@michaelusa", "@jgdoyon1", "@bohdanpk", "@dgruzd", "@OmarQunsulGitlab"]
coaches: [ "@ahegyi", "@shekharpatnaik", "@andrewn" ]
dris: [ "@michaelangelo", "@mnohr" ]
owning-stage: "~devops::create"
participating-stages: ["~devops::analytics"]
---

## Overview

The Knowledge Graph indexing architecture transforms GitLab's SDLC metadata and code repositories into a queryable property graph. The indexing service operates as a distributed ETL (Extract, Transform, Load) pipeline that leverages the Data Insights Platform to process both SDLC events and code changes.

This document outlines the general architecture, shared patterns, and components across both indexing domains. For detailed implementation specifics, see:

- [Code Indexing Architecture](code_indexing.md)
- [SDLC Indexing Architecture](sdlc_indexing.md)

## Architecture Goals

The indexing architecture achieves the following:

- Process GitLab SDLC metadata and code repositories into a unified property graph model
- Scale horizontally across multiple indexer workers without overlapping work
- Minimize operational load on production PostgreSQL databases
- Share infrastructure and patterns between code and SDLC indexing to reduce operational complexity
- Support incremental indexing for efficient updates

## Shared Architecture Components

Both code and SDLC indexing leverage the same foundational infrastructure from the Data Insights Platform and share patterns for distributed coordination, data storage, and query access.

```mermaid
flowchart TD
  %% === Shared Infrastructure ===
  subgraph SOURCES["GitLab Sources"]
    PG["PostgreSQL (SDLC)"]
    Gitaly["Gitaly (Code)"]
  end

  subgraph DIP["Data Insights Platform (Shared)"]
    SYP["Siphon (CDC)"]
    JS["NATS JetStream"]
    KV["NATS KV (Locks & State)"]
    CH_RAW["ClickHouse (Raw Data Lake)"]
  end

  subgraph INDEXERS["gkg-indexer workers"]
    direction TB
    SDLC_IDX["SDLC Indexer"]
    CODE_IDX["Code Indexer"]
  end

  subgraph STORAGE["Shared Graph Storage"]
    CH_GRAPH["ClickHouse (Graph Tables)"]
  end

  subgraph QUERY["gkg-webserver"]
    WEB["Web Server"]
    QE["Graph Query Engine"]
  end

  %% === Data Flow ===
  PG -- Logical Replication --> SYP
  SYP -- CDC Events --> JS
  JS -- Event Streams --> SDLC_IDX
  JS -- Push Events --> CODE_IDX
  CODE_IDX -- Git RPC --> Gitaly

  SDLC_IDX -- Queries --> CH_RAW
  SDLC_IDX -- Writes --> CH_GRAPH
  CODE_IDX -- Writes --> CH_GRAPH

  SDLC_IDX -.-> KV
  CODE_IDX -.-> KV

  WEB -- Queries --> CH_GRAPH
  WEB -- Status --> KV
  WEB --> QE

  %% === Styling ===
  classDef source fill:#f5f5f5,stroke:#9ca3af,color:#111
  classDef platform fill:#fff7ed,stroke:#fb923c,color:#7c2d12
  classDef indexer fill:#eef2ff,stroke:#4338ca,color:#111
  classDef storage fill:#ecfeff,stroke:#06b6d4,color:#0e7490
  classDef query fill:#f0fdf4,stroke:#16a34a,color:#065f46

  class PG,Gitaly source
  class SYP,JS,KV,CH_RAW platform
  class SDLC_IDX,CODE_IDX indexer
  class CH_GRAPH storage
  class WEB,QE query
```

### 1. Siphon (Change Data Capture)

**Role**: Streams data from PostgreSQL to NATS without impacting production database performance.

**Shared Use**:

- SDLC indexing: Receives events for issues, merge requests, pipelines, projects, namespaces, and other SDLC entities
- Code indexing: Receives `push_event_payloads` to trigger repository indexing

Siphon uses PostgreSQL's logical replication to capture changes from the write-ahead log (WAL), publishing them as protobuf messages to NATS JetStream. This decouples the Knowledge Graph from the production database.

### 2. NATS JetStream and NATS KV (Event Broker and Distributed Coordination)

**Role**: Provides durable event streaming and distributed coordination.

**Shared Use**:

- Delivers the required CDC events (such as `events` and `push_event_payloads`) to indexing workers via NATS JetStream subjects
- Distributes workload across multiple indexer replicas
- Provides NATS KV for distributed locking and state management for both code and SDLC indexing

Both indexing pipelines subscribe to relevant NATS subjects and use the same NATS deployment for event distribution and coordination.
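
The locking pattern can be sketched as follows, assuming the key-value store offers an atomic create-if-absent operation. The `InMemoryKV` class is an in-process stand-in, not a NATS client, and the `lock.project.<id>` key layout and `try_index` helper are hypothetical.

```python
class InMemoryKV:
    """In-process stand-in for a key-value bucket with create-if-absent semantics."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        # Succeeds only when the key does not exist yet.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)


def try_index(kv, worker_id, project_id, indexed):
    """Index a project only if this worker wins the lock; always release it afterwards."""
    lock_key = f"lock.project.{project_id}"  # hypothetical key layout
    if not kv.create(lock_key, worker_id):
        return False  # another worker holds the lock, so skip the work
    try:
        indexed.append((worker_id, project_id))  # placeholder for the real indexing work
        return True
    finally:
        kv.delete(lock_key)
```

A production lock would also need an expiry so that a crashed worker cannot hold a key forever. That detail is left out of this sketch.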

### 3. ClickHouse (Data Lake and Graph Storage)

**Role**: Acts as both the raw data lake and the final graph storage layer.

**Shared Use**:

- **Raw Data Lake**: SDLC indexers query raw CDC data from ClickHouse to build graph transformations
- **Graph Storage**: Both indexers write to property graph tables (nodes and edges) in ClickHouse
- **Query Backend**: The web server queries the same ClickHouse tables for both code and SDLC graphs

We will leverage ClickHouse's columnar storage and MergeTree table engines to provide bulk inserts, background merging, and adjacency list optimizations for graph traversals.

### 4. Shared Schema and Data Model

**Role**: Property graph schema that both indexers write to.

**Shared Tables**:

- Node tables for entities (e.g., `projects`, `files`, `definitions`, `issues`, `merge_requests`)
- Edge tables for relationships (e.g., `project_has_file`, `mr_closes_issue`, `definition_calls_definition`)

We will use the same schema defined in `crates/database` for both code and SDLC indexing. This allows for linking between the two graphs (e.g., a `Project` node from the SDLC graph can be linked to a `File` node from the Code Graph).
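
As a rough sketch of how edge rows might be prepared for bulk inserts, the example below groups relationships by their edge table so that each table receives one batch. The `batch_by_table` helper and the `source_id` and `target_id` column names are assumptions. Only the table names come from the examples above.

```python
from collections import defaultdict


def batch_by_table(edge_rows):
    """Group (table, source_id, target_id) tuples into one batch per edge table."""
    batches = defaultdict(list)
    for table, source_id, target_id in edge_rows:
        # `source_id` and `target_id` are hypothetical column names.
        batches[table].append({"source_id": source_id, "target_id": target_id})
    return dict(batches)


rows = [
    ("project_has_file", 1, 10),
    ("project_has_file", 1, 11),
    ("mr_closes_issue", 5, 7),
]
batches = batch_by_table(rows)
```

Each key in `batches` then maps to the rows for a single insert statement against that table.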
+0 −353

File deleted.

Preview size limit exceeded, changes collapsed.
