
[experiment] [gkg] Knowledge Graph Service Hybrid Architecture Experiment

Problem to Solve

As part of the Knowledge Graph as a service discovery work, we need to understand what the component deployments for Knowledge Graph would look like in a real self-managed environment.

We need to build a full, end-to-end, working version of the Knowledge Graph, using the architecture described in draft: gitlab knowledge graph service design doc (gitlab-com/content-sites/handbook!16424) for both Code Indexing and SDLC Indexing.

Important Note: The Knowledge Graph will use the Data Insights Platform.

Experiment High-level Overview

We've created https://gitlab.com/michaelangeloio/gkg-cluster to deploy all the code changes, organised under branches in that repository.

The full end-to-end working Knowledge Graph is accessible at https://kg-api.gitlab-knowledge-graph.michaelangelo.io/

Watch the full presentation here:

Presentation

Much of this will be replaced by @WarheadsSE, @andrewn, and @fforster's work for OAK and Runway for Self-Managed.

Some of the GKG architecture, like Delta Lake and Kùzu, is subject to change.

See https://gitlab.com/michaelangeloio/gkg-cluster for the full cluster config and details.

Good to Know:

  • This environment intentionally mixes a GitLab VM (Omnibus install) with a cloud native GKE cluster in us-central1-b so we can evaluate production-like topologies without moving the full GitLab stack into Kubernetes.
  • The VM retains responsibility for GitLab Rails, Workhorse, Gitaly, and PostgreSQL, while the GKE cluster runs the Knowledge Graph services (Siphon CDC, NATS JetStream, indexer, web API, MCP endpoints, and a shared GitLab Runner).
  • Everything is wired together through VPC firewall rules, logical replication slots, and a shared Artifact Registry, letting us validate operations, cost, and performance before formal productisation (a firewall sketch follows this list).
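As a rough sketch, the VM-to-cluster wiring reduces to a single ingress rule on the VM's network. The rule name, network, and target tag below are hypothetical; the source range and ports match the ones listed under Data and Control Flow:

```shell
# Hypothetical rule name, network, and target tag; the source range and
# ports are the ones used by this experiment (Postgres 5432, Gitaly 8075).
gcloud compute firewall-rules create allow-gke-to-gitlab-vm \
  --network=default \
  --direction=INGRESS \
  --source-ranges=10.80.0.0/14 \
  --allow=tcp:5432,tcp:8075 \
  --target-tags=gitlab-vm
```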

Component Map

  • GitLab VM
    • PostgreSQL emits logical replication streams (siphon_main, siphon_ci publications) consumed by the cluster (see the SQL sketch after this list).
    • Gitaly serves repository RPCs; the VM hosts the git-data volume that the indexer reaches via HMAC-authenticated calls.
    • Rails + Workhorse expose REST/GraphQL, handle OAuth, and seed repository metadata for the indexer.
  • GKE Workloads (namespace gkg-system)
    • nats StatefulSet provides JetStream storage for CDC messages.
    • siphon-main and siphon-ci Deployments pump WAL changes and CI-specific tables into NATS streams.
    • gkg-indexer Deployment (single replica) converts Delta-parquet batches into Kùzu databases on a zonal PersistentDisk.
    • gkg-webserver Deployment (two replicas) mounts the PD read-only and serves HTTPS/MCP requests via the exposed ingress.
    • gitlab-runner Helm release offers a Kubernetes executor for pipelines tagged gke or group-runner.
    • Snapshot jobs (k8s/jobs/snapshot-trigger-main.yaml) trigger Siphon snapshot helpers to refill Delta tables on demand.
  • Shared Infrastructure
    • PersistentDisk gkg-delta-disk hosts /var/lib/gkg/delta as ReadWriteOnce for the indexer and ReadOnlyMany for web pods pinned to the same node.
    • Secrets are injected via scripts/apply-secrets.sh, sourcing gitignored plaintext files to populate database users, replication credentials, and Gitaly tokens.
    • HTTPS ingress relies on Google Managed Certificates, BackendConfig session affinity, and a dedicated FrontendConfig to honour MCP WebSocket requirements.
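For reference, the publication side of this map boils down to a few statements on the VM. This is a minimal sketch of what setup_siphon_source_db.sh sets up; the user name, password, and FOR ALL TABLES scope are assumptions, and logical replication additionally requires wal_level = logical on the Omnibus Postgres:

```shell
# Minimal sketch assuming Omnibus defaults; credentials and table scope
# are illustrative, not the actual setup_siphon_source_db.sh contents.
sudo gitlab-psql -d gitlabhq_production <<'SQL'
-- Dedicated replication user consumed by the Siphon deployments
CREATE USER siphon WITH REPLICATION LOGIN PASSWORD 'change-me';

-- Publications read by siphon-main and siphon-ci; the real script
-- scopes these to specific tables rather than FOR ALL TABLES
CREATE PUBLICATION siphon_main FOR ALL TABLES;
CREATE PUBLICATION siphon_ci FOR ALL TABLES;
SQL
```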
```mermaid
flowchart TD
    subgraph GitLab_VM["GitLab VM (Omnibus)"]
        PG[(PostgreSQL<br/>Logical Replication)]
        Rails[GitLab Rails & Workhorse]
        Gitaly[(Gitaly RPC)]
    end

    subgraph GKE["GKE gkg-system namespace"]
        SMain[siphon-main]
        SCI[siphon-ci]
        NATS[(NATS JetStream)]
        Indexer[gkg-indexer<br/>single writer]
        Web1[gkg-webserver pod A]
        Web2[gkg-webserver pod B]
        Runner[gitlab-runner]
        Snapshot[snapshot-trigger jobs]
        subgraph Storage["Persistent Disk gkg-delta-disk"]
            Delta[(Delta Lake & Kùzu files)]
        end
    end

    Artifact[Artifact Registry] -. images .-> SMain
    Artifact -. images .-> SCI
    Artifact -. images .-> Indexer
    Artifact -. images .-> Web1

    PG -->|publications + slots| SMain
    PG -->|CI publications| SCI
    SMain -->|CDC batches| NATS
    SCI -->|CI batches| NATS
    NATS -->|Delta writer jobs| Indexer
    Indexer -->|RW mount| Delta
    Delta -->|RO mount| Web1
    Delta -->|RO mount| Web2
    Web1 -->|HTTPS + MCP| Users
    Web2 -->|HTTPS + MCP| Users
    Rails -->|API + Git metadata| Indexer
    Gitaly -->|Repository RPC| Indexer
    Runner -->|Pipeline jobs| Rails
    Snapshot -->|manual snapshots| NATS
```

Data and Control Flow

  1. Replication setup – setup_siphon_source_db.sh provisions logical replication users and publications on the VM. Firewall rules allow GKE nodes (source range 10.80.0.0/14) to reach Postgres (5432) and Gitaly (8075).
  2. CDC ingestion – Siphon deployments read WAL segments, wrap them into Arrow batches, and publish to NATS streams partitioned by domain (main, CI); see the stream-inspection sketch after this list.
  3. Delta processing – The indexer consumes NATS messages, writes Delta Lake tables on the PD, and immediately imports the data into Kùzu graphs used by the webservers.
  4. Query serving – Web pods mount the PD read-only, expose REST, MCP, and health endpoints through the ingress, and rely on BackendConfig session affinity to keep long-lived MCP sessions sticky.
  5. Pipeline integration – The Kubernetes GitLab Runner executes CI jobs that interact with the Knowledge Graph APIs, validating that the hybrid stack supports GitLab-native workflows.
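To spot-check step 2 from inside the cluster, the NATS CLI can inspect the JetStream side directly. The service, stream, and subject names below are assumptions about this deployment:

```shell
# Service, stream, and subject names are assumptions for this sketch.
kubectl -n gkg-system port-forward svc/nats 4222:4222 &

# List CDC streams and check message counts / consumer lag
nats --server nats://localhost:4222 stream ls
nats --server nats://localhost:4222 stream info SIPHON_MAIN

# Tail raw CDC subjects to watch Siphon publish Arrow batches
nats --server nats://localhost:4222 sub 'siphon.>'
```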

Networking and Security

  • Ingress – k8s/overlays/test/ingress.yaml plus frontend-config.yaml enable HTTPS with Google-managed certificates and WebSocket upgrades. BackendConfig sets sessionAffinity: CLIENT_IP with a one-hour TTL to stabilise MCP conversations.
  • Northbound access – Users reach gkg-webserver via kg-api.gitlab-knowledge-graph.michaelangelo.io; health checks hit /health.
  • Southbound access – GKE workloads call back to the GitLab VM over private IP. Secrets inject the HMAC token used to authorise Gitaly RPCs (authorization: Bearer v2.<mac>.<issued>).
  • Secrets lifecycle – Plaintext credential files live only under secrets/ (gitignored). apply-secrets.sh runs kubectl create secret --dry-run to update Kubernetes without exposing values in commits.
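Condensed, the apply-secrets.sh pattern is the standard dry-run-then-apply idiom. The secret name below is illustrative, and the sketch assumes the gitignored file uses key=value lines:

```shell
# Secret name is illustrative; assumes key=value lines in the source file
kubectl -n gkg-system create secret generic gitlab-db-credentials \
  --from-env-file=secrets/gitlab-db-credentials.txt \
  --dry-run=client -o yaml | kubectl apply -f -
```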

Storage Strategy

  • Single-writer assumption keeps the zonal PD simple: the indexer is pinned to one replica, and web pods schedule onto the same node to reuse the attached disk (see the mount sketch after this list).
  • Delta directories (/var/lib/gkg/delta/<table>/) accumulate both live parquet files and _delta_log/ metadata, allowing time travel if needed.
  • Kùzu databases (namespace_graph_v*.kz, project-level .kuzu) sit beside Delta exports on the same disk so the web tier can serve both SDLC and code graphs with consistent latency.
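A minimal sketch of that read-only sharing, assuming a PVC named gkg-delta-disk, an app: gkg-indexer pod label, and a placeholder image (all hypothetical): the claim is mounted readOnly, and podAffinity pins web pods onto the indexer's node so the already-attached zonal PD can be reused:

```shell
# Dry-run sketch only; claim name, labels, and image are assumptions.
kubectl -n gkg-system apply --dry-run=client -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gkg-webserver-sketch
spec:
  replicas: 2
  selector:
    matchLabels: {app: gkg-webserver-sketch}
  template:
    metadata:
      labels: {app: gkg-webserver-sketch}
    spec:
      affinity:
        podAffinity:
          # Co-schedule with the single-writer indexer so the zonal PD
          # is already attached to this node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels: {app: gkg-indexer}
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: registry.example/gkg-webserver:latest  # placeholder image
          volumeMounts:
            - name: delta
              mountPath: /var/lib/gkg/delta
              readOnly: true
      volumes:
        - name: delta
          persistentVolumeClaim:
            claimName: gkg-delta-disk
            readOnly: true
EOF
```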

NOTE: This is an experimental hybrid architecture built for proof-of-concept purposes (time-boxed to one week). The single-writer, single-node storage pattern was chosen for demo simplicity. Production deployments will require high availability and horizontal scaling through a multi-node, multi-replica architecture, with workload orchestration likely managed via NATS KV.

Operational Workflows

  • Image updates – Build Knowledge Graph and Siphon images from the knowledge-graph-workspace root, tag with the source commit, push to Artifact Registry, and bump deployment manifests before rolling workloads (kubectl rollout restart).
  • Reindexing – Call the index_project MCP tool or reapply the siphon snapshot job to refresh Delta tables; the indexer reacts automatically because it tails the NATS stream.
  • Secret rotation – Update secrets/gitlab-db-credentials.txt and secrets/current-gitlab-tokens.txt, rerun apply-secrets.sh, then restart dependent deployments to reload environment variables.
  • Diagnostics – Use k9s or kubectl logs against gkg-indexer for ingestion errors, nats pod logs for stream health, and grpcurl tests with a freshly minted HMAC header to confirm Gitaly connectivity (sketched below).
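The grpcurl check can be reproduced in a few lines. The secret value and VM address are placeholders; the token layout follows the v2.<mac>.<issued> shape quoted under Networking and Security, with the MAC computed as an HMAC-SHA256 of the issue timestamp keyed with the shared Gitaly token:

```shell
# Secret value and address are placeholders; port 8075 matches the VM's Gitaly
GITALY_TOKEN='shared-secret'   # from the injected Kubernetes secret
ISSUED=$(date +%s)
MAC=$(printf '%s' "$ISSUED" | openssl dgst -sha256 -hmac "$GITALY_TOKEN" -hex | awk '{print $NF}')

grpcurl -plaintext \
  -H "authorization: Bearer v2.${MAC}.${ISSUED}" \
  <gitlab-vm-ip>:8075 \
  grpc.health.v1.Health/Check
```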

Lessons from the Hybrid Experiment

  • Maintaining GitLab on a VM while running Knowledge Graph on GKE is viable when firewall rules, replication slots, and secret rotation are scripted.
  • PersistentDisk sharing with a single writer proved enough for indexer throughput; read-only mounts scale web traffic horizontally.
  • Session affinity at the load balancer is mandatory for MCP clients—without it, multi-pod web deployments drop long-lived agent sessions.