
[experiment] [gkg] Knowledge Graph Service Hybrid Architecture Experiment

Problem to Solve

As part of the Knowledge Graph as a service discovery work, we need to understand what the component deployments for Knowledge Graph would look like in a real self-managed environment.

We need to build a full, end-to-end, working version of the Knowledge Graph, using the architecture described in draft: gitlab knowledge graph service design doc (gitlab-com/content-sites/handbook!16424) for both Code Indexing and SDLC Indexing.

Important Note: The Knowledge Graph will use the Data Insights Platform.

Experiment High-level Overview

We've created https://gitlab.com/michaelangeloio/gkg-cluster to deploy all the code changes, organised under branches in that repository.

The full end-to-end working Knowledge Graph is accessible at https://kg-api.gitlab-knowledge-graph.michaelangelo.io/

Watch the full presentation here:

Presentation

Much of this will be replaced by @WarheadsSE, @andrewn, and @fforster's work for OAK and Runway for Self-Managed.

Some of the GKG architecture, like Delta Lake and Kùzu, is subject to change.

See https://gitlab.com/michaelangeloio/gkg-cluster for the full cluster config and details.

Good to Know:

  • This environment intentionally mixes a GitLab VM (Omnibus install) with a cloud native GKE cluster in us-central1-b so we can evaluate production-like topologies without moving the full GitLab stack into Kubernetes.
  • The VM retains responsibility for GitLab Rails, Workhorse, Gitaly, and PostgreSQL, while the GKE cluster runs the Knowledge Graph services (Siphon CDC, NATS JetStream, indexer, web API, MCP endpoints, and a shared GitLab Runner).
  • Everything is wired together through VPC firewall rules, logical replication slots, and a shared Artifact Registry, letting us validate operations, cost, and performance before formal productisation (a firewall sketch follows this list).
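As a rough sketch, the VM-to-cluster wiring reduces to a single ingress rule on the VM's network. The rule name, network, and target tag below are hypothetical; the source range and ports match the ones listed under Data and Control Flow:

```shell
# Hypothetical rule name, network, and target tag; the source range and
# ports are the ones used by this experiment (Postgres 5432, Gitaly 8075).
gcloud compute firewall-rules create allow-gke-to-gitlab-vm \
  --network=default \
  --direction=INGRESS \
  --source-ranges=10.80.0.0/14 \
  --allow=tcp:5432,tcp:8075 \
  --target-tags=gitlab-vm
```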

Component Map

  • GitLab VM
    • PostgreSQL emits logical replication streams (siphon_main, siphon_ci publications) consumed by the cluster (see the SQL sketch after this list).
    • Gitaly serves repository RPCs; the VM hosts the git-data volume that the indexer reaches via HMAC-authenticated calls.
    • Rails + Workhorse expose REST/GraphQL, handle OAuth, and seed repository metadata for the indexer.
  • GKE Workloads (namespace gkg-system)
    • nats StatefulSet provides JetStream storage for CDC messages.
    • siphon-main and siphon-ci Deployments pump WAL changes and CI-specific tables into NATS streams.
    • gkg-indexer Deployment (single replica) converts Delta-parquet batches into Kùzu databases on a zonal PersistentDisk.
    • gkg-webserver Deployment (two replicas) mounts the PD read-only and serves HTTPS/MCP requests via the exposed ingress.
    • gitlab-runner Helm release offers a Kubernetes executor for pipelines tagged gke or group-runner.
    • Snapshot jobs (k8s/jobs/snapshot-trigger-main.yaml) trigger Siphon snapshot helpers to refill Delta tables on demand.
  • Shared Infrastructure
    • PersistentDisk gkg-delta-disk hosts /var/lib/gkg/delta as ReadWriteOnce for the indexer and ReadOnlyMany for web pods pinned to the same node.
    • Secrets are injected via scripts/apply-secrets.sh, sourcing gitignored plaintext files to populate database users, replication credentials, and Gitaly tokens.
    • HTTPS ingress relies on Google Managed Certificates, BackendConfig session affinity, and a dedicated FrontendConfig to honour MCP WebSocket requirements.
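For reference, the publication side of this map boils down to a few statements on the VM. This is a minimal sketch of what setup_siphon_source_db.sh sets up; the user name, password, and FOR ALL TABLES scope are assumptions, and logical replication additionally requires wal_level = logical on the Omnibus Postgres:

```shell
# Minimal sketch assuming Omnibus defaults; credentials and table scope
# are illustrative, not the actual setup_siphon_source_db.sh contents.
sudo gitlab-psql -d gitlabhq_production <<'SQL'
-- Dedicated replication user consumed by the Siphon deployments
CREATE USER siphon WITH REPLICATION LOGIN PASSWORD 'change-me';

-- Publications read by siphon-main and siphon-ci; the real script
-- scopes these to specific tables rather than FOR ALL TABLES
CREATE PUBLICATION siphon_main FOR ALL TABLES;
CREATE PUBLICATION siphon_ci FOR ALL TABLES;
SQL
```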
```mermaid
flowchart TD
    subgraph GitLab_VM["GitLab VM (Omnibus)"]
        PG[(PostgreSQL<br/>Logical Replication)]
        Rails[GitLab Rails & Workhorse]
        Gitaly[(Gitaly RPC)]
    end

    subgraph GKE["GKE gkg-system namespace"]
        SMain[siphon-main]
        SCI[siphon-ci]
        NATS[(NATS JetStream)]
        Indexer[gkg-indexer<br/>single writer]
        Web1[gkg-webserver pod A]
        Web2[gkg-webserver pod B]
        Runner[gitlab-runner]
        Snapshot[snapshot-trigger jobs]
        subgraph Storage["Persistent Disk gkg-delta-disk"]
            Delta[(Delta Lake & Kùzu files)]
        end
    end

    Artifact[Artifact Registry] -. images .-> SMain
    Artifact -. images .-> SCI
    Artifact -. images .-> Indexer
    Artifact -. images .-> Web1

    PG -->|publications + slots| SMain
    PG -->|CI publications| SCI
    SMain -->|CDC batches| NATS
    SCI -->|CI batches| NATS
    NATS -->|Delta writer jobs| Indexer
    Indexer -->|RW mount| Delta
    Delta -->|RO mount| Web1
    Delta -->|RO mount| Web2
    Web1 -->|HTTPS + MCP| Users
    Web2 -->|HTTPS + MCP| Users
    Rails -->|API + Git metadata| Indexer
    Gitaly -->|Repository RPC| Indexer
    Runner -->|Pipeline jobs| Rails
    Snapshot -->|manual snapshots| NATS
```

Data and Control Flow

  1. Replication setup – setup_siphon_source_db.sh provisions logical replication users and publications on the VM. Firewall rules allow GKE nodes (source range 10.80.0.0/14) to reach Postgres (5432) and Gitaly (8075).
  2. CDC ingestion – Siphon deployments read WAL segments, wrap them into Arrow batches, and publish to NATS streams partitioned by domain (main, CI); see the stream-inspection sketch after this list.
  3. Delta processing – The indexer consumes NATS messages, writes Delta Lake tables on the PD, and immediately imports the data into Kùzu graphs used by the webservers.
  4. Query serving – Web pods mount the PD read-only, expose REST, MCP, and health endpoints through the ingress, and rely on BackendConfig session affinity to keep long-lived MCP sessions sticky.
  5. Pipeline integration – The Kubernetes GitLab Runner executes CI jobs that interact with the Knowledge Graph APIs, validating that the hybrid stack supports GitLab-native workflows.
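To spot-check step 2 from inside the cluster, the NATS CLI can inspect the JetStream side directly. The service, stream, and subject names below are assumptions about this deployment:

```shell
# Service, stream, and subject names are assumptions for this sketch.
kubectl -n gkg-system port-forward svc/nats 4222:4222 &

# List CDC streams and check message counts / consumer lag
nats --server nats://localhost:4222 stream ls
nats --server nats://localhost:4222 stream info SIPHON_MAIN

# Tail raw CDC subjects to watch Siphon publish Arrow batches
nats --server nats://localhost:4222 sub 'siphon.>'
```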

Networking and Security

  • Ingress – k8s/overlays/test/ingress.yaml plus frontend-config.yaml enable HTTPS with Google-managed certificates and WebSocket upgrades. BackendConfig sets sessionAffinity: CLIENT_IP with a one-hour TTL to stabilise MCP conversations.
  • Northbound access – Users reach gkg-webserver via kg-api.gitlab-knowledge-graph.michaelangelo.io; health checks hit /health.
  • Southbound access – GKE workloads call back to the GitLab VM over private IP. Secrets inject the HMAC token used to authorise Gitaly RPCs (authorization: Bearer v2.<mac>.<issued>).
  • Secrets lifecycle – Plaintext credential files live only under secrets/ (gitignored). apply-secrets.sh runs kubectl create secret --dry-run to update Kubernetes without exposing values in commits.
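Condensed, the apply-secrets.sh pattern is the standard dry-run-then-apply idiom. The secret name below is illustrative, and the sketch assumes the gitignored file uses key=value lines:

```shell
# Secret name is illustrative; assumes key=value lines in the source file
kubectl -n gkg-system create secret generic gitlab-db-credentials \
  --from-env-file=secrets/gitlab-db-credentials.txt \
  --dry-run=client -o yaml | kubectl apply -f -
```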

Storage Strategy

  • Single-writer assumption keeps the zonal PD simple: the indexer is pinned to one replica, and web pods schedule onto the same node to reuse the attached disk (see the mount sketch after this list).
  • Delta directories (/var/lib/gkg/delta/<table>/) accumulate both live parquet files and _delta_log/ metadata, allowing time travel if needed.
  • Kùzu databases (namespace_graph_v*.kz, project-level .kuzu) sit beside Delta exports on the same disk so the web tier can serve both SDLC and code graphs with consistent latency.
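A minimal sketch of that read-only sharing, assuming a PVC named gkg-delta-disk, an app: gkg-indexer pod label, and a placeholder image (all hypothetical): the claim is mounted readOnly, and podAffinity pins web pods onto the indexer's node so the already-attached zonal PD can be reused:

```shell
# Dry-run sketch only; claim name, labels, and image are assumptions.
kubectl -n gkg-system apply --dry-run=client -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gkg-webserver-sketch
spec:
  replicas: 2
  selector:
    matchLabels: {app: gkg-webserver-sketch}
  template:
    metadata:
      labels: {app: gkg-webserver-sketch}
    spec:
      affinity:
        podAffinity:
          # Co-schedule with the single-writer indexer so the zonal PD
          # is already attached to this node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels: {app: gkg-indexer}
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: registry.example/gkg-webserver:latest  # placeholder image
          volumeMounts:
            - name: delta
              mountPath: /var/lib/gkg/delta
              readOnly: true
      volumes:
        - name: delta
          persistentVolumeClaim:
            claimName: gkg-delta-disk
            readOnly: true
EOF
```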

NOTE: This is an experimental hybrid architecture built for proof-of-concept purposes (time-boxed to one week). The single-writer, single-node storage pattern was chosen for demo simplicity. Production deployments will require high availability and horizontal scaling through a multi-node, multi-replica architecture, with workload orchestration likely managed via NATS KV.

Operational Workflows

  • Image updates – Build Knowledge Graph and Siphon images from the knowledge-graph-workspace root, tag with the source commit, push to Artifact Registry, and bump deployment manifests before rolling workloads (kubectl rollout restart).
  • Reindexing – Call the index_project MCP tool or reapply the siphon snapshot job to refresh Delta tables; the indexer reacts automatically because it tails the NATS stream.
  • Secret rotation – Update secrets/gitlab-db-credentials.txt and secrets/current-gitlab-tokens.txt, rerun apply-secrets.sh, then restart dependent deployments to reload environment variables.
  • Diagnostics – Use k9s or kubectl logs against gkg-indexer for ingestion errors, nats pod logs for stream health, and grpcurl tests with a freshly minted HMAC header to confirm Gitaly connectivity (sketched below).
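The grpcurl check can be reproduced in a few lines. The secret value and VM address are placeholders; the token layout follows the v2.<mac>.<issued> shape quoted under Networking and Security, with the MAC computed as an HMAC-SHA256 of the issue timestamp keyed with the shared Gitaly token:

```shell
# Secret value and address are placeholders; port 8075 matches the VM's Gitaly
GITALY_TOKEN='shared-secret'   # from the injected Kubernetes secret
ISSUED=$(date +%s)
MAC=$(printf '%s' "$ISSUED" | openssl dgst -sha256 -hmac "$GITALY_TOKEN" -hex | awk '{print $NF}')

grpcurl -plaintext \
  -H "authorization: Bearer v2.${MAC}.${ISSUED}" \
  <gitlab-vm-ip>:8075 \
  grpc.health.v1.Health/Check
```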

Lessons from the Hybrid Experiment

  • Maintaining GitLab on a VM while running Knowledge Graph on GKE is viable when firewall rules, replication slots, and secret rotation are scripted.
  • PersistentDisk sharing with a single writer proved enough for indexer throughput; read-only mounts scale web traffic horizontally.
  • Session affinity at the load balancer is mandatory for MCP clients—without it, multi-pod web deployments drop long-lived agent sessions.