[experiment] [gkg] Knowledge Graph Service Hybrid Architecture Experiment
Problem to Solve
As part of the Knowledge Graph as a service discovery work, we need to understand what the component deployments for Knowledge Graph would look like in a real self-managed environment.
We need to build a full, end-to-end, working version of the Knowledge Graph, using the architecture described in draft: gitlab knowledge graph service design doc (gitlab-com/content-sites/handbook!16424) for both Code Indexing and SDLC Indexing.
Important Note: The Knowledge Graph will use the Data Insights Platform.
Experiment High-level Overview
We've created https://gitlab.com/michaelangeloio/gkg-cluster to deploy all the code changes under the following branches:
- feat: Knowledge Graph as a Service (!379)
- gkg poc siphon (gitlab-org/analytics-section/siphon!242)
- feat: gkg poc (gitlab-org/gitlab!204096)
The full end-to-end working Knowledge Graph is accessible at https://kg-api.gitlab-knowledge-graph.michaelangelo.io/
Watch the full presentation here:
Much of this will be replaced by @WarheadsSE, @andrewn, and @fforster's work for OAK and Runway for Self-Managed.
Some of the GKG architecture, like delta lake and Kuzu, is subject to change.
❗ See https://gitlab.com/michaelangeloio/gkg-cluster for the full cluster config and details.
Good to Know:
- This environment intentionally mixes a GitLab VM (Omnibus install) with a cloud-native GKE cluster in `us-central1-b` so we can evaluate production-like topologies without moving the full GitLab stack into Kubernetes.
- The VM retains responsibility for GitLab Rails, Workhorse, Gitaly, and PostgreSQL, while the GKE cluster runs the Knowledge Graph services (Siphon CDC, NATS JetStream, indexer, web API, MCP endpoints, and a shared GitLab Runner).
- Everything is wired together through VPC firewall rules, logical replication slots, and a shared Artifact Registry, letting us validate operations, cost, and performance before formal productisation (see the firewall sketch below).
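For reference, the southbound connectivity above boils down to a VPC firewall rule allowing the GKE range to reach the VM. A minimal sketch, assuming the VM carries a `gitlab-vm` network tag and using the 10.80.0.0/14 CIDR and ports noted later in this doc (rule and network names are hypothetical):

```sh
# Allow GKE pods/nodes to reach Postgres (logical replication) and Gitaly on the VM.
# Rule name, network, and target tag are illustrative; the CIDR and ports come from this experiment.
gcloud compute firewall-rules create allow-gkg-to-gitlab-vm \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:5432,tcp:8075 \
  --source-ranges=10.80.0.0/14 \
  --target-tags=gitlab-vm
```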
Component Map
- GitLab VM
  - PostgreSQL emits logical replication streams (`siphon_main`, `siphon_ci` publications) consumed by the cluster.
  - Gitaly serves repository RPCs; the VM hosts the git-data volume that the indexer reaches via HMAC-authenticated calls.
  - Rails + Workhorse expose REST/GraphQL, handle OAuth, and seed repository metadata for the indexer.
- GKE Workloads (namespace `gkg-system`)
  - `nats` StatefulSet provides JetStream storage for CDC messages.
  - `siphon-main` and `siphon-ci` Deployments pump WAL changes and CI-specific tables into NATS streams.
  - `gkg-indexer` Deployment (single replica) converts Delta-parquet batches into Kùzu databases on a zonal PersistentDisk.
  - `gkg-webserver` Deployment (two replicas) mounts the PD read-only and serves HTTPS/MCP requests via the exposed ingress.
  - `gitlab-runner` Helm release offers a Kubernetes executor for pipelines tagged `gke` or `group-runner`.
  - Snapshot jobs (`k8s/jobs/snapshot-trigger-main.yaml`) trigger Siphon snapshot helpers to refill Delta tables on demand (see the sketch after this list).
- Shared Infrastructure
  - PersistentDisk `gkg-delta-disk` hosts `/var/lib/gkg/delta` as ReadWriteOnce for the indexer and ReadOnlyMany for web pods pinned to the same node.
  - Secrets are injected via `scripts/apply-secrets.sh`, sourcing gitignored plaintext files to populate database users, replication credentials, and Gitaly tokens.
  - HTTPS ingress relies on Google Managed Certificates, BackendConfig session affinity, and a dedicated FrontendConfig to honour MCP WebSocket requirements.
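As an example of refilling Delta tables on demand, something along these lines works against the manifest referenced above (the namespace and job name are assumed to match the file; Kubernetes Jobs are immutable, so the old Job is deleted before re-applying):

```sh
# Re-run the Siphon snapshot helper to refill the Delta tables.
kubectl -n gkg-system delete job snapshot-trigger-main --ignore-not-found
kubectl -n gkg-system apply -f k8s/jobs/snapshot-trigger-main.yaml

# Follow the helper's progress.
kubectl -n gkg-system logs -f job/snapshot-trigger-main
```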
```mermaid
flowchart TD
subgraph GitLab_VM["GitLab VM (Omnibus)"]
PG[(PostgreSQL<br/>Logical Replication)]
Rails[GitLab Rails & Workhorse]
Gitaly[(Gitaly RPC)]
end
subgraph GKE["GKE gkg-system namespace"]
SMain[siphon-main]
SCI[siphon-ci]
NATS[(NATS JetStream)]
Indexer[gkg-indexer<br/>single writer]
Web1[gkg-webserver pod A]
Web2[gkg-webserver pod B]
Runner[gitlab-runner]
Snapshot[snapshot-trigger jobs]
subgraph Storage["Persistent Disk gkg-delta-disk"]
Delta[(Delta Lake & Kùzu files)]
end
end
Artifact[Artifact Registry] -. images .-> SMain
Artifact -. images .-> SCI
Artifact -. images .-> Indexer
Artifact -. images .-> Web1
PG -->|publications + slots| SMain
PG -->|CI publications| SCI
SMain -->|CDC batches| NATS
SCI -->|CI batches| NATS
NATS -->|Delta writer jobs| Indexer
Indexer -->|RW mount| Delta
Delta -->|RO mount| Web1
Delta -->|RO mount| Web2
Web1 -->|HTTPS + MCP| Users
Web2 -->|HTTPS + MCP| Users
Rails -->|API + Git metadata| Indexer
Gitaly -->|Repository RPC| Indexer
Runner -->|Pipeline jobs| Rails
Snapshot -->|manual snapshots| NATS
```
Data and Control Flow
- Replication setup – `setup_siphon_source_db.sh` provisions logical replication users and publications on the VM (a minimal sketch follows this list). Firewall rules (10.80.0.0/14) allow GKE nodes to reach Postgres (5432) and Gitaly (8075).
- CDC ingestion – Siphon deployments read WAL segments, wrap them into Arrow batches, and publish to NATS streams partitioned by domain (main, CI).
- Delta processing – The indexer consumes NATS messages, writes Delta Lake tables on the PD, and immediately imports the data into Kùzu graphs used by the webservers.
- Query serving – Web pods mount the PD read-only, expose REST, MCP, and health endpoints through the ingress, and rely on BackendConfig session affinity to keep long-lived MCP sessions sticky.
- Pipeline integration – The Kubernetes GitLab Runner executes CI jobs that interact with the Knowledge Graph APIs, validating that the hybrid stack supports GitLab-native workflows.
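For context on what `setup_siphon_source_db.sh` has to put in place, the Postgres side looks roughly like this. This is a minimal sketch: the role name, table lists, and slot names are illustrative, and the actual script in the gkg-cluster repo is authoritative.

```sh
# On the GitLab VM, as a superuser on the gitlabhq_production database.
gitlab-psql -d gitlabhq_production <<'SQL'
-- Replication user for the Siphon deployments (credentials end up in the GKE secrets).
CREATE ROLE siphon WITH LOGIN REPLICATION PASSWORD 'REDACTED';

-- Publications consumed by siphon-main and siphon-ci (table lists illustrative).
CREATE PUBLICATION siphon_main FOR TABLE projects, namespaces, issues, merge_requests;
CREATE PUBLICATION siphon_ci FOR TABLE ci_pipelines, ci_builds;

-- Logical replication slots the deployments attach to (pgoutput plugin).
SELECT pg_create_logical_replication_slot('siphon_main', 'pgoutput');
SELECT pg_create_logical_replication_slot('siphon_ci', 'pgoutput');
SQL
```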
Networking and Security
- Ingress – `k8s/overlays/test/ingress.yaml` plus `frontend-config.yaml` enable HTTPS with Google-managed certificates and WebSocket upgrades. BackendConfig sets `sessionAffinity: CLIENT_IP` with a one-hour TTL to stabilise MCP conversations.
- Northbound access – Users reach `gkg-webserver` via kg-api.gitlab-knowledge-graph.michaelangelo.io; health checks hit `/health`.
- Southbound access – GKE workloads call back to the GitLab VM over private IP. Secrets inject the HMAC token used to authorise Gitaly RPCs (`authorization: Bearer v2.<mac>.<issued>`).
- Secrets lifecycle – Plaintext credential files live only under `secrets/` (gitignored). `apply-secrets.sh` runs `kubectl create secret --dry-run` to update Kubernetes without exposing values in commits (see the sketch after this list).
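The core of the `apply-secrets.sh` pattern is the standard dry-run piping idiom. A minimal sketch, assuming the credential file uses KEY=VALUE lines (the secret name is illustrative and the real script covers more credentials; `--from-file` works instead for opaque blobs):

```sh
# Render the Secret manifest locally from the gitignored plaintext file,
# then apply it so updates are idempotent and values never land in git.
kubectl -n gkg-system create secret generic gitlab-db-credentials \
  --from-env-file=secrets/gitlab-db-credentials.txt \
  --dry-run=client -o yaml | kubectl apply -f -
```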
Storage Strategy
- Single-writer assumption keeps the zonal PD simple: the indexer is pinned to one replica, and web pods schedule onto the same node to reuse the attached disk.
- Delta directories (`/var/lib/gkg/delta/<table>/`) accumulate both live parquet files and `_delta_log/` metadata, allowing time travel if needed.
- Kùzu databases (`namespace_graph_v*.kz`, project-level `.kuzu`) sit beside Delta exports on the same disk so the web tier can serve both SDLC and code graphs with consistent latency.
NOTE: This is an experimental hybrid architecture built for proof-of-concept purposes (time-boxed to one week). The single-writer, single-node storage pattern was chosen for demo simplicity. Production deployments will require high availability and horizontal scaling through a multi-node, multi-replica architecture, with workload orchestration likely managed via NATS KV.
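To make the single-writer layout concrete, the web tier needs two things: co-scheduling onto the node that already has the PD attached, and a read-only mount of the shared path. A trimmed-down sketch of that pattern in a Deployment (names, labels, image, and PVC name are hypothetical placeholders, not the actual manifests in the cluster repo):

```sh
# Illustrative only: a minimal gkg-webserver Deployment showing the co-scheduling
# and read-only mount pattern; the real manifests live in the gkg-cluster repo.
kubectl -n gkg-system apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gkg-webserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gkg-webserver
  template:
    metadata:
      labels:
        app: gkg-webserver
    spec:
      # Land on the same node as the single-writer indexer so the zonal PD
      # is already attached there and can be shared read-only.
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: gkg-indexer
      containers:
        - name: web
          image: REGISTRY/gkg-webserver:TAG   # placeholder image reference
          volumeMounts:
            - name: delta
              mountPath: /var/lib/gkg/delta
              readOnly: true                  # web tier never writes; the indexer holds the RW mount
      volumes:
        - name: delta
          persistentVolumeClaim:
            claimName: gkg-delta-disk         # assumed PVC name matching the PD
            readOnly: true
EOF
```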
Operational Workflows
- Image updates – Build Knowledge Graph and Siphon images from the `knowledge-graph-workspace` root, tag with the source commit, push to Artifact Registry, and bump deployment manifests before rolling workloads (`kubectl rollout restart`).
- Reindexing – Call the `index_project` MCP tool or reapply the `siphon` snapshot job to refresh Delta tables; the indexer reacts automatically because it tails the NATS stream.
- Secret rotation – Update `secrets/gitlab-db-credentials.txt` and `secrets/current-gitlab-tokens.txt`, rerun `apply-secrets.sh`, then restart dependent deployments to reload environment variables.
- Diagnostics – Use `k9s` or `kubectl logs` against `gkg-indexer` for ingestion errors, `nats` pod logs for stream health, and `grpcurl` tests with a freshly minted HMAC header to confirm Gitaly connectivity (see the sketch after this list).
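For the Gitaly connectivity check, the `v2.<mac>.<issued>` header can be minted with `openssl`, assuming Gitaly's standard v2 token scheme (HMAC-SHA256 over the issued-at Unix timestamp, keyed with the shared Gitaly token). The address below is a placeholder for the VM's private IP, and the call assumes gRPC reflection is available; pass the Gitaly protos explicitly if it is not.

```sh
# Mint a fresh Gitaly v2 auth header and probe the health endpoint from a GKE pod.
GITALY_TOKEN='REDACTED'                 # shared secret injected via the GKE secrets
GITALY_ADDR='10.128.0.10:8075'          # placeholder private IP of the GitLab VM

ISSUED=$(date +%s)
MAC=$(printf '%s' "$ISSUED" | openssl dgst -sha256 -hmac "$GITALY_TOKEN" | awk '{print $NF}')

grpcurl -plaintext \
  -H "authorization: Bearer v2.${MAC}.${ISSUED}" \
  "$GITALY_ADDR" grpc.health.v1.Health/Check
```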
Lessons from the Hybrid Experiment
- Maintaining GitLab on a VM while running Knowledge Graph on GKE is viable when firewall rules, replication slots, and secret rotation are scripted.
- PersistentDisk sharing with a single writer proved enough for indexer throughput; read-only mounts scale web traffic horizontally.
- Session affinity at the load balancer is mandatory for MCP clients—without it, multi-pod web deployments drop long-lived agent sessions.
