Write design document and architecture reference for the code indexer

Context

The Knowledge Graph Core Indexer epic (gitlab-org#17517) is closed. The indexer is working and deployed, but the design and architecture documents still describe the old system: the epic description still references KuzuDB, Parquet-based bulk imports to Kuzu, and client-side CLI/LSP integration plans that no longer apply.

We need a design document and architecture reference that matches what shipped.

What needs documenting

Code parser (crates/code-parser/)

The parser supports seven languages through three different parsing backends:

  • Ruby via ruby_prism
  • TypeScript/JavaScript via SWC
  • Python, Kotlin, Java, C#, Rust via tree-sitter (GenericParser)

Each language has its own analyzer that extracts definitions (classes, modules, methods, functions, constants), imported symbols, and references. The parser outputs structured DefinitionInfo, ImportedSymbolInfo, and ReferenceInfo types with language-specific metadata.
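
To make the output types concrete, here is a minimal sketch of what a generic definition record could look like. The field names and the DefinitionKind variants are assumptions for illustration, not the actual definitions in crates/code-parser/src/definitions.rs; the key idea is that the structure is generic over language-specific metadata supplied by each analyzer.

```rust
// Sketch of a parser output type; field names and variants are
// assumptions, not the real crate definitions.

/// Kinds of definitions the analyzers extract.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DefinitionKind {
    Class,
    Module,
    Method,
    Function,
    Constant,
}

/// A definition extracted from a source file, generic over
/// language-specific metadata (each analyzer supplies its own type).
#[derive(Debug, Clone)]
struct DefinitionInfo<M> {
    name: String,
    kind: DefinitionKind,
    /// 1-based line where the definition starts.
    line: usize,
    /// Language-specific metadata (e.g. Ruby visibility, Rust generics).
    metadata: M,
}

fn main() {
    // Hypothetical Ruby metadata: just the enclosing namespace.
    let def = DefinitionInfo {
        name: "Parser".to_string(),
        kind: DefinitionKind::Class,
        line: 3,
        metadata: "CodeIndexer".to_string(),
    };
    assert_eq!(def.kind, DefinitionKind::Class);
    println!("{} at line {}", def.name, def.line);
}
```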

The LanguageParser trait and Analyzer generic provide the shared interface. Language detection is based on file extensions, with an exclusion list (e.g., .min.js files are skipped).
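
Extension-based detection with an exclusion list could look roughly like the following. This is a hedged sketch: the extension mappings and the single `.min.js` exclusion mirror what the issue describes, but the real implementation lives in crates/code-parser/src/parser.rs and may differ.

```rust
use std::path::Path;

/// Languages supported by the parser (TypeScript/JavaScript share a
/// backend but are listed separately here for the extension mapping).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SupportedLanguage {
    Ruby,
    TypeScript,
    JavaScript,
    Python,
    Kotlin,
    Java,
    CSharp,
    Rust,
}

/// Map a file path to a language by extension, skipping excluded
/// patterns such as minified JavaScript bundles.
fn detect_language(path: &Path) -> Option<SupportedLanguage> {
    let name = path.file_name()?.to_str()?;
    if name.ends_with(".min.js") {
        return None; // exclusion list: minified bundles are skipped
    }
    match path.extension()?.to_str()? {
        "rb" => Some(SupportedLanguage::Ruby),
        "ts" | "tsx" => Some(SupportedLanguage::TypeScript),
        "js" | "jsx" => Some(SupportedLanguage::JavaScript),
        "py" => Some(SupportedLanguage::Python),
        "kt" | "kts" => Some(SupportedLanguage::Kotlin),
        "java" => Some(SupportedLanguage::Java),
        "cs" => Some(SupportedLanguage::CSharp),
        "rs" => Some(SupportedLanguage::Rust),
        _ => None, // unsupported extensions are ignored
    }
}

fn main() {
    assert_eq!(detect_language(Path::new("app/model.rb")), Some(SupportedLanguage::Ruby));
    assert_eq!(detect_language(Path::new("dist/app.min.js")), None);
    assert_eq!(detect_language(Path::new("README.md")), None);
}
```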

Key files:

  • crates/code-parser/src/parser.rs - language detection, ParserType enum, SupportedLanguage
  • crates/code-parser/src/analyzer.rs - generic Analyzer with DefinitionLookup and ImportLookup traits
  • crates/code-parser/src/definitions.rs - DefinitionInfo generic structure
  • crates/code-parser/src/imports.rs - ImportedSymbolInfo generic structure
  • crates/code-parser/src/references.rs - ReferenceInfo generic structure
  • crates/code-parser/src/ruby/, python/, typescript/, etc. - language-specific implementations

Code indexer (crates/code-indexer/)

The indexer runs an async streaming pipeline:

  1. File discovery via DirectoryFileSource using the ignore crate (respects .gitignore)
  2. Bounded async file reads (buffer_unordered, IO concurrency = max(worker_threads * 2, 8))
  3. CPU-bound parsing on Rayon thread pool via tokio_rayon::spawn, bounded by a semaphore (num_cpus::get())
  4. Analysis phase groups results by language and builds the graph
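
The two concurrency bounds above are worth spelling out, since they are easy to misread. A minimal sketch of the sizing logic, approximating num_cpus::get() with the standard library (the function names here are illustrative, not the real ones):

```rust
/// IO-side concurrency used for buffer_unordered, per the pipeline
/// description: max(worker_threads * 2, 8).
fn io_concurrency(worker_threads: usize) -> usize {
    std::cmp::max(worker_threads * 2, 8)
}

/// CPU-side parse permits for the semaphore gating tokio_rayon::spawn.
/// The real pipeline uses num_cpus::get(); the standard library's
/// available_parallelism() is a close stand-in.
fn parse_permits() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    assert_eq!(io_concurrency(2), 8); // floor of 8 on small machines
    assert_eq!(io_concurrency(8), 16); // otherwise scales with workers
    assert!(parse_permits() >= 1);
    println!("io={}, cpu={}", io_concurrency(8), parse_permits());
}
```

The asymmetry is deliberate: file reads are cheap and mostly waiting on disk, so the IO bound can exceed the core count, while parsing is CPU-bound and capped at one permit per core.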

The AnalysisService holds per-language analyzers (Ruby, Python, TypeScript, Kotlin, Java, C#, Rust) and produces GraphData containing directory nodes, file nodes, definition nodes, imported symbol nodes, and 100+ relationship types.
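
A simplified sketch of the GraphData shape may help orient readers of the design doc. The node collections match the kinds listed above, but the field types and the RelationshipType variant names are invented for illustration; the real definitions (with 100+ relationship variants) live in crates/code-indexer/src/analysis/types.rs and crates/code-indexer/src/graph.rs.

```rust
// Illustrative shapes only; variant and field names are assumptions.

#[derive(Debug, Clone, Copy, PartialEq)]
enum RelationshipType {
    DirectoryContainsFile,
    FileDefines,
    DefinitionReferences,
    // ... 100+ variants in the real enum
}

/// An edge between two graph nodes, identified by assigned IDs.
#[derive(Debug)]
struct Relationship {
    from_id: u64,
    to_id: u64,
    kind: RelationshipType,
}

/// Output of the analysis phase: node collections plus edges.
#[derive(Debug, Default)]
struct GraphData {
    directory_nodes: Vec<String>,
    file_nodes: Vec<String>,
    definition_nodes: Vec<String>,
    imported_symbol_nodes: Vec<String>,
    relationships: Vec<Relationship>,
}

fn main() {
    let mut graph = GraphData::default();
    graph.file_nodes.push("app/models/user.rb".to_string());
    graph.definition_nodes.push("User".to_string());
    graph.relationships.push(Relationship {
        from_id: 0,
        to_id: 1,
        kind: RelationshipType::FileDefines,
    });
    assert_eq!(graph.relationships.len(), 1);
    println!("{} nodes, {} edges", graph.file_nodes.len(), graph.relationships.len());
}
```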

Key files:

  • crates/code-indexer/src/indexer.rs - RepositoryIndexer, IndexingConfig, streaming pipeline
  • crates/code-indexer/src/graph.rs - RelationshipType enum (100+ variants)
  • crates/code-indexer/src/loading/mod.rs - DirectoryFileSource, FileSource trait
  • crates/code-indexer/src/parsing/processor.rs - FileProcessor, per-file processing
  • crates/code-indexer/src/analysis/mod.rs - AnalysisService, graph building orchestration
  • crates/code-indexer/src/analysis/types.rs - GraphData, node types, ID assignment

Arrow adapter and storage layer

ArrowConverter transforms GraphData into Apache Arrow RecordBatch objects. These batches go to ClickHouse via ArrowClickHouseClient, which implements the pluggable Destination/BatchWriter trait pair.
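
The trait pair is the extension point worth documenting. Here is a hedged, synchronous sketch of the shape: the real traits in crates/etl-engine/src/destination.rs operate on Arrow RecordBatch values and are almost certainly async, but a batch is reduced to rows of strings here to keep the sketch self-contained. All names besides Destination and BatchWriter are invented.

```rust
// Synchronous stand-in for the pluggable storage traits; the real
// pair works on Arrow RecordBatch values.

type Batch = Vec<String>;

/// Writes individual batches to some backing store.
trait BatchWriter {
    fn write_batch(&mut self, batch: Batch) -> Result<(), String>;
}

/// A destination hands out writers, e.g. one per target table.
trait Destination {
    type Writer: BatchWriter;
    fn writer(&self, table: &str) -> Self::Writer;
}

/// In-memory destination standing in for ArrowClickHouseClient.
struct MemoryDestination;

struct MemoryWriter {
    table: String,
    rows_written: usize,
}

impl BatchWriter for MemoryWriter {
    fn write_batch(&mut self, batch: Batch) -> Result<(), String> {
        self.rows_written += batch.len();
        Ok(())
    }
}

impl Destination for MemoryDestination {
    type Writer = MemoryWriter;
    fn writer(&self, table: &str) -> MemoryWriter {
        MemoryWriter { table: table.to_string(), rows_written: 0 }
    }
}

fn main() {
    let dest = MemoryDestination;
    let mut w = dest.writer("definitions");
    w.write_batch(vec!["Foo".into(), "Bar".into()]).unwrap();
    assert_eq!(w.rows_written, 2);
    println!("wrote {} rows to {}", w.rows_written, w.table);
}
```

Because the indexer only depends on the traits, swapping ClickHouse for another store (or an in-memory destination in tests) requires no pipeline changes; this is the property the design doc should call out.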

Key files:

  • crates/gkg-server/src/indexer/modules/code/arrow_converter.rs - graph data to Arrow conversion
  • crates/clickhouse-client/src/arrow_client.rs - Arrow IPC serialization to ClickHouse
  • crates/etl-engine/src/destination.rs - pluggable storage traits

ETL engine and server integration

The gkg-server indexer uses NATS for message brokering and the ETL engine's module/handler system. Code indexing and SDLC indexing are separate modules registered with the engine.
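
The registration pattern can be sketched as follows. This is an assumption-laden simplification: the real Module/Handler traits dispatch NATS messages asynchronously, and every name below other than Module and Engine is invented for illustration.

```rust
// Simplified, synchronous sketch of module registration and dispatch;
// the real engine consumes NATS subjects asynchronously.

trait Module {
    fn name(&self) -> &'static str;
    /// Handle one message payload addressed to this module.
    fn handle(&self, payload: &str) -> String;
}

/// Stand-in for the code indexing module registered with the engine.
struct CodeIndexingModule;

impl Module for CodeIndexingModule {
    fn name(&self) -> &'static str {
        "code"
    }
    fn handle(&self, payload: &str) -> String {
        format!("indexing repo {payload}")
    }
}

/// Minimal engine: a registry of modules keyed by name.
#[derive(Default)]
struct Engine {
    modules: Vec<Box<dyn Module>>,
}

impl Engine {
    fn register(&mut self, module: Box<dyn Module>) {
        self.modules.push(module);
    }
    /// Route a payload to the module whose name matches.
    fn dispatch(&self, module_name: &str, payload: &str) -> Option<String> {
        self.modules
            .iter()
            .find(|m| m.name() == module_name)
            .map(|m| m.handle(payload))
    }
}

fn main() {
    let mut engine = Engine::default();
    engine.register(Box::new(CodeIndexingModule));
    let out = engine.dispatch("code", "gitlab-org/gitlab").unwrap();
    assert_eq!(out, "indexing repo gitlab-org/gitlab");
    println!("{out}");
}
```

Keeping code indexing and SDLC indexing as separately registered modules means each can evolve (or be disabled) independently, which the design doc should note when describing the server integration.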

Key files:

  • crates/gkg-server/src/indexer/mod.rs - indexer initialization, module registration
  • crates/etl-engine/src/engine.rs - Engine, EngineBuilder
  • crates/etl-engine/src/module.rs - Module, Handler traits

What the design doc should cover

  1. Updated architecture diagram reflecting the current pipeline (file discovery -> parse -> analyze -> Arrow -> ClickHouse), not the old Kuzu-based flow
  2. Parser architecture: the three backends, language support matrix, extraction types
  3. Indexer pipeline: streaming concurrency model, the semaphore/rayon approach, GraphData output
  4. Relationship type catalog: the 100+ relationship types and what they represent
  5. Storage layer: the Arrow adapter, pluggable Destination/BatchWriter traits, ClickHouse integration
  6. Server integration: NATS, ETL engine, module system
  7. How this differs from the original epic description (Kuzu is gone, ClickHouse is in, server-side architecture changed)

/cc @michaelusa @jgdoyon1 @bohdanpk

Edited Feb 08, 2026 by Michael Angelo Rivera