Write design document and architecture reference for the code indexer
## Context
The Knowledge Graph Core Indexer epic (gitlab-org#17517 (closed)) is closed. The indexer is working and deployed, but the design and architecture documents still describe the old system. The epic description still references KuzuDB, Parquet-based bulk imports to Kuzu, and client-side CLI/LSP integration plans that no longer apply.
We need a design document and architecture reference that matches what shipped.
## What needs documenting

### Code parser (`crates/code-parser/`)
The parser supports seven languages through three different parsing backends:
- Ruby via `ruby_prism`
- TypeScript/JavaScript via SWC
- Python, Kotlin, Java, C#, Rust via tree-sitter (`GenericParser`)
Each language has its own analyzer that extracts definitions (classes, modules, methods, functions, constants), imported symbols, and references. The parser outputs structured `DefinitionInfo`, `ImportedSymbolInfo`, and `ReferenceInfo` types with language-specific metadata.
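To make the output shape concrete, here is a hypothetical, much-reduced sketch of what a generic `DefinitionInfo` carrying language-specific metadata could look like. The type and kind names come from the issue; the fields (`name`, `fqn`, `metadata`, `RubyMetadata`, `visibility`) are illustrative assumptions, not the real definitions in `crates/code-parser/src/definitions.rs`:

```rust
// Hypothetical sketch only; the real types live in
// crates/code-parser/src/definitions.rs and friends.

#[derive(Debug, Clone, Copy, PartialEq)]
enum DefinitionKind {
    Class,
    Module,
    Method,
    Function,
    Constant,
}

// Generic over M, the language-specific metadata attached to each definition.
#[derive(Debug, Clone)]
struct DefinitionInfo<M> {
    name: String,
    kind: DefinitionKind,
    // Fully qualified name, e.g. "Foo::Bar#baz" for a Ruby instance method.
    fqn: String,
    metadata: M,
}

// Example of language-specific metadata (hypothetical fields).
#[derive(Debug, Clone)]
struct RubyMetadata {
    visibility: String,
}

fn main() {
    let def = DefinitionInfo {
        name: "baz".to_string(),
        kind: DefinitionKind::Method,
        fqn: "Foo::Bar#baz".to_string(),
        metadata: RubyMetadata { visibility: "public".to_string() },
    };
    assert_eq!(def.kind, DefinitionKind::Method);
    println!("{} ({:?})", def.fqn, def.kind);
}
```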
The `LanguageParser` trait and `Analyzer` generic provide the shared interface. Language detection runs off file extensions with an exclusion list (e.g., `.min.js` files are skipped).
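A std-only sketch of extension-based detection with an exclusion list, assuming a plain extension-to-language match; the real logic (and the actual `SupportedLanguage` variants and extension mappings) lives in `crates/code-parser/src/parser.rs`:

```rust
// Hypothetical sketch; extension mappings here are assumptions.
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SupportedLanguage {
    Ruby,
    TypeScript,
    JavaScript,
    Python,
    Kotlin,
    Java,
    CSharp,
    Rust,
}

fn detect_language(path: &str) -> Option<SupportedLanguage> {
    // Exclusion list: minified bundles are skipped outright.
    if path.ends_with(".min.js") {
        return None;
    }
    match Path::new(path).extension()?.to_str()? {
        "rb" => Some(SupportedLanguage::Ruby),
        "ts" | "tsx" => Some(SupportedLanguage::TypeScript),
        "js" | "jsx" => Some(SupportedLanguage::JavaScript),
        "py" => Some(SupportedLanguage::Python),
        "kt" | "kts" => Some(SupportedLanguage::Kotlin),
        "java" => Some(SupportedLanguage::Java),
        "cs" => Some(SupportedLanguage::CSharp),
        "rs" => Some(SupportedLanguage::Rust),
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_language("app/models/user.rb"), Some(SupportedLanguage::Ruby));
    assert_eq!(detect_language("dist/app.min.js"), None);
    assert_eq!(detect_language("README.md"), None);
}
```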
Key files:
- `crates/code-parser/src/parser.rs` - language detection, `ParserType` enum, `SupportedLanguage`
- `crates/code-parser/src/analyzer.rs` - generic `Analyzer` with `DefinitionLookup` and `ImportLookup` traits
- `crates/code-parser/src/definitions.rs` - `DefinitionInfo` generic structure
- `crates/code-parser/src/imports.rs` - `ImportedSymbolInfo` generic structure
- `crates/code-parser/src/references.rs` - `ReferenceInfo` generic structure
- `crates/code-parser/src/ruby/`, `python/`, `typescript/`, etc. - language-specific implementations
### Code indexer (`crates/code-indexer/`)
The indexer runs an async streaming pipeline:
- File discovery via `DirectoryFileSource` using the `ignore` crate (respects `.gitignore`)
- Bounded async file reads (`buffer_unordered`, IO concurrency = `max(worker_threads * 2, 8)`)
- CPU-bound parsing on a Rayon thread pool via `tokio_rayon::spawn`, bounded by a semaphore (`num_cpus::get()`)
- Analysis phase groups results by language and builds the graph
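The sizing arithmetic behind those bounds can be sketched without the async machinery. This is a std-only illustration: the real pipeline uses tokio's `buffer_unordered` for IO and `tokio_rayon` plus a semaphore for CPU work, and it calls `num_cpus::get()`, for which `std::thread::available_parallelism` is used here as a stand-in:

```rust
// Hypothetical sketch of the pipeline's concurrency sizing only.
use std::thread;

// IO-side concurrency: max(worker_threads * 2, 8).
fn io_concurrency(worker_threads: usize) -> usize {
    (worker_threads * 2).max(8)
}

// CPU-side bound: one semaphore permit per logical CPU
// (num_cpus::get() upstream; std equivalent used here).
fn cpu_permits() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    assert_eq!(io_concurrency(2), 8);   // the floor of 8 applies
    assert_eq!(io_concurrency(16), 32); // scales with workers above the floor
    println!("cpu permits on this machine: {}", cpu_permits());
}
```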
The `AnalysisService` holds per-language analyzers (Ruby, Python, TypeScript, Kotlin, Java, C#, Rust) and produces `GraphData` containing directory nodes, file nodes, definition nodes, imported symbol nodes, and 100+ relationship types.
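A much-reduced sketch of that shape: node tables with sequential ID assignment plus typed relationships between node IDs. The names `GraphData` and `RelationshipType` come from the issue; the fields, the three sample variants, and the `(id, label)` node stand-in are assumptions (the real enum has over 100 variants, in `crates/code-indexer/src/graph.rs` and `analysis/types.rs`):

```rust
// Hypothetical sketch; not the real GraphData layout.

#[derive(Debug, Clone, Copy, PartialEq)]
enum RelationshipType {
    DirectoryContainsFile,
    FileDefines,
    FileImports,
    // ... the real enum has 100+ variants
}

#[derive(Debug)]
struct Relationship {
    from: u64,
    to: u64,
    kind: RelationshipType,
}

#[derive(Debug, Default)]
struct GraphData {
    next_id: u64,
    nodes: Vec<(u64, String)>, // (id, label) stand-in for the real node tables
    relationships: Vec<Relationship>,
}

impl GraphData {
    // Sequential ID assignment as the graph is built.
    fn add_node(&mut self, label: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.nodes.push((id, label.to_string()));
        id
    }

    fn relate(&mut self, from: u64, to: u64, kind: RelationshipType) {
        self.relationships.push(Relationship { from, to, kind });
    }
}

fn main() {
    let mut graph = GraphData::default();
    let dir = graph.add_node("src/");
    let file = graph.add_node("src/user.rb");
    graph.relate(dir, file, RelationshipType::DirectoryContainsFile);
    assert_eq!(graph.nodes.len(), 2);
    assert_eq!(graph.relationships[0].kind, RelationshipType::DirectoryContainsFile);
}
```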
Key files:
- `crates/code-indexer/src/indexer.rs` - `RepositoryIndexer`, `IndexingConfig`, streaming pipeline
- `crates/code-indexer/src/graph.rs` - `RelationshipType` enum (100+ variants)
- `crates/code-indexer/src/loading/mod.rs` - `DirectoryFileSource`, `FileSource` trait
- `crates/code-indexer/src/parsing/processor.rs` - `FileProcessor`, per-file processing
- `crates/code-indexer/src/analysis/mod.rs` - `AnalysisService`, graph building orchestration
- `crates/code-indexer/src/analysis/types.rs` - `GraphData`, node types, ID assignment
### Arrow adapter and storage layer
`ArrowConverter` transforms `GraphData` into Apache Arrow `RecordBatch` objects. These batches go to ClickHouse via `ArrowClickHouseClient`, which implements the pluggable `Destination`/`BatchWriter` trait pair.
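A std-only sketch of what a pluggable `Destination`/`BatchWriter` pair might look like, with an in-memory destination standing in for `ArrowClickHouseClient`. The trait names are from the issue; the method signatures, the `Batch` placeholder (the real traits carry Arrow `RecordBatch`es and are presumably async), and the `MemoryDestination` type are assumptions:

```rust
// Hypothetical sketch of crates/etl-engine/src/destination.rs, not its API.

type Batch = Vec<String>; // stand-in for an Arrow RecordBatch

trait BatchWriter {
    fn write(&mut self, batch: Batch) -> Result<(), String>;
    fn finish(self: Box<Self>) -> Result<usize, String>;
}

trait Destination {
    fn writer(&self) -> Box<dyn BatchWriter>;
}

// An in-memory destination standing in for the ClickHouse-backed one.
struct MemoryDestination;
struct MemoryWriter {
    rows: usize,
}

impl BatchWriter for MemoryWriter {
    fn write(&mut self, batch: Batch) -> Result<(), String> {
        self.rows += batch.len();
        Ok(())
    }
    fn finish(self: Box<Self>) -> Result<usize, String> {
        Ok(self.rows)
    }
}

impl Destination for MemoryDestination {
    fn writer(&self) -> Box<dyn BatchWriter> {
        Box::new(MemoryWriter { rows: 0 })
    }
}

fn main() {
    let dest = MemoryDestination;
    let mut w = dest.writer();
    w.write(vec!["node_a".into(), "node_b".into()]).unwrap();
    assert_eq!(w.finish().unwrap(), 2);
}
```

Keeping the writer behind a trait object is what lets ClickHouse be one destination among several rather than a hard dependency of the indexer.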
Key files:
- `crates/gkg-server/src/indexer/modules/code/arrow_converter.rs` - graph data to Arrow conversion
- `crates/clickhouse-client/src/arrow_client.rs` - Arrow IPC serialization to ClickHouse
- `crates/etl-engine/src/destination.rs` - pluggable storage traits
### ETL engine and server integration
The `gkg-server` indexer uses NATS for message brokering and the ETL engine's module/handler system. Code indexing and SDLC indexing are separate modules registered with the engine.
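A std-only sketch of a module/handler registration scheme of this kind. The `Module`, `Handler`, and `Engine` names come from the issue; the method signatures, subject-keyed dispatch, and the toy `CodeModule` are assumptions (the real engine dispatches NATS messages asynchronously):

```rust
// Hypothetical sketch of the etl-engine module system, not its API.
use std::collections::HashMap;

trait Handler {
    fn handle(&self, payload: &str) -> String;
}

trait Module {
    fn name(&self) -> &'static str;
    // (subject, handler) pairs to register with the engine.
    fn handlers(&self) -> Vec<(&'static str, Box<dyn Handler>)>;
}

struct Engine {
    routes: HashMap<&'static str, Box<dyn Handler>>,
}

impl Engine {
    fn new() -> Self {
        Engine { routes: HashMap::new() }
    }
    fn register(&mut self, module: &dyn Module) {
        println!("registering module: {}", module.name());
        for (subject, handler) in module.handlers() {
            self.routes.insert(subject, handler);
        }
    }
    fn dispatch(&self, subject: &str, payload: &str) -> Option<String> {
        self.routes.get(subject).map(|h| h.handle(payload))
    }
}

// A toy stand-in for the code-indexing module.
struct CodeModule;
struct IndexHandler;

impl Handler for IndexHandler {
    fn handle(&self, payload: &str) -> String {
        format!("indexing {payload}")
    }
}

impl Module for CodeModule {
    fn name(&self) -> &'static str {
        "code"
    }
    fn handlers(&self) -> Vec<(&'static str, Box<dyn Handler>)> {
        vec![("code.index", Box::new(IndexHandler))]
    }
}

fn main() {
    let mut engine = Engine::new();
    engine.register(&CodeModule);
    assert_eq!(engine.dispatch("code.index", "repo-1").as_deref(), Some("indexing repo-1"));
}
```

The same registration path is what keeps code indexing and SDLC indexing as peers: each is just another `Module` handed to the engine.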
Key files:
- `crates/gkg-server/src/indexer/mod.rs` - indexer initialization, module registration
- `crates/etl-engine/src/engine.rs` - `Engine`, `EngineBuilder`
- `crates/etl-engine/src/module.rs` - `Module`, `Handler` traits
## What the design doc should cover
- Updated architecture diagram reflecting the current pipeline (file discovery -> parse -> analyze -> Arrow -> ClickHouse), not the old Kuzu-based flow
- Parser architecture: the three backends, language support matrix, extraction types
- Indexer pipeline: streaming concurrency model, the semaphore/Rayon approach, `GraphData` output
- Relationship type catalog: the 100+ relationship types and what they represent
- Storage layer: the Arrow adapter, pluggable `Destination`/`BatchWriter` traits, ClickHouse integration
- Server integration: NATS, ETL engine, module system
- How this differs from the original epic description (Kuzu is gone, ClickHouse is in, server-side architecture changed)