Write design document and architecture reference for the code indexer

Context

The Knowledge Graph Core Indexer epic (gitlab-org#17517) is closed. The indexer is working and deployed, but the design and architecture documents still describe the old system: the epic description still references KuzuDB, Parquet-based bulk imports to Kuzu, and client-side CLI/LSP integration plans that no longer apply.

We need a design document and architecture reference that matches what shipped.

What needs documenting

Code parser (crates/code-parser/)

The parser supports seven languages through three different parsing backends:

  • Ruby via ruby_prism
  • TypeScript/JavaScript via SWC
  • Python, Kotlin, Java, C#, Rust via tree-sitter (GenericParser)

Each language has its own analyzer that extracts definitions (classes, modules, methods, functions, constants), imported symbols, and references. The parser outputs structured DefinitionInfo, ImportedSymbolInfo, and ReferenceInfo types with language-specific metadata.
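
To make the output types concrete, here is a minimal sketch of what a generic definition record could look like. The field names and the DefinitionKind variants are assumptions for illustration, not the actual definitions in crates/code-parser/src/definitions.rs; the key idea is that the structure is generic over language-specific metadata supplied by each analyzer.

```rust
// Sketch of a parser output type; field names and variants are
// assumptions, not the real crate definitions.

/// Kinds of definitions the analyzers extract.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DefinitionKind {
    Class,
    Module,
    Method,
    Function,
    Constant,
}

/// A definition extracted from a source file, generic over
/// language-specific metadata (each analyzer supplies its own type).
#[derive(Debug, Clone)]
struct DefinitionInfo<M> {
    name: String,
    kind: DefinitionKind,
    /// 1-based line where the definition starts.
    line: usize,
    /// Language-specific metadata (e.g. Ruby visibility, Rust generics).
    metadata: M,
}

fn main() {
    // Hypothetical Ruby metadata: just the enclosing namespace.
    let def = DefinitionInfo {
        name: "Parser".to_string(),
        kind: DefinitionKind::Class,
        line: 3,
        metadata: "CodeIndexer".to_string(),
    };
    assert_eq!(def.kind, DefinitionKind::Class);
    println!("{} at line {}", def.name, def.line);
}
```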

The LanguageParser trait and Analyzer generic provide the shared interface. Language detection is based on file extensions, with an exclusion list (e.g., .min.js files are skipped).
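
Extension-based detection with an exclusion list could look roughly like the following. This is a hedged sketch: the extension mappings and the single `.min.js` exclusion mirror what the issue describes, but the real implementation lives in crates/code-parser/src/parser.rs and may differ.

```rust
use std::path::Path;

/// Languages supported by the parser (TypeScript/JavaScript share a
/// backend but are listed separately here for the extension mapping).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SupportedLanguage {
    Ruby,
    TypeScript,
    JavaScript,
    Python,
    Kotlin,
    Java,
    CSharp,
    Rust,
}

/// Map a file path to a language by extension, skipping excluded
/// patterns such as minified JavaScript bundles.
fn detect_language(path: &Path) -> Option<SupportedLanguage> {
    let name = path.file_name()?.to_str()?;
    if name.ends_with(".min.js") {
        return None; // exclusion list: minified bundles are skipped
    }
    match path.extension()?.to_str()? {
        "rb" => Some(SupportedLanguage::Ruby),
        "ts" | "tsx" => Some(SupportedLanguage::TypeScript),
        "js" | "jsx" => Some(SupportedLanguage::JavaScript),
        "py" => Some(SupportedLanguage::Python),
        "kt" | "kts" => Some(SupportedLanguage::Kotlin),
        "java" => Some(SupportedLanguage::Java),
        "cs" => Some(SupportedLanguage::CSharp),
        "rs" => Some(SupportedLanguage::Rust),
        _ => None, // unsupported extensions are ignored
    }
}

fn main() {
    assert_eq!(detect_language(Path::new("app/model.rb")), Some(SupportedLanguage::Ruby));
    assert_eq!(detect_language(Path::new("dist/app.min.js")), None);
    assert_eq!(detect_language(Path::new("README.md")), None);
}
```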

Key files:

  • crates/code-parser/src/parser.rs - language detection, ParserType enum, SupportedLanguage
  • crates/code-parser/src/analyzer.rs - generic Analyzer with DefinitionLookup and ImportLookup traits
  • crates/code-parser/src/definitions.rs - DefinitionInfo generic structure
  • crates/code-parser/src/imports.rs - ImportedSymbolInfo generic structure
  • crates/code-parser/src/references.rs - ReferenceInfo generic structure
  • crates/code-parser/src/ruby/, python/, typescript/, etc. - language-specific implementations

Code indexer (crates/code-indexer/)

The indexer runs an async streaming pipeline:

  1. File discovery via DirectoryFileSource using the ignore crate (respects .gitignore)
  2. Bounded async file reads (buffer_unordered, IO concurrency = max(worker_threads * 2, 8))
  3. CPU-bound parsing on Rayon thread pool via tokio_rayon::spawn, bounded by a semaphore (num_cpus::get())
  4. Analysis phase groups results by language and builds the graph
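
The two concurrency bounds above are worth spelling out, since they are easy to misread. A minimal sketch of the sizing logic, approximating num_cpus::get() with the standard library (the function names here are illustrative, not the real ones):

```rust
/// IO-side concurrency used for buffer_unordered, per the pipeline
/// description: max(worker_threads * 2, 8).
fn io_concurrency(worker_threads: usize) -> usize {
    std::cmp::max(worker_threads * 2, 8)
}

/// CPU-side parse permits for the semaphore gating tokio_rayon::spawn.
/// The real pipeline uses num_cpus::get(); the standard library's
/// available_parallelism() is a close stand-in.
fn parse_permits() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    assert_eq!(io_concurrency(2), 8); // floor of 8 on small machines
    assert_eq!(io_concurrency(8), 16); // otherwise scales with workers
    assert!(parse_permits() >= 1);
    println!("io={}, cpu={}", io_concurrency(8), parse_permits());
}
```

The asymmetry is deliberate: file reads are cheap and mostly waiting on disk, so the IO bound can exceed the core count, while parsing is CPU-bound and capped at one permit per core.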

The AnalysisService holds per-language analyzers (Ruby, Python, TypeScript, Kotlin, Java, C#, Rust) and produces GraphData containing directory nodes, file nodes, definition nodes, imported symbol nodes, and 100+ relationship types.
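
A simplified sketch of the GraphData shape may help orient readers of the design doc. The node collections match the kinds listed above, but the field types and the RelationshipType variant names are invented for illustration; the real definitions (with 100+ relationship variants) live in crates/code-indexer/src/analysis/types.rs and crates/code-indexer/src/graph.rs.

```rust
// Illustrative shapes only; variant and field names are assumptions.

#[derive(Debug, Clone, Copy, PartialEq)]
enum RelationshipType {
    DirectoryContainsFile,
    FileDefines,
    DefinitionReferences,
    // ... 100+ variants in the real enum
}

/// An edge between two graph nodes, identified by assigned IDs.
#[derive(Debug)]
struct Relationship {
    from_id: u64,
    to_id: u64,
    kind: RelationshipType,
}

/// Output of the analysis phase: node collections plus edges.
#[derive(Debug, Default)]
struct GraphData {
    directory_nodes: Vec<String>,
    file_nodes: Vec<String>,
    definition_nodes: Vec<String>,
    imported_symbol_nodes: Vec<String>,
    relationships: Vec<Relationship>,
}

fn main() {
    let mut graph = GraphData::default();
    graph.file_nodes.push("app/models/user.rb".to_string());
    graph.definition_nodes.push("User".to_string());
    graph.relationships.push(Relationship {
        from_id: 0,
        to_id: 1,
        kind: RelationshipType::FileDefines,
    });
    assert_eq!(graph.relationships.len(), 1);
    println!("{} nodes, {} edges", graph.file_nodes.len(), graph.relationships.len());
}
```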

Key files:

  • crates/code-indexer/src/indexer.rs - RepositoryIndexer, IndexingConfig, streaming pipeline
  • crates/code-indexer/src/graph.rs - RelationshipType enum (100+ variants)
  • crates/code-indexer/src/loading/mod.rs - DirectoryFileSource, FileSource trait
  • crates/code-indexer/src/parsing/processor.rs - FileProcessor, per-file processing
  • crates/code-indexer/src/analysis/mod.rs - AnalysisService, graph building orchestration
  • crates/code-indexer/src/analysis/types.rs - GraphData, node types, ID assignment

Arrow adapter and storage layer

ArrowConverter transforms GraphData into Apache Arrow RecordBatch objects. These batches go to ClickHouse via ArrowClickHouseClient, which implements the pluggable Destination/BatchWriter trait pair.
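
The trait pair is the extension point worth documenting. Here is a hedged, synchronous sketch of the shape: the real traits in crates/etl-engine/src/destination.rs operate on Arrow RecordBatch values and are almost certainly async, but a batch is reduced to rows of strings here to keep the sketch self-contained. All names besides Destination and BatchWriter are invented.

```rust
// Synchronous stand-in for the pluggable storage traits; the real
// pair works on Arrow RecordBatch values.

type Batch = Vec<String>;

/// Writes individual batches to some backing store.
trait BatchWriter {
    fn write_batch(&mut self, batch: Batch) -> Result<(), String>;
}

/// A destination hands out writers, e.g. one per target table.
trait Destination {
    type Writer: BatchWriter;
    fn writer(&self, table: &str) -> Self::Writer;
}

/// In-memory destination standing in for ArrowClickHouseClient.
struct MemoryDestination;

struct MemoryWriter {
    table: String,
    rows_written: usize,
}

impl BatchWriter for MemoryWriter {
    fn write_batch(&mut self, batch: Batch) -> Result<(), String> {
        self.rows_written += batch.len();
        Ok(())
    }
}

impl Destination for MemoryDestination {
    type Writer = MemoryWriter;
    fn writer(&self, table: &str) -> MemoryWriter {
        MemoryWriter { table: table.to_string(), rows_written: 0 }
    }
}

fn main() {
    let dest = MemoryDestination;
    let mut w = dest.writer("definitions");
    w.write_batch(vec!["Foo".into(), "Bar".into()]).unwrap();
    assert_eq!(w.rows_written, 2);
    println!("wrote {} rows to {}", w.rows_written, w.table);
}
```

Because the indexer only depends on the traits, swapping ClickHouse for another store (or an in-memory destination in tests) requires no pipeline changes; this is the property the design doc should call out.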

Key files:

  • crates/gkg-server/src/indexer/modules/code/arrow_converter.rs - graph data to Arrow conversion
  • crates/clickhouse-client/src/arrow_client.rs - Arrow IPC serialization to ClickHouse
  • crates/etl-engine/src/destination.rs - pluggable storage traits

ETL engine and server integration

The gkg-server indexer uses NATS for message brokering and the ETL engine's module/handler system. Code indexing and SDLC indexing are separate modules registered with the engine.
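
The registration pattern can be sketched as follows. This is an assumption-laden simplification: the real Module/Handler traits dispatch NATS messages asynchronously, and every name below other than Module and Engine is invented for illustration.

```rust
// Simplified, synchronous sketch of module registration and dispatch;
// the real engine consumes NATS subjects asynchronously.

trait Module {
    fn name(&self) -> &'static str;
    /// Handle one message payload addressed to this module.
    fn handle(&self, payload: &str) -> String;
}

/// Stand-in for the code indexing module registered with the engine.
struct CodeIndexingModule;

impl Module for CodeIndexingModule {
    fn name(&self) -> &'static str {
        "code"
    }
    fn handle(&self, payload: &str) -> String {
        format!("indexing repo {payload}")
    }
}

/// Minimal engine: a registry of modules keyed by name.
#[derive(Default)]
struct Engine {
    modules: Vec<Box<dyn Module>>,
}

impl Engine {
    fn register(&mut self, module: Box<dyn Module>) {
        self.modules.push(module);
    }
    /// Route a payload to the module whose name matches.
    fn dispatch(&self, module_name: &str, payload: &str) -> Option<String> {
        self.modules
            .iter()
            .find(|m| m.name() == module_name)
            .map(|m| m.handle(payload))
    }
}

fn main() {
    let mut engine = Engine::default();
    engine.register(Box::new(CodeIndexingModule));
    let out = engine.dispatch("code", "gitlab-org/gitlab").unwrap();
    assert_eq!(out, "indexing repo gitlab-org/gitlab");
    println!("{out}");
}
```

Keeping code indexing and SDLC indexing as separately registered modules means each can evolve (or be disabled) independently, which the design doc should note when describing the server integration.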

Key files:

  • crates/gkg-server/src/indexer/mod.rs - indexer initialization, module registration
  • crates/etl-engine/src/engine.rs - Engine, EngineBuilder
  • crates/etl-engine/src/module.rs - Module, Handler traits

What the design doc should cover

  1. Updated architecture diagram reflecting the current pipeline (file discovery -> parse -> analyze -> Arrow -> ClickHouse), not the old Kuzu-based flow
  2. Parser architecture: the three backends, language support matrix, extraction types
  3. Indexer pipeline: streaming concurrency model, the semaphore/rayon approach, GraphData output
  4. Relationship type catalog: the 100+ relationship types and what they represent
  5. Storage layer: the Arrow adapter, pluggable Destination/BatchWriter traits, ClickHouse integration
  6. Server integration: NATS, ETL engine, module system
  7. How this differs from the original epic description (Kuzu is gone, ClickHouse is in, server-side architecture changed)

/cc @michaelusa @jgdoyon1 @bohdanpk

Edited Feb 08, 2026 by Michael Angelo Rivera