[gkg][indexer] high memory utilization on gitlab monolith
Problem Statement
The Knowledge Graph indexer (`gkg`) currently exhibits high memory utilization when processing large repositories, such as the GitLab monolith (`gitlab-org/gitlab`). Recent performance profiling shows that a full indexing job can consume over 4 GB of RAM, with anecdotal reports from team members observing peaks as high as 6 GB on their machines.
This level of memory consumption poses two significant risks:
- Blocker for Server-Side Deployment: As discussed in the Knowledge Graph Server Architecture (&17518), such high memory usage is likely unsustainable in a production server environment where resources are shared. This could prevent the Knowledge Graph from being deployed at scale on GitLab.com.
- User Experience: For developers running the indexer locally, especially on resource-constrained machines, a multi-gigabyte memory spike can degrade system performance and lead to poor user experience.
Addressing this is critical for the success of both the client-side tool and its future integration into the broader GitLab platform.
Analysis & Root Cause
Investigations and performance profiling sessions have identified the root cause of the high memory usage.
- Performance Profile Breakdown: A profiling run breaks down the indexing job into three main phases:
  - File Discovery (`gitalisk`): ~10% of runtime. Low memory and CPU usage.
  - Parsing Phase: ~40-45% of runtime. This phase is highly parallelized and CPU-intensive, as it parses each file into an AST.
  - Analysis Phase: ~50-55% of runtime. This phase is single-threaded and has low CPU utilization. Critically, it holds all parsed data structures (definitions, imports, and references) from the previous phase in memory to build the graph relationships. The memory peak occurs during this phase.
- Identifying the Culprit (Reference Collection): An experiment conducted during a team sync involved disabling the collection of code references and running the indexer again. The results were dramatic: peak memory usage dropped from ~5-9 GB down to ~1.7 GB. This confirms that the primary source of memory allocation is the storage and processing of `ReferenceInfo` data structures.
- Root Cause: FQN Data Duplication: The underlying issue is a massive duplication of data within our `Fqn` (Fully Qualified Name) structs. The current implementation results in:
  - For every single reference found in the codebase (which can number in the millions), a new `Fqn` struct is allocated.
  - Each `Fqn` struct contains a `Vec<FqnPart>`, which stores the string components of the name (e.g., `["GitLab", "User", "find_by_id"]`).
  - This means that common FQN prefixes (e.g., the module or class names) are duplicated in memory thousands or millions of times, leading to an explosion of `Vec` and `String` allocations (see the sketch after this list).
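For illustration, here is a minimal, self-contained sketch of that duplication pattern. The struct shapes are assumptions for the example (a `FqnPart` wrapping a `String`, an `Fqn` holding a `Vec<FqnPart>`); the real definitions in `gitlab-code-parser` may differ:

```rust
// Hypothetical shapes for illustration only; the real gitlab-code-parser types may differ.
#[derive(Clone)]
struct FqnPart(String);

#[derive(Clone)]
struct Fqn {
    parts: Vec<FqnPart>,
}

fn main() {
    // One FQN whose prefix ["GitLab", "User"] is shared by many references in the codebase.
    let definition_fqn = Fqn {
        parts: vec![
            FqnPart("GitLab".to_string()),
            FqnPart("User".to_string()),
            FqnPart("find_by_id".to_string()),
        ],
    };

    // Each clone re-allocates every String in the Vec, so common prefixes like
    // "GitLab" and "User" end up duplicated once per reference.
    let references: Vec<Fqn> = (0..1_000_000).map(|_| definition_fqn.clone()).collect();
    println!(
        "holding {} independent copies of the same FQN strings",
        references.len()
    );
}
```

Every `clone()` here re-allocates every string in the FQN, which is exactly the pattern that multiplies common prefixes across millions of references.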
Supporting Evidence & Statistics
The following statistics were captured from a full indexing run on the GitLab monolith (`gdk/gitlab`), which consists of ~57,000 files.
Output of `gkg index gitlab --stats` (abridged):
```
2025-09-02T23:16:08.753136Z INFO Database loading completed successfully:
2025-09-02T23:16:08.753149Z INFO Schema Stats: 7 tables (4 node, 3 rel), 343266 nodes, 2115502 relationships
Tables: DirectoryNode, DefinitionNode, ImportedSymbolNode, FileNode, DIRECTORY_RELATIONSHIPS, FILE_RELATIONSHIPS, DEFINITION_RELATIONSHIPS
2025-09-02T23:16:08.787234Z INFO ✅ Repository indexing completed for 'gitlab' in 7.691678459s
2025-09-02T23:16:08.787247Z INFO 📊 Final results: 100.0% complete - 57131 processed, 3 skipped, 0 errors
...
2025-09-02T23:16:09.266119Z INFO ✅ Workspace indexing completed in 8.63 seconds
...
Command being timed: "gkg index gitlab --stats"
User time (seconds): 43.00
System time (seconds): 11.37
Percent of CPU this job got: 620%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.76
Maximum resident set size (kbytes): 6701024
...
```
Key Metrics:
- Peak Memory: 6,701,024 KB (~6.4 GB)
- Total Time: 8.76 seconds
- CPU Utilization: 620% (indicating heavy parallel processing during the parsing phase)
Proposed Solution
As discussed during our technical sync, the most effective solution is to de-duplicate the FQN data using a combination of a central interned store and smart pointers.
- Option 1: String Interning for `FqnPart`s: Within the scope of a single file analysis in the `gitlab-code-parser`, we will introduce a central store (e.g., a `HashSet`) for all unique `FqnPart` instances. This ensures that each unique string component of an FQN (like `"User"`, `"ActiveRecord"`, etc.) is allocated only once.
- Option 2: Refactor `Fqn` to use `Arc`: The `Fqn` struct will be refactored from `Vec<FqnPart>` to `Vec<Arc<FqnPart>>`. `Arc` (Atomically Reference Counted pointer) is a lightweight smart pointer (typically 8 bytes). Instead of cloning the entire `FqnPart` struct, we will only clone the `Arc`, which is a cheap operation that simply increments a reference counter. A sketch combining both changes follows this list.
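The following sketch combines both changes under stated assumptions: the interner is shown as a `HashMap` keyed by the part string rather than the `HashSet` mentioned above, and all type and field names (`FqnPartInterner`, `parts`, etc.) are illustrative rather than the actual `gitlab-code-parser` API:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical types for illustration; real definitions may differ.
struct FqnPart(String);

struct Fqn {
    parts: Vec<Arc<FqnPart>>, // 8-byte pointers instead of owned FqnParts
}

// Central store scoped to a single file analysis: each unique part string is
// allocated once and handed out as a cheap Arc clone afterwards.
#[derive(Default)]
struct FqnPartInterner {
    parts: HashMap<String, Arc<FqnPart>>,
}

impl FqnPartInterner {
    fn intern(&mut self, part: &str) -> Arc<FqnPart> {
        self.parts
            .entry(part.to_string())
            .or_insert_with(|| Arc::new(FqnPart(part.to_string())))
            .clone() // clones the Arc (refcount bump), not the string
    }
}

fn main() {
    let mut interner = FqnPartInterner::default();

    // Both FQNs share the same "GitLab" and "User" allocations.
    let a = Fqn {
        parts: vec![
            interner.intern("GitLab"),
            interner.intern("User"),
            interner.intern("find_by_id"),
        ],
    };
    let b = Fqn {
        parts: vec![interner.intern("GitLab"), interner.intern("User")],
    };

    assert!(Arc::ptr_eq(&a.parts[0], &b.parts[0])); // same allocation reused
    assert!(Arc::ptr_eq(&a.parts[1], &b.parts[1]));
    println!("unique FqnPart allocations: {}", interner.parts.len()); // 3, not 5
}
```

Because every `intern` call for an already-seen part only bumps a reference count, the memory cost of a reference's FQN drops to one 8-byte pointer per part plus a single shared allocation per unique string.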
Expected Outcome: This change will dramatically reduce memory consumption. Instead of storing millions of duplicated strings, we will store each unique string once and reference it via inexpensive pointers. This should bring memory usage down to a level acceptable for both server-side deployment and local developer machines.
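As a rough, back-of-the-envelope illustration only (all counts and sizes below are assumed for the example, not measured on the monolith), the scale of the saving looks like this:

```rust
// Illustrative estimate only: reference counts, FQN depth, and string sizes are assumptions.
fn main() {
    let references: u64 = 2_000_000;         // assumed number of ReferenceInfo entries
    let parts_per_fqn: u64 = 4;              // assumed average FQN depth
    let bytes_per_owned_part: u64 = 24 + 16; // assumed String header (24 B) + short heap payload

    // Before: every reference owns fresh copies of all of its FqnPart strings.
    let before = references * parts_per_fqn * bytes_per_owned_part;
    // After: every reference stores one 8-byte Arc pointer per part; the unique strings
    // live once in the interner and are negligible by comparison.
    let after = references * parts_per_fqn * 8;

    println!("owned FqnParts : ~{} MB", before / 1_000_000); // ~320 MB under these assumptions
    println!("Arc<FqnPart>   : ~{} MB", after / 1_000_000);  // ~64 MB under these assumptions
}
```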
Related Epics & Issues
- Parent Epic: Knowledge Graph Core Indexer (&17517)
- Top-Level Epic: Knowledge Graph First Iteration (&17514)
- Feature Epic: Knowledge Graph Phase 3 – Inter-file references (&9) - This is the phase most impacted by the high memory usage from reference collection.
- Blocked Epic: Knowledge Graph Server Architecture (&17518) - Server-side deployment is blocked until this is resolved.