[gkg][indexer] high memory utilization on gitlab monolith
Problem Statement
The Knowledge Graph indexer (`gkg`) currently exhibits high memory utilization when processing large repositories, such as the GitLab monolith (`gitlab-org/gitlab`). Recent performance profiling shows that a full indexing job can consume over 4 GB of RAM, with anecdotal reports from team members observing peaks as high as 6 GB on their machines.
This level of memory consumption poses two significant risks:
- Blocker for Server-Side Deployment: As discussed in the Knowledge Graph Server Architecture (&17518), such high memory usage is likely unsustainable in a production server environment where resources are shared. This could prevent the Knowledge Graph from being deployed at scale on GitLab.com.
- User Experience: For developers running the indexer locally, especially on resource-constrained machines, a multi-gigabyte memory spike can degrade system performance and lead to poor user experience.
Addressing this is critical for the success of both the client-side tool and its future integration into the broader GitLab platform.
Analysis & Root Cause
Investigations and performance profiling sessions have identified the root cause of the high memory usage.
- Performance Profile Breakdown: A profiling run breaks down the indexing job into three main phases:
  - File Discovery (`gitalisk`): ~10% of runtime. Low memory and CPU usage.
  - Parsing Phase: ~40-45% of runtime. This phase is highly parallelized and CPU-intensive, as it parses each file into an AST.
  - Analysis Phase: ~50-55% of runtime. This phase is single-threaded and has low CPU utilization. Critically, it holds all parsed data structures (definitions, imports, and references) from the previous phase in memory to build the graph relationships. The memory peak occurs during this phase.
- Identifying the Culprit (Reference Collection): An experiment conducted during a team sync involved disabling the collection of code references and running the indexer again. The results were dramatic: peak memory usage dropped from ~5-9 GB down to ~1.7 GB. This confirms that the primary source of memory allocation is the storage and processing of `ReferenceInfo` data structures.
- Root Cause: FQN Data Duplication: The underlying issue is a massive duplication of data within our `Fqn` (Fully Qualified Name) structs. The current implementation results in:
  - For every single reference found in the codebase (which can number in the millions), a new `Fqn` struct is allocated.
  - Each `Fqn` struct contains a `Vec<FqnPart>`, which stores the string components of the name (e.g., `["GitLab", "User", "find_by_id"]`).
  - This means that common FQN prefixes (e.g., the module or class names) are duplicated in memory thousands or millions of times, leading to an explosion of `Vec` and `String` allocations (see the sketch after this list).
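For illustration, here is a minimal, self-contained sketch of that duplication pattern. The struct shapes are assumptions for the example (a `FqnPart` wrapping a `String`, an `Fqn` holding a `Vec<FqnPart>`); the real definitions in `gitlab-code-parser` may differ:

```rust
// Hypothetical shapes for illustration only; the real gitlab-code-parser types may differ.
#[derive(Clone)]
struct FqnPart(String);

#[derive(Clone)]
struct Fqn {
    parts: Vec<FqnPart>,
}

fn main() {
    // One FQN whose prefix ["GitLab", "User"] is shared by many references in the codebase.
    let definition_fqn = Fqn {
        parts: vec![
            FqnPart("GitLab".to_string()),
            FqnPart("User".to_string()),
            FqnPart("find_by_id".to_string()),
        ],
    };

    // Each clone re-allocates every String in the Vec, so common prefixes like
    // "GitLab" and "User" end up duplicated once per reference.
    let references: Vec<Fqn> = (0..1_000_000).map(|_| definition_fqn.clone()).collect();
    println!(
        "holding {} independent copies of the same FQN strings",
        references.len()
    );
}
```

Every `clone()` here re-allocates every string in the FQN, which is exactly the pattern that multiplies common prefixes across millions of references.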
Supporting Evidence & Statistics
The following statistics were captured from a full indexing run on the GitLab monolith (`gdk/gitlab`), which consists of ~57,000 files.
Output of `gkg index gitlab --stats` (abridged):
```
2025-09-02T23:16:08.753136Z INFO Database loading completed successfully:
2025-09-02T23:16:08.753149Z INFO Schema Stats: 7 tables (4 node, 3 rel), 343266 nodes, 2115502 relationships
Tables: DirectoryNode, DefinitionNode, ImportedSymbolNode, FileNode, DIRECTORY_RELATIONSHIPS, FILE_RELATIONSHIPS, DEFINITION_RELATIONSHIPS
2025-09-02T23:16:08.787234Z INFO ✅ Repository indexing completed for 'gitlab' in 7.691678459s
2025-09-02T23:16:08.787247Z INFO 📊 Final results: 100.0% complete - 57131 processed, 3 skipped, 0 errors
...
2025-09-02T23:16:09.266119Z INFO ✅ Workspace indexing completed in 8.63 seconds
...
Command being timed: "gkg index gitlab --stats"
User time (seconds): 43.00
System time (seconds): 11.37
Percent of CPU this job got: 620%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.76
Maximum resident set size (kbytes): 6701024
...
```
Key Metrics:
- Peak Memory: 6,701,024 KB (~6.4 GB)
- Total Time: 8.76 seconds
- CPU Utilization: 620% (indicating heavy parallel processing during the parsing phase)
Proposed Solution
As discussed during our technical sync, the most effective solution is to de-duplicate the FQN data using a combination of a central interned store and smart pointers.
- Option 1: String Interning for `FqnPart`s: Within the scope of a single file analysis in the `gitlab-code-parser`, we will introduce a central store (e.g., a `HashSet`) for all unique `FqnPart` instances. This ensures that each unique string component of an FQN (like `"User"`, `"ActiveRecord"`, etc.) is allocated only once.
- Option 2: Refactor `Fqn` to use `Arc`: The `Fqn` struct will be refactored from `Vec<FqnPart>` to `Vec<Arc<FqnPart>>`. `Arc` (Atomically Reference Counted pointer) is a lightweight smart pointer (typically 8 bytes). Instead of cloning the entire `FqnPart` struct, we will only clone the `Arc`, which is a cheap operation that simply increments a reference counter. A sketch combining both changes follows this list.
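The following sketch combines both changes under stated assumptions: the interner is shown as a `HashMap` keyed by the part string rather than the `HashSet` mentioned above, and all type and field names (`FqnPartInterner`, `parts`, etc.) are illustrative rather than the actual `gitlab-code-parser` API:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical types for illustration; real definitions may differ.
struct FqnPart(String);

struct Fqn {
    parts: Vec<Arc<FqnPart>>, // 8-byte pointers instead of owned FqnParts
}

// Central store scoped to a single file analysis: each unique part string is
// allocated once and handed out as a cheap Arc clone afterwards.
#[derive(Default)]
struct FqnPartInterner {
    parts: HashMap<String, Arc<FqnPart>>,
}

impl FqnPartInterner {
    fn intern(&mut self, part: &str) -> Arc<FqnPart> {
        self.parts
            .entry(part.to_string())
            .or_insert_with(|| Arc::new(FqnPart(part.to_string())))
            .clone() // clones the Arc (refcount bump), not the string
    }
}

fn main() {
    let mut interner = FqnPartInterner::default();

    // Both FQNs share the same "GitLab" and "User" allocations.
    let a = Fqn {
        parts: vec![
            interner.intern("GitLab"),
            interner.intern("User"),
            interner.intern("find_by_id"),
        ],
    };
    let b = Fqn {
        parts: vec![interner.intern("GitLab"), interner.intern("User")],
    };

    assert!(Arc::ptr_eq(&a.parts[0], &b.parts[0])); // same allocation reused
    assert!(Arc::ptr_eq(&a.parts[1], &b.parts[1]));
    println!("unique FqnPart allocations: {}", interner.parts.len()); // 3, not 5
}
```

Because every `intern` call for an already-seen part only bumps a reference count, the memory cost of a reference's FQN drops to one 8-byte pointer per part plus a single shared allocation per unique string.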
Expected Outcome: This change will dramatically reduce memory consumption. Instead of storing millions of duplicated strings, we will store each unique string once and reference it via inexpensive pointers. This should bring memory usage down to a level acceptable for both server-side deployment and local developer machines.
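As a rough, back-of-the-envelope illustration only (all counts and sizes below are assumed for the example, not measured on the monolith), the scale of the saving looks like this:

```rust
// Illustrative estimate only: reference counts, FQN depth, and string sizes are assumptions.
fn main() {
    let references: u64 = 2_000_000;         // assumed number of ReferenceInfo entries
    let parts_per_fqn: u64 = 4;              // assumed average FQN depth
    let bytes_per_owned_part: u64 = 24 + 16; // assumed String header (24 B) + short heap payload

    // Before: every reference owns fresh copies of all of its FqnPart strings.
    let before = references * parts_per_fqn * bytes_per_owned_part;
    // After: every reference stores one 8-byte Arc pointer per part; the unique strings
    // live once in the interner and are negligible by comparison.
    let after = references * parts_per_fqn * 8;

    println!("owned FqnParts : ~{} MB", before / 1_000_000); // ~320 MB under these assumptions
    println!("Arc<FqnPart>   : ~{} MB", after / 1_000_000);  // ~64 MB under these assumptions
}
```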
Related Epics & Issues
- Parent Epic: Knowledge Graph Core Indexer (&17517)
- Top-Level Epic: Knowledge Graph First Iteration (&17514)
- Feature Epic: Knowledge Graph Phase 3 – Inter-file references (&9) - This is the phase most impacted by the high memory usage from reference collection.
- Blocked Epic: Knowledge Graph Server Architecture (&17518) - Server-side deployment is blocked until this is resolved.