Zoekt: Allow configurable indexing Parallelism (Zoekt library, builder)
Summary
When zoekt is used as a library and several indexing operations run concurrently, memory usage can grow excessively: the default Parallelism setting (4) applies to each operation separately, so the number of shard-building goroutines multiplies across operations.
Steps to reproduce
- Run multiple concurrent indexing operations using zoekt as a library (e.g., 20 concurrent operations)
- Each operation uses the default Parallelism setting of 4
- Monitor memory usage as indexing progresses
What is the current behavior?
With the default Parallelism=4 and multiple concurrent indexing operations, memory usage grows very high:
- 20 concurrent indexing operations × Parallelism of 4 = up to 80 shard-building goroutines
- Each goroutine can hold over 100 MB of Gitaly and shard data in memory
- This can lead to OOM issues in production environments
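The arithmetic above can be sketched as a small helper. This is only a rough estimate: the 100 MiB per-goroutine figure is taken from the observation above as an assumed lower bound, and the function names are hypothetical, not part of zoekt.

```go
package main

import "fmt"

// worstCaseBuilds returns the upper bound on shard builds that can be in
// flight at once when every operation uses the same Parallelism value.
func worstCaseBuilds(operations, parallelism int) int {
	return operations * parallelism
}

// worstCaseMiB multiplies that bound by an assumed per-build memory
// footprint (in MiB) to get a rough memory estimate.
func worstCaseMiB(operations, parallelism, perBuildMiB int) int {
	return worstCaseBuilds(operations, parallelism) * perBuildMiB
}

func main() {
	const operations = 20   // concurrent indexing operations
	const perBuildMiB = 100 // assumed footprint per shard build

	for _, p := range []int{4, 2, 1} {
		fmt.Printf("Parallelism=%d: up to %d builds, roughly %d MiB\n",
			p, worstCaseBuilds(operations, p), worstCaseMiB(operations, p, perBuildMiB))
	}
}
```

Halving Parallelism halves the estimate, which is why a lower default matters most when many operations share one process.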
What is the expected correct behavior?
The Parallelism setting should:
- Be configurable when the library is used by external applications
- Have a sensible default that doesn't cause excessive memory usage when multiple instances are running
Technical details
The Parallelism option in Zoekt controls the maximum number of shards that can be built concurrently within a single Builder instance:
- In builder.go, the NewBuilder function creates a throttle channel with a buffer size equal to the Parallelism setting:

      throttle: make(chan int, opts.Parallelism),

- When flushing documents to disk, if Parallelism > 1, it uses this throttle to limit concurrent goroutines:

      if b.opts.Parallelism > 1 {
          b.building.Add(1)
          b.throttle <- 1
          go func() {
              done, err := b.buildShard(todo, shard)
              <-b.throttle
              // ...
              b.building.Done()
          }()
      } else {
          // No goroutines when we're not parallel
          // ...
      }

- By default, if not specified, Parallelism is set to 4 in the SetDefaults method.
Proposed solution
- In gitlab-zoekt-indexer: Make Parallelism configurable through the API
- Set a lower default value (1 or 2) for Parallelism when using zoekt as a library
- Add documentation about the memory implications of the Parallelism setting
- Consider adding optional runtime detection of concurrent operations to auto-scale Parallelism
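One way the proposals above could fit together is sketched below. Everything here is an assumption for illustration: `resolveParallelism`, `defaultLibraryParallelism`, and the idea of a process-wide cap on shard builds are hypothetical and do not exist in zoekt or gitlab-zoekt-indexer today.

```go
package main

import "fmt"

// defaultLibraryParallelism is a hypothetical lower default for library use.
const defaultLibraryParallelism = 2

// resolveParallelism picks a per-operation Parallelism value.
//
// requested is the caller-configured value (<= 0 means "not configured").
// concurrentOps is the number of indexing operations currently running.
// maxTotal caps shard builds across all operations (<= 0 disables the cap).
func resolveParallelism(requested, concurrentOps, maxTotal int) int {
	p := requested
	if p <= 0 {
		p = defaultLibraryParallelism
	}
	if maxTotal > 0 && concurrentOps > 0 {
		// Split the global budget evenly, but always allow one build.
		perOp := maxTotal / concurrentOps
		if perOp < 1 {
			perOp = 1
		}
		if p > perOp {
			p = perOp
		}
	}
	return p
}

func main() {
	// The result would be assigned to the Parallelism option before the
	// Builder is created.
	fmt.Println(resolveParallelism(0, 1, 0))   // unconfigured, no cap
	fmt.Println(resolveParallelism(4, 2, 16))  // budget allows the request
	fmt.Println(resolveParallelism(4, 20, 16)) // many operations share the budget
}
```

An even split is the simplest policy. It only takes effect when an operation starts, so a long-running operation keeps the value it was given even if more operations begin later.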