Zoekt: Speed up zero-downtime reindexing
Background
The zero-downtime reindexing feature for Zoekt has performance bottlenecks that make it too slow for large-scale deployments. The current implementation in RepoToReindexEventWorker processes repositories sequentially with a batch size of 500, regardless of which Zoekt nodes they belong to.
Current Flow:
- SchedulingService checks if repositories need reindexing via the Repository.should_be_reindexed scope
- Emits a single RepoToReindexEvent
- RepoToReindexEventWorker processes up to 500 repositories globally via Repository.should_be_reindexed.limit(BATCH_SIZE).create_bulk_tasks
- Tasks are created for repositories across all nodes, but processing happens sequentially
Performance Issues:
- Single global event creates a bottleneck for large numbers of repositories requiring reindexing
- No parallelization across different Zoekt nodes that could process their repositories independently
- The should_be_reindexed scope joins repositories with nodes to find schema version mismatches: joins(zoekt_index: :node).where("#{table_name}.schema_version != #{Node.table_name}.schema_version"), but doesn't leverage node-level parallelization
- All reindexing work is funneled through one worker instance, limiting throughput
This becomes particularly problematic during schema version updates where potentially thousands of repositories across multiple nodes need reindexing simultaneously.
Proposal
Implement a node-based parallel reindexing approach that creates one event per Zoekt node with mismatched schema versions, allowing for massive performance improvements through parallelization.
Proposed Changes:
1. Update SchedulingService to emit per-node events
Modify repo_to_reindex_check in SchedulingService to:
- Identify nodes that have repositories with mismatched schema versions
- Emit one RepoToReindexEvent per node instead of a single global event
- Pass the node ID in the event data
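The per-node selection described in step 1 can be sketched in plain Ruby. This is a minimal illustration only: ZoektNode, ZoektRepo, node_ids_needing_reindex and reindex_event_payloads are hypothetical in-memory stand-ins, not the actual SchedulingService code, which would presumably run a database query and publish the events through the event store.

```ruby
# Hypothetical in-memory stand-ins for the node and repository records.
ZoektNode = Struct.new(:id, :schema_version)
ZoektRepo = Struct.new(:id, :node_id, :schema_version)

# Returns the sorted IDs of nodes that own at least one repository whose
# schema version differs from the schema version of its node.
def node_ids_needing_reindex(nodes, repos)
  versions = nodes.to_h { |node| [node.id, node.schema_version] }
  repos
    .select { |repo| versions.key?(repo.node_id) && repo.schema_version != versions[repo.node_id] }
    .map(&:node_id)
    .uniq
    .sort
end

# Builds one event payload per affected node instead of one global payload.
def reindex_event_payloads(nodes, repos)
  node_ids_needing_reindex(nodes, repos).map { |id| { zoekt_node_id: id } }
end

nodes = [ZoektNode.new(1, 2), ZoektNode.new(2, 2), ZoektNode.new(3, 2)]
repos = [
  ZoektRepo.new(10, 1, 1), # outdated, on node 1
  ZoektRepo.new(11, 1, 2), # up to date
  ZoektRepo.new(12, 2, 2), # up to date
  ZoektRepo.new(13, 3, 1), # outdated, on node 3
  ZoektRepo.new(14, 3, 1)  # outdated, on node 3
]

payloads = reindex_event_payloads(nodes, repos)
p payloads.map { |payload| payload[:zoekt_node_id] } # => [1, 3]
```

Node 2 has no mismatched repositories, so no event is built for it; nodes 1 and 3 each get exactly one payload regardless of how many of their repositories are outdated.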
2. Update RepoToReindexEvent schema
Modify the event to accept node-specific data:
def schema
  {
    'type' => 'object',
    'properties' => {
      'zoekt_node_id' => { 'type' => ['integer', 'null'] }
    },
    'additionalProperties' => false
  }
end
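To make the schema's effect concrete, the following hand-rolled check mirrors its two rules (no additional properties, and zoekt_node_id being an integer or null). It is only a sketch: valid_reindex_payload? and ALLOWED_PAYLOAD_KEYS are hypothetical names, and a real deployment would rely on a JSON Schema validator rather than this simplified stand-in.

```ruby
# Keys permitted by the proposed schema ('additionalProperties' => false).
ALLOWED_PAYLOAD_KEYS = ['zoekt_node_id'].freeze

# Simplified stand-in for JSON Schema validation of the event data.
def valid_reindex_payload?(data)
  payload = data.transform_keys(&:to_s)
  # Reject any key that the schema does not declare.
  return false unless (payload.keys - ALLOWED_PAYLOAD_KEYS).empty?

  # The property is optional; when present it must be an integer or null.
  value = payload['zoekt_node_id']
  value.nil? || value.is_a?(Integer)
end

p valid_reindex_payload?({})                              # => true
p valid_reindex_payload?({ zoekt_node_id: 7 })            # => true
p valid_reindex_payload?({ zoekt_node_id: nil })          # => true
p valid_reindex_payload?({ zoekt_node_id: '7' })          # => false
p valid_reindex_payload?({ zoekt_node_id: 7, extra: 1 })  # => false
```

Because the property is not required, an empty payload stays valid, which is what allows the worker to fall back to the global behavior.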
3. Update RepoToReindexEventWorker for node-scoped processing
Modify the worker to process repositories for a specific node when node ID is provided:
def handle_event(event)
  return false unless ::Search::Zoekt.licensed_and_indexing_enabled?

  node_id = event.data[:zoekt_node_id]

  # Scope repositories to specific node if provided, otherwise use current global behavior
  scope = if node_id.present?
            Repository.should_be_reindexed.joins(zoekt_index: :node)
                      .where(zoekt_nodes: { id: node_id })
          else
            Repository.should_be_reindexed
          end

  return false if scope.with_pending_or_processing_tasks.exists?

  scope.limit(BATCH_SIZE).create_bulk_tasks
end
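The effect of node scoping on the pending-task guard can be illustrated with a small in-memory model. This is a sketch under stated assumptions: SimRepo and repos_to_schedule are hypothetical, the task_pending flag stands in for the with_pending_or_processing_tasks scope, and no database or event store is involved.

```ruby
SIM_BATCH_SIZE = 500 # same batch size as quoted in the text

# Hypothetical repository record: `outdated` models should_be_reindexed,
# `task_pending` models with_pending_or_processing_tasks.
SimRepo = Struct.new(:id, :node_id, :outdated, :task_pending)

# Returns the IDs of repositories that would get a task for this event, or
# false when the scope is still busy (mirroring the worker's early return).
def repos_to_schedule(repos, node_id: nil, batch_size: SIM_BATCH_SIZE)
  scope = repos.select(&:outdated)
  scope = scope.select { |repo| repo.node_id == node_id } unless node_id.nil?
  return false if scope.any?(&:task_pending)

  scope.first(batch_size).map(&:id)
end

repos = [
  SimRepo.new(1, 1, true, true),   # node 1: outdated, task still pending
  SimRepo.new(2, 2, true, false),  # node 2: outdated, idle
  SimRepo.new(3, 2, true, false),  # node 2: outdated, idle
  SimRepo.new(4, 2, false, false)  # node 2: already up to date
]

p repos_to_schedule(repos)             # => false
p repos_to_schedule(repos, node_id: 1) # => false
p repos_to_schedule(repos, node_id: 2) # => [2, 3]
```

In this model the global scope is blocked by the single pending task on node 1, while the node-scoped call for node 2 proceeds. That is the behavior the per-node events rely on: a busy node delays only its own batch.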
Benefits:
- Massive performance improvement: Multiple nodes can process their repositories in parallel instead of sequentially
- Better resource utilization: Each Zoekt node can work on its own repositories simultaneously
- Scalability: Scheduling throughput can grow with the number of nodes, since each node gets its own batch of up to BATCH_SIZE repositories
- Reduced bottlenecks: Eliminates the single global event bottleneck
- Backward compatibility: Falls back to current behavior when no node ID is specified
Implementation considerations:
- Maintain existing batch size limits per node to prevent overwhelming individual nodes
- Add feature flag for gradual rollout and easy rollback
This approach turns reindexing from a serial operation into a parallel one: the ceiling per scheduling run rises from BATCH_SIZE tasks overall to BATCH_SIZE tasks per node, so the potential speedup grows with the number of Zoekt nodes in the deployment.