Zoekt: Speed up zero-downtime reindexing

Background

The zero-downtime reindexing feature for Zoekt has performance bottlenecks that make it too slow for large-scale deployments. The current implementation in RepoToReindexEventWorker processes repositories sequentially in global batches of up to 500, regardless of which Zoekt nodes they belong to.

Current Flow:

SchedulingService checks if repositories need reindexing via Repository.should_be_reindexed scope
Emits a single RepoToReindexEvent
RepoToReindexEventWorker processes up to 500 repositories globally via Repository.should_be_reindexed.limit(BATCH_SIZE).create_bulk_tasks
Tasks are created for repositories across all nodes, but processing happens sequentially

Performance Issues:

Single global event creates a bottleneck for large numbers of repositories requiring reindexing
No parallelization across different Zoekt nodes that could process their repositories independently
The should_be_reindexed scope already joins repositories with nodes to find schema version mismatches: joins(zoekt_index: :node).where("#{table_name}.schema_version != #{Node.table_name}.schema_version"). The worker does not use that node association to split the work by node.
All reindexing work is funneled through one worker instance, limiting throughput

This becomes particularly problematic during schema version updates where potentially thousands of repositories across multiple nodes need reindexing simultaneously.

Proposal

Implement a node-based parallel reindexing approach that creates one event per Zoekt node with mismatched schema versions, so that each node's repositories can be scheduled independently and in parallel.

Proposed Changes:

1. Update SchedulingService to emit per-node events

Modify repo_to_reindex_check in SchedulingService to:

Identify nodes that have repositories with mismatched schema versions
Emit one RepoToReindexEvent per node instead of a single global event
Pass the node ID in the event data
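
As a rough sketch of this scheduling step, the plain-Ruby model below groups mismatched repositories by node and builds one event payload per affected node. `Repo`, `node_ids_needing_reindex`, and `build_reindex_events` are hypothetical stand-ins; the actual change would query through ActiveRecord and publish through the event store rather than work on in-memory structs.

```ruby
# In-memory model of the per-node scheduling step. All names here are
# hypothetical stand-ins, not the actual SchedulingService code.
Repo = Struct.new(:id, :node_id, :schema_version, keyword_init: true)

# Returns the sorted IDs of nodes that own at least one repository whose
# schema version differs from its node's schema version.
def node_ids_needing_reindex(repos, node_schema_versions)
  repos
    .select { |repo| repo.schema_version != node_schema_versions.fetch(repo.node_id) }
    .map(&:node_id)
    .uniq
    .sort
end

# Builds one event payload per affected node, shaped like the proposed schema.
def build_reindex_events(repos, node_schema_versions)
  node_ids_needing_reindex(repos, node_schema_versions).map do |node_id|
    { 'zoekt_node_id' => node_id }
  end
end

node_schema_versions = { 1 => 2, 2 => 2, 3 => 2 }
repos = [
  Repo.new(id: 10, node_id: 1, schema_version: 1), # stale
  Repo.new(id: 11, node_id: 1, schema_version: 2), # current
  Repo.new(id: 12, node_id: 2, schema_version: 2), # current
  Repo.new(id: 13, node_id: 3, schema_version: 1)  # stale
]

# Nodes 1 and 3 each get their own event; node 2 gets none.
build_reindex_events(repos, node_schema_versions).each { |payload| puts payload.inspect }
```

Nodes without stale repositories produce no event, so a deployment where only one node lags behind behaves much like the current single-event flow.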

2. Update RepoToReindexEvent schema

Modify the event to accept node-specific data:

def schema
  {
    'type' => 'object',
    'properties' => {
      'zoekt_node_id' => { 'type' => ['integer', 'null'] }
    },
    'additionalProperties' => false
  }
end
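
To make the accepted payload shapes concrete, here is a minimal hand-written check that approximates the schema above. It is not a JSON Schema validator, and `REINDEX_EVENT_KEYS` and `valid_reindex_payload?` are hypothetical names used only for illustration.

```ruby
# Hand-written approximation of the schema above; not a JSON Schema library.
# REINDEX_EVENT_KEYS and valid_reindex_payload? are hypothetical names.
REINDEX_EVENT_KEYS = ['zoekt_node_id'].freeze

def valid_reindex_payload?(payload)
  return false unless payload.is_a?(Hash)

  # 'additionalProperties' => false: no keys beyond the declared ones.
  return false unless (payload.keys - REINDEX_EVENT_KEYS).empty?

  # 'type' => ['integer', 'null']: the key is optional, and when present
  # its value must be an Integer or nil.
  node_id = payload['zoekt_node_id']
  node_id.nil? || node_id.is_a?(Integer)
end

puts valid_reindex_payload?({ 'zoekt_node_id' => 7 })   # per-node event
puts valid_reindex_payload?({})                         # event without a node ID
puts valid_reindex_payload?({ 'zoekt_node_id' => '7' }) # wrong type
puts valid_reindex_payload?({ 'node' => 7 })            # undeclared key
```

Because the property is not required and may be null, events published without a node ID remain valid, which is what allows the worker to fall back to the global behavior.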

3. Update RepoToReindexEventWorker for node-scoped processing

Modify the worker to process repositories for a specific node when a node ID is provided:

def handle_event(event)
  return false unless ::Search::Zoekt.licensed_and_indexing_enabled?

  node_id = event.data[:zoekt_node_id]

  # Scope repositories to a specific node if provided; otherwise keep the
  # current global behavior. should_be_reindexed already joins
  # zoekt_index: :node, so the node filter needs no extra join.
  scope = Repository.should_be_reindexed
  scope = scope.where(zoekt_nodes: { id: node_id }) if node_id.present?

  # Skip this run while repositories in the same scope still have pending
  # or processing tasks.
  return false if scope.with_pending_or_processing_tasks.exists?

  scope.limit(BATCH_SIZE).create_bulk_tasks
end
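
The effect of scoping the guard clause to a node can be illustrated with a small in-memory model. `StaleRepo`, its `pending` flag, and `repo_ids_to_schedule` are hypothetical stand-ins for the ActiveRecord scopes used by the worker; the point is only that a busy node is skipped without holding back the others.

```ruby
# In-memory model of the node-scoped worker logic. StaleRepo, the pending
# flag and repo_ids_to_schedule are hypothetical stand-ins for the
# ActiveRecord scopes in the worker.
StaleRepo = Struct.new(:id, :node_id, :pending, keyword_init: true)

BATCH_SIZE = 500

# Returns the IDs of repositories that would get a reindexing task.
# With node_id nil, both the guard and the batch are global.
def repo_ids_to_schedule(stale_repos, node_id: nil, batch_size: BATCH_SIZE)
  scope = node_id.nil? ? stale_repos : stale_repos.select { |r| r.node_id == node_id }

  # Mirrors `return false if scope.with_pending_or_processing_tasks.exists?`
  return [] if scope.any?(&:pending)

  scope.first(batch_size).map(&:id)
end

stale = [
  StaleRepo.new(id: 1, node_id: 1, pending: true),  # node 1 is still busy
  StaleRepo.new(id: 2, node_id: 1, pending: false),
  StaleRepo.new(id: 3, node_id: 2, pending: false),
  StaleRepo.new(id: 4, node_id: 2, pending: false)
]

puts repo_ids_to_schedule(stale).inspect             # [] (global guard trips on node 1)
puts repo_ids_to_schedule(stale, node_id: 1).inspect # [] (node 1 is skipped)
puts repo_ids_to_schedule(stale, node_id: 2).inspect # [3, 4] (node 2 proceeds)
```

In the global case a single repository with an unfinished task pauses scheduling for every node, while the node-scoped guard only pauses the node that is still working.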

Benefits:

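Higher throughput: Tasks for multiple nodes can be created in the same scheduling cycle instead of sharing one global batch
Better resource utilization: Each Zoekt node can work on its own repositories simultaneously
Scalability: Throughput can grow with the number of nodes that have stale repositories, up to one batch of 500 per node per worker run; the gain is smaller when stale repositories are concentrated on a few nodes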
Reduced bottlenecks: Eliminates the single global event bottleneck
Backward compatibility: Falls back to current behavior when no node ID is specified
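
As a back-of-the-envelope comparison (a model, not a measurement), assume each worker run schedules at most one batch of 500 and that a scope does not start its next batch until the previous one has finished. The helper names and the example distributions below are made up for illustration.

```ruby
# Back-of-the-envelope model, not a benchmark. Assumes each worker run
# schedules at most one batch and batches for the same scope do not overlap.
# stale_by_node maps a node ID to its number of stale repositories.
def global_runs(stale_by_node, batch_size: 500)
  (stale_by_node.values.sum.to_f / batch_size).ceil
end

# With one event per node, nodes advance independently, so the node with
# the most stale repositories determines the total number of runs.
def per_node_runs(stale_by_node, batch_size: 500)
  stale_by_node.values.map { |count| (count.to_f / batch_size).ceil }.max || 0
end

even   = { 1 => 2_000, 2 => 2_000, 3 => 2_000, 4 => 2_000 }
skewed = { 1 => 7_700, 2 => 100, 3 => 100, 4 => 100 }

puts "even:   global=#{global_runs(even)} per_node=#{per_node_runs(even)}"
# even:   global=16 per_node=4
puts "skewed: global=#{global_runs(skewed)} per_node=#{per_node_runs(skewed)}"
# skewed: global=16 per_node=16
```

Under these assumptions, 8,000 stale repositories spread evenly over four nodes need a quarter of the runs, while the same total concentrated on one node sees no reduction.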

Implementation considerations:

Maintain existing batch size limits per node to prevent overwhelming individual nodes
Add a feature flag for gradual rollout and easy rollback
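
One way the rollout gate could look is sketched below. The flag is passed in as a plain boolean because neither the flag name nor how it is looked up is settled here, and `reindex_event_payloads` is a hypothetical helper.

```ruby
# Sketch of gating per-node events behind a flag. The flag arrives as a
# plain boolean; its name and lookup are left open.
# reindex_event_payloads is a hypothetical helper.
def reindex_event_payloads(affected_node_ids, per_node_enabled:)
  # Nothing to reindex: emit no events in either mode.
  return [] if affected_node_ids.empty?

  # Flag off: a single event without a node ID, which the worker handles
  # with the existing global behavior.
  return [{}] unless per_node_enabled

  affected_node_ids.map { |node_id| { 'zoekt_node_id' => node_id } }
end

puts reindex_event_payloads([1, 3], per_node_enabled: false).length # 1
puts reindex_event_payloads([1, 3], per_node_enabled: true).length  # 2
```

Rolling back then only requires turning the flag off, since events without a node ID already take the global code path in the worker.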

This approach turns reindexing from a serial operation into a parallel one. The upper bound on tasks created per worker run grows from 500 in total to 500 per node with mismatched repositories, so the achievable speedup is bounded by the number of such nodes in the deployment.