Fix Zoekt indexing for namespaces with missing indices

What does this MR do and why?

The RolloutWorker selected only namespaces with mismatched replica counts, so namespaces with the correct replica count but missing indices were never processed for indexing.

This fix updates SelectionService to also select namespaces with missing indices, ensuring all enabled namespaces eventually get their indices created.

Changes:

  • Add each_batch_with_mismatched_replicas_or_missing_indices method
  • Update SelectionService to use the new selection method
  • Add comprehensive test coverage for the new method
  • Optimize performance by batching on base scope before filtering

Performance optimization: The method applies each_batch to the base scope first, then filters within each batch. This is cheaper than batching over a scope that is already filtered with GROUP BY and HAVING clauses: batching can follow simple primary key ordering, and the aggregation queries run against only one batch of rows at a time.
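The batch-then-filter order can be illustrated without a database. The following is a minimal plain-Ruby sketch, not the MR's code: `Row`, the `mismatched` lambda, and the batch size of 5 are hypothetical stand-ins for the base scope, the with_mismatched_replicas filter, and the real batch size. It records how many rows each filter call examines, to show that the costly step never sees more than one batch.

```ruby
# Hypothetical in-memory stand-in for rows of the base scope.
Row = Struct.new(:id, :replicas, :expected_replicas)

rows = (1..12).map { |id| Row.new(id, id % 3, 1) }

# Stand-in for the costly filter; records how many rows each call examined.
calls = []
mismatched = lambda do |batch|
  calls << batch.size
  batch.reject { |row| row.replicas == row.expected_replicas }
end

# Batch on the ID-ordered base set first, then filter inside each batch.
selected = rows.sort_by(&:id).each_slice(5).flat_map do |batch|
  mismatched.call(batch)
end

puts selected.map(&:id).inspect
puts calls.inspect
```

Each filter call is bounded by the batch size regardless of how many rows exist in total, which is the property the paragraph above relies on.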

Implementation Details

Method: each_batch_with_mismatched_replicas_or_missing_indices

The method processes namespaces in batches of 5000 (configurable):

def self.each_batch_with_mismatched_replicas_or_missing_indices(batch_size: 5000)
  processed_ids = Set.new
  
  each_batch(of: batch_size) do |batch|
    # Process namespaces with mismatched replicas in this batch
    batch.with_mismatched_replicas.each do |ns|
      processed_ids << ns.id
      yield(ns)
    end
    
    # Process namespaces with missing indices in this batch (skip duplicates)
    batch.with_missing_indices.each do |ns|
      next if processed_ids.include?(ns.id)
      yield(ns)
    end
  end
end

Why this approach?

  1. Performance: Batching happens on simple primary key ordering, not on aggregated results
  2. Efficiency: Complex GROUP BY queries only run on small batches (5000 records)
  3. Deduplication: Uses Set to track processed IDs and avoid duplicates
  4. Consistency: Follows the same pattern as other batch methods in the codebase
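The deduplication point can be checked in isolation. This is a plain-Ruby sketch under stated assumptions: `Ns` and `each_selected` are hypothetical names, and the two boolean attributes stand in for the with_mismatched_replicas and with_missing_indices scopes. It mirrors the two-pass structure of the method and shows that a namespace matching both conditions is yielded once.

```ruby
require "set"

# Hypothetical stand-in for an enabled namespace row.
Ns = Struct.new(:id, :mismatched_replicas, :missing_indices)

# Mirrors the two-pass structure: mismatched replicas first, then missing
# indices, skipping IDs that the first pass already yielded.
def each_selected(namespaces, batch_size: 2)
  processed_ids = Set.new

  namespaces.each_slice(batch_size) do |batch|
    batch.select(&:mismatched_replicas).each do |ns|
      processed_ids << ns.id
      yield ns
    end

    batch.select(&:missing_indices).each do |ns|
      next if processed_ids.include?(ns.id)

      yield ns
    end
  end
end

namespaces = [
  Ns.new(1, true, false),
  Ns.new(2, true, true), # matches both conditions
  Ns.new(3, false, true),
  Ns.new(4, false, false)
]

yielded = []
each_selected(namespaces) { |ns| yielded << ns.id }
puts yielded.inspect
```

Namespace 2 satisfies both conditions but appears once in the output, and namespace 4 satisfies neither and is never yielded.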

Alternative approaches considered:

  • Using with_mismatched_replicas.or(with_missing_indices) scope: Failed because ActiveRecord's or requires both relations to be structurally compatible, and the two scopes differ in query structure
  • Calling each_batch on filtered scopes: Poor performance due to batching through aggregated results

SelectionService Changes

The SelectionService now calls the new method:

def fetch_enabled_namespace_for_indexing
  [].tap do |batch|
    ::Search::Zoekt::EnabledNamespace
      .with_rollout_allowed
      .each_batch_with_mismatched_replicas_or_missing_indices do |ns|
        batch << ns
        break if batch.count >= max_batch_size
      end
  end
end
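The early exit relies on how break behaves inside a block: it terminates the method that yielded, so no further batches are loaded once the cap is reached, while tap still returns the collected array. A minimal plain-Ruby sketch of that control flow follows; `each_candidate`, `batches_loaded`, and `MAX_BATCH_SIZE` are hypothetical names used only for illustration.

```ruby
# Hypothetical cap, standing in for max_batch_size.
MAX_BATCH_SIZE = 4

# Hypothetical stand-in for the batched selection method: yields items
# batch by batch and records how many batches were actually loaded.
def each_candidate(batches, batches_loaded)
  batches.each do |batch|
    batches_loaded << batch
    batch.each { |item| yield item }
  end
end

batches_loaded = []
result = [].tap do |acc|
  each_candidate([[1, 2, 3], [4, 5, 6], [7, 8, 9]], batches_loaded) do |item|
    acc << item
    # break leaves each_candidate entirely, not just the current batch.
    break if acc.size >= MAX_BATCH_SIZE
  end
end

puts result.inspect
puts batches_loaded.size
```

The third batch is never touched, so the cost of a run is bounded by the cap rather than by the size of the table whenever enough candidates appear early.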

Database Impact

The query uses existing indexes on foreign keys and is limited by:

  • with_rollout_allowed scope (filters by rollout status)
  • each_batch processing (default 5000 records per batch on base scope)
  • SelectionService max_batch_size (default 128 namespaces)

No new indexes required. The performance characteristics are:

  • Base scope batching: O(n/5000) iterations through primary key
  • Per-batch filtering: Small aggregation queries on 5000 records max
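  • Memory usage: One batch plus the Set of processed IDs. The Set holds the ID of every namespace yielded for mismatched replicas until iteration ends; in SelectionService the break at max_batch_size keeps it small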
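To make the O(n/5000) figure concrete, here is a small worked calculation. The table size is a hypothetical number chosen for illustration; only the batch size of 5000 and the two filter passes per batch come from the description above. Queries that each_batch itself issues to find batch boundaries are not counted.

```ruby
# Hypothetical table size; the batch size is the default quoted above.
total_rows = 1_000_000
batch_size = 5_000

# Worst case: no early exit, so every batch is visited (ceiling division).
worst_case_batches = (total_rows + batch_size - 1) / batch_size

# Each batch runs two filter passes: mismatched replicas, missing indices.
worst_case_filter_queries = worst_case_batches * 2

puts worst_case_batches
puts worst_case_filter_queries
```

This is an upper bound: the SelectionService caller stops as soon as max_batch_size namespaces are collected, so a typical run may visit far fewer batches.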

How to set up and validate locally

  1. Enable Zoekt in GDK

  2. In Rails console, create a namespace with replicas but no indices:

    # Create an enabled namespace with replicas but no indices
    namespace = Group.first
    enabled_ns = Search::Zoekt::EnabledNamespace.create!(
      namespace: namespace,
      root_namespace_id: namespace.id
    )
    
    # Create a replica to match expected count
    node = Search::Zoekt::Node.first
    Search::Zoekt::Replica.create!(
      zoekt_enabled_namespace: enabled_ns,
      zoekt_node: node
    )

  3. Verify the namespace is selected by the new method:

    # Collect the IDs of the namespaces that would be selected
    selected_ids = []
    Search::Zoekt::EnabledNamespace
      .with_rollout_allowed
      .each_batch_with_mismatched_replicas_or_missing_indices do |ns|
        selected_ids << ns.id
      end
    selected_ids.include?(enabled_ns.id) # => should be true

  4. Run SelectionService and verify it selects the namespace:

    pool = Search::Zoekt::SelectionService.execute
    pool.enabled_namespaces.include?(enabled_ns)
    # => true

  5. Verify performance with a larger dataset:

    # Create multiple namespaces with missing indices
    10.times do |i|
      group = Group.create!(name: "test-group-#{i}", path: "test-group-#{i}")
      enabled_ns = Search::Zoekt::EnabledNamespace.create!(
        namespace: group,
        root_namespace_id: group.id
      )
      Search::Zoekt::Replica.create!(
        zoekt_enabled_namespace: enabled_ns,
        zoekt_node: Search::Zoekt::Node.first
      )
    end
    
    # Measure performance
    require 'benchmark'
    Benchmark.measure do
      Search::Zoekt::SelectionService.execute
    end

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dmitry Gruzd