Skip to content

Proposal: Implement leader-follower architecture for zoekt nodes to separate indexing and search workloads

Summary

Implement a leader-follower architecture for zoekt nodes to separate indexing and search workloads, eliminating CPU contention and improving search performance.

Problem

Currently, zoekt nodes handle both indexing and search operations on the same hardware, causing:

  • CPU contention between resource-intensive indexing and search operations
  • Degraded search performance during heavy indexing periods
  • No redundancy for high-availability search

Proposed Solution

Architecture Changes

  • Leader nodes (default mode): Handle indexing and generate zoekt shard files. Can serve search requests as fallback.
  • Follower nodes (new follow mode): Sync indices from leaders and dedicate 100% CPU to serving searches. Request sync indexing tasks from Rails.
  • Rails tasks: Creates new sync indexing task type. Follower nodes fetch index manifests and download shards from leader.
  • Search routing: Prefer follower replicas when available for optimal performance, fallback to leaders for backward compatibility and redundancy.

Benefits

  • Improved search performance during indexing operations
  • Independent scaling of search vs indexing capacity
  • Redundancy for high-availability search
  • Non-breaking change - gradual rollout possible
  • Existing searches against leader nodes continue working unchanged

Implementation Details

Sync Mechanism

  • HTTP-based file transfer: Leverages existing nginx gateway
  • Pull-based model: Followers control their sync rate via the existing heartbeat / task polling process
  • Index manifest: Leaders provide list of shard files and sizes over http endpoint
  • Incremental sync: Only changed files are downloaded
  • Atomic updates: Uses .new temporary files for safe swaps
  • Simple comparison: File size-based change detection. We could do checksum checks in the future.

Data Model

  • Nodes: Designated as leader or follower with new follow indexer mode
  • Replicas: Designated as leader or follower. Follower replicas can only have follower indices assigned to them.
  • Indices: Designated as leader or follower. Follower indices can only be assigned to follower nodes.

Sequence Diagram

Click to expand
sequenceDiagram
    participant LN as Leader Node
    participant R as Rails
    participant FN as Follower Node

    Note over LN: Indexing Process
    LN->>LN: Index zoekt shard
    LN->>LN: Generate shard files<br/>(multiple .zoekt files)

    Note over LN,R: Notify Rails
    LN->>R: Indexing complete callback<br/>(leader index updated)
    R->>R: Update leader index record
    R->>R: Find corresponding follower indices<br/>(on follower nodes)
    R->>R: Create sync indexing tasks<br/>for follower indices

    Note over FN,R: Follower Node Polling (Heartbeat)
    loop Periodic heartbeat
        FN->>R: Poll for sync tasks<br/>(follower node heartbeat)
        alt Sync task available
            R-->>FN: Return sync task<br/>(sync follower index from leader index)
            
            Note over FN,LN: Index Manifest Exchange
            FN->>LN: Request index manifest<br/>(from leader node)
            LN->>LN: List shard files and sizes
            LN-->>FN: Return index manifest
            
            Note over FN: Sync Process
            FN->>FN: Compare local vs index manifest<br/>Identify changed/new files
            
            loop For each changed file
                FN->>LN: Request individual file
                LN-->>FN: Stream file content
                FN->>FN: Save to temporary location
            end
            
            Note over FN: Atomic Update
            FN->>FN: Swap temporary files to final location
            FN->>FN: Clean up obsolete files
            FN->>FN: Reload indices
            
            Note over FN,R: Task Completion
            FN->>R: Indexing complete callback<br/>(follower index synced)
            R->>R: Update follower index status
            R->>R: Mark sync task complete
        else No tasks
            R-->>FN: Empty response (heartbeat acknowledged)
        end
    end

    Note over R: Rails Tracking
    R->>R: Track node heartbeats<br/>Monitor leader/follower designations

Risks and Mitigations

Risk Impact Mitigation
Network bandwidth for file transfers High network utilization during sync Implement compression, rate limiting if needed
Eventual consistency Stale search results on followers Monitor sync lag metrics, alert on thresholds
Storage costs doubling Increased infrastructure costs Start with critical indices, expand gradually
Follower node failures Reduced search capacity Automatic fallback to leader nodes

Success Metrics

  • Sync lag between leader and follower indices
  • CPU utilization separation (indexing vs search)

Open Questions

  1. What should the default heartbeat/polling interval for follower nodes be?
  2. Should we implement checksum verification for file transfers?
  3. How should we handle partial sync failures?
  4. Should we support multiple followers per leader in the future? This could improve search performance.
  5. If a leader node goes offline, what happens?
Edited by 🤖 GitLab Bot 🤖