Proposal: Implement leader-follower architecture for zoekt nodes to separate indexing and search workloads

Summary

Implement a leader-follower architecture for zoekt nodes to separate indexing and search workloads, eliminating CPU contention and improving search performance.

Problem

Currently, zoekt nodes handle both indexing and search operations on the same hardware, causing:

CPU contention between resource-intensive indexing and search operations
Degraded search performance during heavy indexing periods
No redundancy for high-availability search

Proposed Solution

Architecture Changes

Leader nodes (default mode): Handle indexing and generate zoekt shard files. Can serve search requests as fallback.
Follower nodes (new follow mode): Sync indices from leaders and dedicate 100% CPU to serving searches. Request sync indexing tasks from Rails.
Rails tasks: Creates new sync indexing task type. Follower nodes fetch index manifests and download shards from leader.
Search routing: Prefer follower replicas when available for optimal performance, fallback to leaders for backward compatibility and redundancy.

Benefits

✅ Improved search performance during indexing operations
✅ Independent scaling of search vs indexing capacity
✅ Redundancy for high-availability search
✅ Non-breaking change - gradual rollout possible
✅ Existing searches against leader nodes continue working unchanged

Implementation Details

Sync Mechanism

HTTP-based file transfer: Leverages existing nginx gateway
Pull-based model: Followers control their sync rate via the existing heartbeat / task polling process
Index manifest: Leaders provide list of shard files and sizes over http endpoint
Incremental sync: Only changed files are downloaded
Atomic updates: Uses .new temporary files for safe swaps
Simple comparison: File size-based change detection. We could do checksum checks in the future.

Data Model

Nodes: Designated as leader or follower with new follow indexer mode
Replicas: Designated as leader or follower. Follower replicas can only have follower indices assigned to them.
Indices: Designated as leader or follower. Follower indices can only be assigned to follower nodes.

Sequence Diagram

Click to expand

sequenceDiagram
    participant LN as Leader Node
    participant R as Rails
    participant FN as Follower Node

    Note over LN: Indexing Process
    LN->>LN: Index zoekt shard
    LN->>LN: Generate shard files<br/>(multiple .zoekt files)

    Note over LN,R: Notify Rails
    LN->>R: Indexing complete callback<br/>(leader index updated)
    R->>R: Update leader index record
    R->>R: Find corresponding follower indices<br/>(on follower nodes)
    R->>R: Create sync indexing tasks<br/>for follower indices

    Note over FN,R: Follower Node Polling (Heartbeat)
    loop Periodic heartbeat
        FN->>R: Poll for sync tasks<br/>(follower node heartbeat)
        alt Sync task available
            R-->>FN: Return sync task<br/>(sync follower index from leader index)
            
            Note over FN,LN: Index Manifest Exchange
            FN->>LN: Request index manifest<br/>(from leader node)
            LN->>LN: List shard files and sizes
            LN-->>FN: Return index manifest
            
            Note over FN: Sync Process
            FN->>FN: Compare local vs index manifest<br/>Identify changed/new files
            
            loop For each changed file
                FN->>LN: Request individual file
                LN-->>FN: Stream file content
                FN->>FN: Save to temporary location
            end
            
            Note over FN: Atomic Update
            FN->>FN: Swap temporary files to final location
            FN->>FN: Clean up obsolete files
            FN->>FN: Reload indices
            
            Note over FN,R: Task Completion
            FN->>R: Indexing complete callback<br/>(follower index synced)
            R->>R: Update follower index status
            R->>R: Mark sync task complete
        else No tasks
            R-->>FN: Empty response (heartbeat acknowledged)
        end
    end

    Note over R: Rails Tracking
    R->>R: Track node heartbeats<br/>Monitor leader/follower designations

Risks and Mitigations

Risk	Impact	Mitigation
Network bandwidth for file transfers	High network utilization during sync	Implement compression, rate limiting if needed
Eventual consistency	Stale search results on followers	Monitor sync lag metrics, alert on thresholds
Storage costs doubling	Increased infrastructure costs	Start with critical indices, expand gradually
Follower node failures	Reduced search capacity	Automatic fallback to leader nodes

Success Metrics

Sync lag between leader and follower indices
CPU utilization separation (indexing vs search)

Open Questions

What should the default heartbeat/polling interval for follower nodes be?
Should we implement checksum verification for file transfers?
How should we handle partial sync failures?
Should we support multiple followers per leader in the future? This could improve search performance.
If a leader node goes offline, what happens?

Edited Sep 19, 2025 by 🤖 GitLab Bot 🤖