Proposal: Implement leader-follower architecture for zoekt nodes to separate indexing and search workloads
Summary
Implement a leader-follower architecture for zoekt nodes to separate indexing and search workloads, eliminating CPU contention and improving search performance.
Problem
Currently, zoekt nodes handle both indexing and search operations on the same hardware, causing:
- CPU contention between resource-intensive indexing and search operations
- Degraded search performance during heavy indexing periods
- No redundancy for high-availability search
Proposed Solution
Architecture Changes
- Leader nodes (default mode): Handle indexing and generate zoekt shard files. Can serve search requests as fallback.
-
Follower nodes (new
followmode): Sync indices from leaders and dedicate 100% CPU to serving searches. Requestsyncindexing tasks from Rails. -
Rails tasks: Creates new
syncindexing task type. Follower nodes fetch index manifests and download shards from leader. - Search routing: Prefer follower replicas when available for optimal performance, fallback to leaders for backward compatibility and redundancy.
Benefits
-
✅ Improved search performance during indexing operations -
✅ Independent scaling of search vs indexing capacity -
✅ Redundancy for high-availability search -
✅ Non-breaking change - gradual rollout possible -
✅ Existing searches against leader nodes continue working unchanged
Implementation Details
Sync Mechanism
- HTTP-based file transfer: Leverages existing nginx gateway
- Pull-based model: Followers control their sync rate via the existing heartbeat / task polling process
- Index manifest: Leaders provide list of shard files and sizes over http endpoint
- Incremental sync: Only changed files are downloaded
-
Atomic updates: Uses
.newtemporary files for safe swaps - Simple comparison: File size-based change detection. We could do checksum checks in the future.
Data Model
-
Nodes: Designated as leader or follower with new
followindexer mode - Replicas: Designated as leader or follower. Follower replicas can only have follower indices assigned to them.
- Indices: Designated as leader or follower. Follower indices can only be assigned to follower nodes.
Sequence Diagram
Click to expand
sequenceDiagram
participant LN as Leader Node
participant R as Rails
participant FN as Follower Node
Note over LN: Indexing Process
LN->>LN: Index zoekt shard
LN->>LN: Generate shard files<br/>(multiple .zoekt files)
Note over LN,R: Notify Rails
LN->>R: Indexing complete callback<br/>(leader index updated)
R->>R: Update leader index record
R->>R: Find corresponding follower indices<br/>(on follower nodes)
R->>R: Create sync indexing tasks<br/>for follower indices
Note over FN,R: Follower Node Polling (Heartbeat)
loop Periodic heartbeat
FN->>R: Poll for sync tasks<br/>(follower node heartbeat)
alt Sync task available
R-->>FN: Return sync task<br/>(sync follower index from leader index)
Note over FN,LN: Index Manifest Exchange
FN->>LN: Request index manifest<br/>(from leader node)
LN->>LN: List shard files and sizes
LN-->>FN: Return index manifest
Note over FN: Sync Process
FN->>FN: Compare local vs index manifest<br/>Identify changed/new files
loop For each changed file
FN->>LN: Request individual file
LN-->>FN: Stream file content
FN->>FN: Save to temporary location
end
Note over FN: Atomic Update
FN->>FN: Swap temporary files to final location
FN->>FN: Clean up obsolete files
FN->>FN: Reload indices
Note over FN,R: Task Completion
FN->>R: Indexing complete callback<br/>(follower index synced)
R->>R: Update follower index status
R->>R: Mark sync task complete
else No tasks
R-->>FN: Empty response (heartbeat acknowledged)
end
end
Note over R: Rails Tracking
R->>R: Track node heartbeats<br/>Monitor leader/follower designations
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Network bandwidth for file transfers | High network utilization during sync | Implement compression, rate limiting if needed |
| Eventual consistency | Stale search results on followers | Monitor sync lag metrics, alert on thresholds |
| Storage costs doubling | Increased infrastructure costs | Start with critical indices, expand gradually |
| Follower node failures | Reduced search capacity | Automatic fallback to leader nodes |
Success Metrics
- Sync lag between leader and follower indices
- CPU utilization separation (indexing vs search)
Open Questions
- What should the default heartbeat/polling interval for follower nodes be?
- Should we implement checksum verification for file transfers?
- How should we handle partial sync failures?
- Should we support multiple followers per leader in the future? This could improve search performance.
- If a leader node goes offline, what happens?
Edited by 🤖 GitLab Bot 🤖