Zoekt: sharding strategy
Background
Currently, we can't index some large namespaces because the repository data can't fit on the Zoekt node.
Proposal
We can add a layer of Zoekt::Replica to connect Zoekt::Index with Zoekt::EnabledNamespace. Each replica can have many indices. Each index will point to repositories of a large namespace. A Zoekt::Replica will have an enum field state with the values ready and pending. A replica will be considered ready only when all its indices are in the ready state.
flowchart TD
E[Zoekt::EnabledNamespace] --> R1["Zoekt::Replica(id: 1, state: :ready)"]
E --> R2["Zoekt::Replica(id: 2, state: :ready)"]
R1 --> I1("Zoekt::Index(id: 1)")
R1 --> I2("Zoekt::Index(id: 2)")
R1 --> I4("Zoekt::Index(id: 20) reallocation")
R2 --> I3("Zoekt::Index(id: 3)")
I1 --> PA("Zoekt::Repository A")
I2 --> PB("Zoekt::Repository B")
I2 --> PC("Zoekt::Repository C")
I3 --> PA2("Zoekt::Repository A")
I3 --> PB2("Zoekt::Repository B")
I3 --> PC2("Zoekt::Repository C")
I4 --> PB3("Zoekt::Repository B")
I4 --> PC3("Zoekt::Repository C")
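The relationships in the diagram and the readiness rule can be sketched in plain Ruby. This is only an illustration under stated assumptions: the Struct classes below are hypothetical stand-ins for the proposed ActiveRecord models, the index-level state values are assumed to mirror the replica ones, and the sketch derives the replica state from its indices, whereas the proposal stores it in an enum field. Treating a replica with no indices as pending is also an assumption.

```ruby
# Hypothetical plain-Ruby stand-ins for Zoekt::Index, Zoekt::Replica and
# Zoekt::EnabledNamespace. The real classes would be ActiveRecord models.
Index = Struct.new(:id, :state, :repositories, keyword_init: true) do
  def ready?
    state == :ready
  end
end

Replica = Struct.new(:id, :indices, keyword_init: true) do
  # A replica is ready only when all of its indices are ready.
  # Assumption: a replica without any indices counts as pending.
  def state
    (!indices.empty? && indices.all?(&:ready?)) ? :ready : :pending
  end

  def ready?
    state == :ready
  end
end

EnabledNamespace = Struct.new(:replicas, keyword_init: true) do
  # Replicas that could serve searches right now.
  def ready_replicas
    replicas.select(&:ready?)
  end
end

# One replica whose second index is still being built.
replica = Replica.new(id: 1, indices: [
  Index.new(id: 1, state: :ready, repositories: ["A"]),
  Index.new(id: 2, state: :pending, repositories: ["B", "C"])
])
replica.state # :pending until index 2 becomes ready
```

In the stored-enum design the proposal describes, the same rule would presumably be applied whenever an index changes state, so that the replica's state field is kept in sync.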
Later, we're going to introduce replicas and start having different replica
indices for the same enabled namespace. Each index of a replica will
High-level changes required:
- Update Search::Zoekt::SchedulingService to handle/assign replicated indices properly
- Update search code. Every time we perform a group-level search, we'll need to query all replicas. For example, zoekt_enabled_namespace.replicas.ready.first (we can re-use some logic from !147077 (merged))
- Most likely, we'll need to update the indexing code as well
- TBD
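As a rough illustration of the search-side change, the sketch below picks the first ready replica and returns the indices a group-level search would fan out to. Everything here is hypothetical: the module, the Struct stand-ins, and the searchable_index_ids helper are not part of the existing code, and the real implementation would go through ActiveRecord scopes such as the replicas.ready chain mentioned above.

```ruby
module ReplicaSearchSketch
  # Hypothetical stand-ins for the proposed models.
  Index = Struct.new(:id, :state, keyword_init: true)

  Replica = Struct.new(:id, :indices, keyword_init: true) do
    # Ready only when the replica has indices and all of them are ready.
    def ready?
      indices.any? && indices.all? { |index| index.state == :ready }
    end
  end

  # Returns the ids of the indices a group-level search should target,
  # or an empty array when no replica is ready yet.
  def self.searchable_index_ids(replicas)
    replica = replicas.find(&:ready?)
    return [] unless replica

    replica.indices.map(&:id)
  end
end

replicas = [
  ReplicaSearchSketch::Replica.new(id: 1, indices: [
    ReplicaSearchSketch::Index.new(id: 1, state: :ready),
    ReplicaSearchSketch::Index.new(id: 2, state: :pending)
  ]),
  ReplicaSearchSketch::Replica.new(id: 2, indices: [
    ReplicaSearchSketch::Index.new(id: 3, state: :ready)
  ])
]

# Replica 1 is skipped because index 2 is still pending.
target_ids = ReplicaSearchSketch.searchable_index_ids(replicas)
```

Returning an empty list when nothing is ready leaves the fallback decision (for example, using another search backend) to the caller. That choice is an assumption of this sketch, not something the proposal specifies.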
Edited by Ravi Kumar