Zoekt: sharding strategy

Background

Currently, we can't index some large namespaces because their repository data doesn't fit on a single Zoekt node.

Proposal

We can add a Zoekt::Replica layer to connect Zoekt::Index with Zoekt::EnabledNamespace. Each replica can have many indices, and each index will point to repositories of a large namespace. Zoekt::Replica will have an enum field, state, with the values ready and pending. A replica is considered ready only when all of its indices are in the ready state.
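The readiness rule above can be sketched in plain Ruby. This is a minimal illustration, not the actual ActiveRecord models: the Index and Replica structs, their fields, and the sample data are assumptions made for the example. The proposal stores state as an enum field; here it is derived on the fly to keep the sketch short. Treating a replica with no indices as pending is also an assumption.

```ruby
# Hypothetical plain-Ruby stand-ins for Zoekt::Index and Zoekt::Replica.
Index = Struct.new(:id, :state, :repositories, keyword_init: true) do
  def ready?
    state == :ready
  end
end

Replica = Struct.new(:id, :indices, keyword_init: true) do
  # A replica is ready only when every one of its indices is ready.
  # Assumption: a replica without indices stays pending.
  def state
    (!indices.empty? && indices.all?(&:ready?)) ? :ready : :pending
  end

  def ready?
    state == :ready
  end
end

# Replica 1 splits the namespace across two indices, both ready.
replica_1 = Replica.new(id: 1, indices: [
  Index.new(id: 1, state: :ready, repositories: %w[A]),
  Index.new(id: 2, state: :ready, repositories: %w[B C])
])

# Replica 2 holds everything in one index that is still being built.
replica_2 = Replica.new(id: 2, indices: [
  Index.new(id: 3, state: :pending, repositories: %w[A B C])
])

puts "replica 1: #{replica_1.state}"
puts "replica 2: #{replica_2.state}"
```

In the real models the same rule would more likely be enforced when an index changes state, so that the stored enum on the replica stays consistent with its indices.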

flowchart TD
    E[Zoekt::EnabledNamespace] --> R1["Zoekt::Replica(id: 1, state: :ready)"]
    E --> R2["Zoekt::Replica(id: 2, state: :ready)"]
    R1 --> I1("Zoekt::Index(id: 1)")
    R1 --> I2("Zoekt::Index(id: 2)")
    R1 --> I4("Zoekt::Index(id: 20) reallocation")
    R2 --> I3("Zoekt::Index(id: 3)")

    I1 --> PA("Zoekt::Repository A")
    I2 --> PB("Zoekt::Repository B")
    I2 --> PC("Zoekt::Repository C")

    I3 --> PA2("Zoekt::Repository A")
    I3 --> PB2("Zoekt::Repository B")
    I3 --> PC2("Zoekt::Repository C")

    I4 --> PB3("Zoekt::Repository B")
    I4 --> PC3("Zoekt::Repository C")

Later, we're going to introduce replicas and start having different replica indices for the same enabled namespace. Each index of a replica will

High-level changes required:

Update Search::Zoekt::SchedulingService to handle/assign replicated indices properly
Update the search code. Every time we perform a group-level search, we'll need to query all replicas, for example zoekt_enabled_namespace.replicas.ready.first (we can reuse some logic from !147077 (merged)).
Most likely, we'll need to update the indexing code as well
TBD
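One possible reading of the search change listed above is: pick the first ready replica, query every index it owns, and merge the results. The sketch below illustrates that reading in plain Ruby. The SearchableIndex and SearchableReplica structs, the group_search method, and the in-memory documents are all hypothetical; the real code would go through ActiveRecord scopes such as zoekt_enabled_namespace.replicas.ready.first and send queries to Zoekt nodes.

```ruby
# Hypothetical stand-ins; documents are plain strings for illustration only.
SearchableIndex = Struct.new(:id, :state, :documents, keyword_init: true)

SearchableReplica = Struct.new(:id, :indices, keyword_init: true) do
  def ready?
    !indices.empty? && indices.all? { |index| index.state == :ready }
  end
end

# Picks the first ready replica and merges matches from all of its indices.
# Returns nil when no replica is ready so the caller can choose a fallback.
def group_search(replicas, query)
  replica = replicas.find(&:ready?)
  return nil unless replica

  replica.indices.flat_map do |index|
    index.documents.select { |doc| doc.include?(query) }
  end
end

pending_replica = SearchableReplica.new(id: 2, indices: [
  SearchableIndex.new(id: 3, state: :pending, documents: ["A: shard plan"])
])

ready_replica = SearchableReplica.new(id: 1, indices: [
  SearchableIndex.new(id: 1, state: :ready, documents: ["A: shard plan", "A: readme"]),
  SearchableIndex.new(id: 2, state: :ready, documents: ["B: shard notes", "C: changelog"])
])

replicas = [pending_replica, ready_replica]
results = group_search(replicas, "shard")
puts results.inspect
```

Because a single replica's indices together cover the whole namespace, a search has to fan out to all of them; querying only one index would silently miss repositories.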

Edited Jun 04, 2024 by Ravi Kumar