Zoekt Sharding and Replication
Problem
Without implementing replication in some way, we have no way to do zero-downtime deployments of our Zoekt infrastructure, and we have no resilience against failure.
Proposal
This is described in https://docs.gitlab.com/ee/architecture/blueprints/code_search_with_zoekt/#sharding-and-replication-strategy .
There was some good discussion at !107891 (comment 1264814513). In particular, we might consider having the Zoekt shards replicate between themselves at the disk level; they would only need to replicate the .zoekt index files rather than the bare repos. This lowers CPU usage on Zoekt nodes, load on Gitaly, and disk usage on Zoekt nodes. It does, however, increase the difficulty of failing over from the primary to the replica, since the replica won't have any bare repos.
This could be combined, however, with the idea of purging inactive bare repos, because we could treat the bare repos as just a cache: if they are present, they save re-cloning the whole repo. Failing over to a replica then simply means the cache of bare repos starts out empty.
There are also a lot of good ideas in &8903 (comment 1196736579) that may be applicable.
Kubernetes changes
This will also likely require changes to the Helm chart, since it was never designed to logically support multiple replicas: we only have one Ingress, and individual replicas need to be individually addressable. We'll need to resolve this somehow.
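One common Kubernetes pattern for making replicas individually addressable is a headless Service over a StatefulSet, which gives each pod a stable per-pod DNS name. The sketch below is illustrative only; it assumes the chart deploys Zoekt as a StatefulSet labelled `app: gitlab-zoekt`, and 6070 is zoekt-webserver's default port.

```yaml
# Illustrative only: a headless Service (clusterIP: None) creates
# per-pod DNS records such as
#   gitlab-zoekt-0.gitlab-zoekt.<namespace>.svc
# so each replica can be addressed directly, bypassing load balancing.
apiVersion: v1
kind: Service
metadata:
  name: gitlab-zoekt
spec:
  clusterIP: None   # headless: no virtual IP, per-pod DNS instead
  selector:
    app: gitlab-zoekt
  ports:
    - name: webserver
      port: 6070
```

This sidesteps the single-Ingress limitation for intra-cluster traffic, though external addressability would still need its own solution.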
Also see gitlab-org/cloud-native/charts/gitlab-zoekt!1 (comment 1297788474) and gitlab-org/cloud-native/charts/gitlab-zoekt!1 (comment 1294116525)
Rolling/Blue-Green deployments
As part of this change (or extracted to a separate issue, depending on whether we get it "for free") we should have rolling deployments such that at least one readable replica is available for any search at any time. Since we are fairly resilient to write downtime (i.e. retries in Sidekiq), we should be OK if there is only one primary that gets restarted and fails writes for a minute or so.
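If the replicas end up running in Kubernetes, the "always at least one readable replica" constraint can be expressed declaratively. This is a hedged sketch, assuming a StatefulSet labelled `app: gitlab-zoekt`; the names are illustrative.

```yaml
# Illustrative only: a PodDisruptionBudget keeps at least one Zoekt
# replica serving reads during voluntary disruptions such as node
# drains and rolling upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gitlab-zoekt
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gitlab-zoekt
```

A PodDisruptionBudget only covers voluntary disruptions; the search routing layer would still need to skip replicas that are down for other reasons.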
Cleanup of past partial implementation
In the GitLab codebase we have already half-implemented support for multiple shards with the zoekt_shards table. Our implementation of sharding/replication will likely end up outside the Rails application, and this database model may no longer make sense. At that point we should clean up the table in the DB somehow so it doesn't imply that we support multiple shards. Additionally, we should consider whether using the database to store this data is even appropriate anymore. Many other services are configured in config/gitlab.yml, and this allows us to automatically wire things together in the GitLab chart and Omnibus (for example, basic auth in gitlab-org/charts/gitlab!3184 (merged)). If we keep leaning in the direction of deploying and automatically wiring GitLab -> Zoekt, then the DB modelling will just get in the way and we may as well move it all to config/gitlab.yml.
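For concreteness, a config/gitlab.yml entry might look something like the sketch below. This is entirely hypothetical: none of these keys exist today, and the node names and ports are placeholders.

```yaml
# Hypothetical sketch of Zoekt configuration in config/gitlab.yml,
# replacing the zoekt_shards DB table. Every key here is invented
# for illustration.
production:
  zoekt:
    enabled: true
    nodes:
      - search_url: http://gitlab-zoekt-0.gitlab-zoekt.svc:6070
        index_url: http://gitlab-zoekt-0.gitlab-zoekt.svc:6060
      - search_url: http://gitlab-zoekt-1.gitlab-zoekt.svc:6070
        index_url: http://gitlab-zoekt-1.gitlab-zoekt.svc:6060
```

Static config like this is easy for the chart and Omnibus to template, but it does mean topology changes require a reconfigure rather than a DB update, which is a trade-off worth weighing.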
More reading
Horizontally Scaling Zoekt in Sourcegraph
- Sharding (non-HA)
Gitaly Plan
Gitaly is already planning to rebuild their replication on top of Raft. Based on my reading, we might save ourselves a lot of pain by following a similar approach rather than trying to implement a distributed system from scratch. Reading the following in order may illuminate how the Gitaly plan works and inspire how it might fit our needs:
I think we will have needs very analogous to Gitaly's. We will have leaders and replicas for our data, we will update repository data and replicate files similarly to Gitaly, and we'll need ways to track the current state of replicas.
Elasticsearch Consensus/Replication
- https://www.elastic.co/elasticon/conf/2017/sf/consensus-and-replication-in-elasticsearch (slides https://speakerdeck.com/elastic/consensus-and-replication-in-elasticsearch )
- https://github.com/elastic/elasticsearch/issues/10708
- https://github.com/elastic/elasticsearch-formal-models
Elasticsearch uses:
- Quorum based algorithm for cluster state
- Primary/backup for data
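The quorum-based approach above rests on a simple invariant: decisions require a strict majority of voting members, so any two quorums overlap. A minimal sketch of that majority calculation (the same invariant underlies Raft):

```go
package main

import "fmt"

// quorum returns the minimum number of votes that constitutes a
// strict majority among n voting members. Any two sets of this size
// must overlap, which is what prevents split-brain decisions in
// quorum-based cluster-state algorithms.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("%d nodes -> quorum of %d\n", n, quorum(n))
	}
}
```

Note that 4 nodes need 3 votes, the same fault tolerance as 3 nodes needing 2, which is why odd-sized voting memberships are the usual recommendation.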