Zoekt Sharding and Replication
Problem
Without implementing replication in some way, we have no way to do zero-downtime deployments of our Zoekt infrastructure, and we have no resilience against failure.
Proposal
This is described in https://docs.gitlab.com/ee/architecture/blueprints/code_search_with_zoekt/#sharding-and-replication-strategy .
There was some good discussion at !107891 (comment 1264814513). In particular, we might consider having the Zoekt shards replicate between themselves at the disk level; they would only need to replicate the .zoekt index files rather than the bare repos. This lowers CPU usage on Zoekt nodes, load on Gitaly, and disk usage on Zoekt nodes. It does, however, increase the difficulty of failing over from the primary to the replica, since the replica won't have any bare repos.
This could be combined, however, with the idea of purging inactive bare repos, because we could treat the bare repos as just a cache: if they are present, they save re-cloning the whole repo. Failing over to a replica then simply means the cache of bare repos starts out empty.
There are also a lot of good ideas in &8903 (comment 1196736579) that may be applicable.
Kubernetes changes
This will also likely require changes to the Helm chart, since it was never designed to logically support multiple replicas: we only have one Ingress, and individual replicas need to be individually addressable. We'll need to resolve this somehow.
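One common Kubernetes pattern for making replicas individually addressable is a headless Service over a StatefulSet, which gives each pod a stable per-pod DNS name. The sketch below is illustrative only; it assumes the chart deploys Zoekt as a StatefulSet labelled `app: gitlab-zoekt`, and 6070 is zoekt-webserver's default port.

```yaml
# Illustrative only: a headless Service (clusterIP: None) creates
# per-pod DNS records such as
#   gitlab-zoekt-0.gitlab-zoekt.<namespace>.svc
# so each replica can be addressed directly, bypassing load balancing.
apiVersion: v1
kind: Service
metadata:
  name: gitlab-zoekt
spec:
  clusterIP: None   # headless: no virtual IP, per-pod DNS instead
  selector:
    app: gitlab-zoekt
  ports:
    - name: webserver
      port: 6070
```

This sidesteps the single-Ingress limitation for intra-cluster traffic, though external addressability would still need its own solution.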
Also see gitlab-org/cloud-native/charts/gitlab-zoekt!1 (comment 1297788474) and gitlab-org/cloud-native/charts/gitlab-zoekt!1 (comment 1294116525)
Rolling/Blue-Green deployments
As part of this change (or extracted to a separate issue, depending on whether we get it "for free") we should have rolling deployments such that at least one readable replica is available for any search at any time. Since we are fairly resilient to write downtime (i.e. retries in Sidekiq), we should be OK if there is only one primary that gets restarted and fails writes for a minute or so.
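If the replicas end up running in Kubernetes, the "always at least one readable replica" constraint can be expressed declaratively. This is a hedged sketch, assuming a StatefulSet labelled `app: gitlab-zoekt`; the names are illustrative.

```yaml
# Illustrative only: a PodDisruptionBudget keeps at least one Zoekt
# replica serving reads during voluntary disruptions such as node
# drains and rolling upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gitlab-zoekt
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gitlab-zoekt
```

A PodDisruptionBudget only covers voluntary disruptions; the search routing layer would still need to skip replicas that are down for other reasons.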
Cleanup of past partial implementation
In the GitLab codebase we have already half-implemented support for multiple shards with the zoekt_shards table. Our implementation of sharding/replication will likely end up outside the Rails application, and this database model may no longer make sense. At that point we should clean up the table in the DB somehow so it doesn't imply that we support multiple shards. Additionally, we should consider whether using the database to store this data is even appropriate anymore. Many other services are configured in config/gitlab.yml, and this allows us to automatically wire things together in the GitLab chart and Omnibus (for example, basic auth in gitlab-org/charts/gitlab!3184 (merged)). If we keep leaning in the direction of deploying and automatically wiring GitLab -> Zoekt, then the DB modelling will just get in the way and we may as well move it all to config/gitlab.yml.
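For concreteness, a config/gitlab.yml entry might look something like the sketch below. This is entirely hypothetical: none of these keys exist today, and the node names and ports are placeholders.

```yaml
# Hypothetical sketch of Zoekt configuration in config/gitlab.yml,
# replacing the zoekt_shards DB table. Every key here is invented
# for illustration.
production:
  zoekt:
    enabled: true
    nodes:
      - search_url: http://gitlab-zoekt-0.gitlab-zoekt.svc:6070
        index_url: http://gitlab-zoekt-0.gitlab-zoekt.svc:6060
      - search_url: http://gitlab-zoekt-1.gitlab-zoekt.svc:6070
        index_url: http://gitlab-zoekt-1.gitlab-zoekt.svc:6060
```

Static config like this is easy for the chart and Omnibus to template, but it does mean topology changes require a reconfigure rather than a DB update, which is a trade-off worth weighing.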
More reading
Horizontally Scaling Zoekt in Sourcegraph
- Sharding (non-HA)
Gitaly Plan
Gitaly is already planning to rebuild their replication on top of Raft. Based on my reading, we might save ourselves a lot of pain by following a similar approach rather than trying to implement a distributed system from scratch. Reading the following in order may illuminate how the Gitaly plan works and inspire how it might fit our needs:
I think we will have needs very analogous to Gitaly's. We will have leaders and replicas for our data, we will update repository data and replicate files similarly to Gitaly, and we'll need ways to track the current state of replicas.
Elasticsearch Consensus/Replication
- https://www.elastic.co/elasticon/conf/2017/sf/consensus-and-replication-in-elasticsearch (slides https://speakerdeck.com/elastic/consensus-and-replication-in-elasticsearch )
- https://github.com/elastic/elasticsearch/issues/10708
- https://github.com/elastic/elasticsearch-formal-models
Elasticsearch uses:
- Quorum based algorithm for cluster state
- Primary/backup for data
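The quorum-based approach above rests on a simple invariant: decisions require a strict majority of voting members, so any two quorums overlap. A minimal sketch of that majority calculation (the same invariant underlies Raft):

```go
package main

import "fmt"

// quorum returns the minimum number of votes that constitutes a
// strict majority among n voting members. Any two sets of this size
// must overlap, which is what prevents split-brain decisions in
// quorum-based cluster-state algorithms.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("%d nodes -> quorum of %d\n", n, quorum(n))
	}
}
```

Note that 4 nodes need 3 votes, the same fault tolerance as 3 nodes needing 2, which is why odd-sized voting memberships are the usual recommendation.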