Zero-downtime upgrades in Gitaly
What this is about: https://docs.gitlab.com/ee/update/zero_downtime.html
## Current state (2023.04.13)
- Praefect restarts are bridged by gRPC-level _client retries_, provided that service discovery (https://docs.gitlab.com/ee/administration/gitaly/praefect.html#service-discovery) is configured.
- Gitaly restarts are not bridged by retries from Praefect. Instead, they rely on starting a new Gitaly process alongside the old one and using `tableflip` to take over its sockets.
- Because this takeover only works on the same host and Praefect does not retry, taking down the Gitaly VM (for a Gitaly upgrade or an OS upgrade) causes downtime, even in Gitaly Cluster.
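
The socket takeover described above can be illustrated with the Go standard library alone. This is a minimal sketch, not `tableflip`'s API and not Gitaly's code: `handOver` and `demoHandover` are hypothetical names, and the handover happens inside a single process for brevity, whereas a real upgrade would pass the duplicated descriptor to a newly started process. It assumes a Unix-like host.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// handOver returns a second listener backed by the same kernel socket as old.
// In a real upgrade the duplicated file would be passed to the new process
// (for example via exec.Cmd.ExtraFiles) instead of being reused in-process.
func handOver(old *net.TCPListener) (net.Listener, error) {
	f, err := old.File() // duplicates the listening file descriptor
	if err != nil {
		return nil, err
	}
	defer f.Close() // net.FileListener keeps its own duplicate
	return net.FileListener(f)
}

// demoHandover opens a listener, hands it over, closes the original, and
// checks that a client can still reach the same address.
func demoHandover() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	next, err := handOver(ln.(*net.TCPListener))
	if err != nil {
		ln.Close()
		return "", err
	}
	defer next.Close()
	ln.Close() // the "old process" goes away; the socket itself stays open

	// The kernel queues the connection, so Dial succeeds before Accept runs.
	client, err := net.Dial("tcp", next.Addr().String())
	if err != nil {
		return "", err
	}
	if _, err := client.Write([]byte("ping")); err != nil {
		client.Close()
		return "", err
	}
	client.Close()

	conn, err := next.Accept()
	if err != nil {
		return "", err
	}
	defer conn.Close()
	data, err := io.ReadAll(conn)
	return string(data), err
}

func main() {
	got, err := demoHandover()
	fmt.Println(got, err)
}
```

The sketch also shows the limitation noted above: the descriptor only exists on the host that opened it, so this kind of handover cannot bridge the VM itself going away.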
## Planned state
- With Raft (https://gitlab.com/groups/gitlab-org/-/epics/8903, https://about.gitlab.com/direction/gitaly/#1-year-plan), there will be no Praefect; Gitaly nodes will coordinate replication among themselves. Clients will therefore retry against the next available Gitaly node.
- Uninterrupted service will require that more than one node is able to serve a given repository.
- Even so, some connections are running for a long time, for example if the client bandwidth is limited. Discussions about whether and how these can be interrupted are ongoing.
- We will investigate a new solution to "happy path" restarts. `tableflip` is costly to maintain, only covers part of the use cases, and is often mistaken for a full-on solution (setting the wrong expectations).
- It also doesn't work in Kubernetes (as `tableflip` assumes it can restart the process).
- The current direction is removing `tableflip` while maintaining no user-visible errors.
- We have no plans to backport retries to Praefect at this time, as doing so would be prohibitively costly.
- The main open question is how long we should wait before cutting long-running connections. This is discussed in https://gitlab.com/gitlab-org/gitaly/-/issues/4934
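
The planned client behaviour (retry against the next available node, which only helps if more than one node can serve the repository) can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `callWithFailover`, `callFunc`, `errUnavailable`, `fakeCluster`, and the node names are all hypothetical, and a real client would also have to decide which RPCs are safe to retry.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable marks a failure that is safe to retry on another node,
// for example a node that is restarting. Hypothetical for this sketch.
var errUnavailable = errors.New("node unavailable")

// callFunc stands in for a single RPC attempt against one node.
type callFunc func(node string) (string, error)

// callWithFailover tries each node in order and returns the first success.
// Only errUnavailable triggers a retry; other errors are returned as-is.
func callWithFailover(nodes []string, call callFunc) (string, error) {
	lastErr := errors.New("no nodes configured")
	for _, node := range nodes {
		res, err := call(node)
		if err == nil {
			return res, nil
		}
		if !errors.Is(err, errUnavailable) {
			return "", err
		}
		lastErr = fmt.Errorf("%s: %w", node, err)
	}
	return "", fmt.Errorf("all %d candidate nodes failed: %w", len(nodes), lastErr)
}

// fakeCluster returns a callFunc in which the listed nodes are down.
func fakeCluster(down ...string) callFunc {
	isDown := make(map[string]bool)
	for _, n := range down {
		isDown[n] = true
	}
	return func(node string) (string, error) {
		if isDown[node] {
			return "", errUnavailable
		}
		return "served by " + node, nil
	}
}

func main() {
	nodes := []string{"gitaly-1", "gitaly-2"}
	// Two replicas: the restart of gitaly-1 is bridged by gitaly-2.
	fmt.Println(callWithFailover(nodes, fakeCluster("gitaly-1")))
	// One replica: there is nothing to fail over to, so the call fails.
	fmt.Println(callWithFailover(nodes[:1], fakeCluster("gitaly-1")))
}
```

The second call in `main` is the single-replica case: with only one node able to serve the repository, a restart of that node still surfaces as an error.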
## Next steps
- Define a policy for interrupting ongoing connections in https://gitlab.com/gitlab-org/gitaly/-/issues/4934
- Make Gitaly nodes able to shut down cleanly: https://gitlab.com/gitlab-org/gitaly/-/issues/4747
- Continue working towards Raft via https://gitlab.com/groups/gitlab-org/-/epics/8903, in particular the https://gitlab.com/groups/gitlab-org/-/epics/8911 epic.
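
To make the first two steps concrete, a clean shutdown with a deadline for long-running connections could look roughly like the sketch below. It is an assumption-laden illustration using only the standard library, not Gitaly's shutdown code: `drainer`, `track`, `shutdown`, and `demoDrain` are hypothetical names, and the grace period is exactly the policy value that is still under discussion. With grpc-go, a comparable pattern is to call `GracefulStop` and fall back to `Stop` once a timeout expires.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// drainer tracks in-flight requests so that a shutdown can wait for them.
type drainer struct {
	wg     sync.WaitGroup
	ctx    context.Context
	cancel context.CancelFunc
}

func newDrainer() *drainer {
	ctx, cancel := context.WithCancel(context.Background())
	return &drainer{ctx: ctx, cancel: cancel}
}

// track runs handler as an in-flight request. The caller must stop calling
// track (stop accepting new requests) before calling shutdown.
func (d *drainer) track(handler func(ctx context.Context)) {
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		handler(d.ctx)
	}()
}

// shutdown waits up to grace for in-flight requests to finish. It reports
// true if everything drained in time. Otherwise it cancels the remaining
// requests, waits for them to return, and reports false.
func (d *drainer) shutdown(grace time.Duration) bool {
	done := make(chan struct{})
	go func() {
		d.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		d.cancel()
		return true
	case <-time.After(grace):
		d.cancel() // the deadline passed: interrupt long-running requests
		<-done
		return false
	}
}

// demoDrain starts one request lasting workMs and shuts down with graceMs.
func demoDrain(workMs, graceMs int) bool {
	d := newDrainer()
	d.track(func(ctx context.Context) {
		select {
		case <-time.After(time.Duration(workMs) * time.Millisecond):
		case <-ctx.Done():
		}
	})
	return d.shutdown(time.Duration(graceMs) * time.Millisecond)
}

func main() {
	fmt.Println("short request drained cleanly:", demoDrain(10, 500))
	fmt.Println("long request drained cleanly:", demoDrain(5000, 20))
}
```

A request that finishes within the grace period is never interrupted; one that outlives it is cancelled, which is the user-visible trade-off the policy has to settle.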