# Zero-downtime upgrades in Gitaly
What this is about: https://docs.gitlab.com/ee/update/zero_downtime.html

## Current state (2023.04.13)

- Praefect restarts are bridged through gRPC-level _client retries_ if service discovery (https://docs.gitlab.com/ee/administration/gitaly/praefect.html#service-discovery) is configured.
- Gitaly restarts are not bridged by retries from Praefect. Instead, they rely on starting a new Gitaly process alongside the old one and using `tableflip` to take over its sockets.
- The latter means that taking down the Gitaly VM for upgrades (or for OS upgrades) causes downtime, even in Cluster.

## Planned state

- With Raft (https://gitlab.com/groups/gitlab-org/-/epics/8903, https://about.gitlab.com/direction/gitaly/#1-year-plan) there will be no Praefect; Gitaly nodes will coordinate replication themselves. Clients will therefore retry against the next available Gitaly node.
- Uninterrupted functionality will require more than one node being able to serve a given repository.
- Even so, some connections run for a long time, for example if the client bandwidth is limited. Discussions about whether and how these can be interrupted are ongoing.
- We will investigate a new solution to "happy path" restarts. `tableflip` is costly to maintain, covers only part of the use cases, and is often mistaken for a complete solution, which sets the wrong expectations.
- It also doesn't work in Kubernetes, as `tableflip` assumes it can restart the process.
- The current direction is to remove `tableflip` while still causing no user-visible errors.
- We have no plans to backport retries to Praefect at this time, as it would be prohibitively costly to do so.
- The main open question is how long we should wait before cutting long-running connections. Discussion in https://gitlab.com/gitlab-org/gitaly/-/issues/4934

## Next steps

- Define a policy on interrupting ongoing connections in https://gitlab.com/gitlab-org/gitaly/-/issues/4934
- Make Gitaly nodes able to shut down cleanly: https://gitlab.com/gitlab-org/gitaly/-/issues/4747
- Continue working towards Raft via https://gitlab.com/groups/gitlab-org/-/epics/8903 and in particular the epic https://gitlab.com/groups/gitlab-org/-/epics/8911
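
To make the open question about long-running connections concrete, here is a minimal, language-neutral sketch of a "drain with a grace period" shutdown: stop accepting new requests, wait a bounded time for in-flight ones, then cancel whatever is left. It is a toy model in Python, not Gitaly code. `DrainingServer`, `handle`, and `grace_seconds` are hypothetical names, and the actual policy is still being discussed in the issue linked above.

```python
import threading
import time


class DrainingServer:
    """Toy model (hypothetical) of a server that drains requests before stopping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._accepting = True
        # Set once the grace period has expired; handlers poll it to stop early.
        self.cancelled = threading.Event()

    def begin_request(self):
        """Register a new request; refuse it once shutdown has started."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        """Mark a request as finished and wake up a waiting shutdown."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def shutdown(self, grace_seconds):
        """Stop accepting work, wait up to grace_seconds, then force-cancel.

        Returns True if every request finished on its own, False if the
        grace period expired and the remaining requests were cancelled.
        """
        with self._idle:
            self._accepting = False
            drained = self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=grace_seconds
            )
        if not drained:
            self.cancelled.set()
        return drained


def handle(server, work_seconds):
    """Simulated long-running request that honours cancellation."""
    if not server.begin_request():
        return "rejected"
    try:
        # Event.wait returns True if cancellation arrived before the work ended.
        if server.cancelled.wait(timeout=work_seconds):
            return "cancelled"
        return "completed"
    finally:
        server.end_request()


if __name__ == "__main__":
    server = DrainingServer()
    results = []
    # One short request and one that outlives the grace period.
    threads = [
        threading.Thread(target=lambda s=s: results.append(handle(server, s)))
        for s in (0.05, 5.0)
    ]
    for t in threads:
        t.start()
    time.sleep(0.01)  # let both requests register before shutting down
    print("drained cleanly:", server.shutdown(grace_seconds=0.2))
    for t in threads:
        t.join()
    print(sorted(results))
```

The single `grace_seconds` knob is exactly the value the policy discussion has to settle. A short grace period makes upgrades fast but cuts slow clients, while a long one delays the upgrade. In the planned Raft setup, a request that is rejected or cancelled would be retried by the client against another node that can serve the repository.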