Define what is meant by zero-downtime upgrades
As part of our upgrade documentation we offer the option of upgrading with zero downtime. Going by the name alone, this gives the strong impression that a user should be able to use GitLab without issue during an upgrade. One example: a user cloning a large repository over several minutes suddenly fails; they retry and it succeeds. Is this an acceptable blip, or should a user see no issues throughout the process?
We also call out Redis as having minimal downtime. Could we define what is meant by "minimal" and how it could impact a user? Similarly for the other node types, it would help to define what is expected while they are updating. If a user is cloning a large repository and the primary Gitaly Cluster node is updated, is the clone expected to finish without issue?
When using Puma, a hot reload is no longer possible, so updating a node causes downtime on that node. The docs call out that each Rails node should be removed from the load balancer before updating and added back before moving on to the next. While looking at automating this, as our customers might, the best approach we found was to have the load balancer remove the node automatically on a failed health check. With this approach there will always be a small window between checks where a request could be sent to a node currently being updated; however, a single retry will send it to a different node.
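To illustrate the health-check approach, here is a minimal sketch of an HAProxy backend that drops a Rails node from rotation when its health check fails and redispatches failed connections to another node. The backend name, server names, and addresses are hypothetical; the check path assumes GitLab's `/-/readiness` endpoint.

```haproxy
# Hypothetical backend for the GitLab Rails nodes.
backend gitlab_rails
    balance roundrobin
    # Probe the readiness endpoint; a node failing two consecutive
    # checks is removed (fall 2) and needs two passes to return (rise 2).
    option httpchk GET /-/readiness
    # If a connection fails against a server that just went down,
    # retry it against a different server instead of returning an error.
    option redispatch
    retries 3
    server rails1 10.0.0.11:8080 check inter 5s fall 2 rise 2
    server rails2 10.0.0.12:8080 check inter 5s fall 2 rise 2
```

Even with this in place, the window between checks means some in-flight requests can still hit a node mid-update, which is the behaviour described above.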
It would be beneficial to get a better understanding of what is expected during a zero-downtime upgrade, to help define what is a bug and what is acceptable downtime.