Zero downtime upgrades broke between GitLab 16.8 and 16.9
Problem Description
We advertise support for zero downtime upgrades through a couple of channels. Customers leverage this for large installations, and the Dedicated product leverages it to minimize downtime as much as possible during maintenance windows. This is managed using the GitLab Environment Toolkit. While Gitaly restarts are very quick, we may not restart Gitaly immediately after a new package is installed. Supporting Zero Downtime Upgrades requires preventing the automatic reconfigure from running. As a result, when a package is installed, binaries that the currently running version of Gitaly relies on may disappear. Gitaly is then unable to operate, which leads to customer-facing HTTP 500 errors. This appears to be a problem with the way the Gitaly Wrapper links binaries together and may be specific to upgrades in which the version of Git is updated. Please refer to the investigation linked below, which covers a recreation of the scenario after an upgrade to GitLab.
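To illustrate the failure mode, here is a minimal Python sketch of a check that could run after a package install and before Gitaly is restarted. It reports which of a set of binaries are no longer present and executable on disk. The function name `missing_binaries` and the example path are hypothetical. The actual paths the running Gitaly depends on are not listed in this issue and would need to come from the linked investigation.

```python
import os


def missing_binaries(paths):
    """Return the paths that are no longer present as executable files.

    `paths` is a caller-supplied list of binaries the running service is
    assumed to depend on (hypothetical input, not taken from Gitaly itself).
    """
    return [
        path
        for path in paths
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    ]


if __name__ == "__main__":
    # Placeholder path for illustration only.
    candidates = ["/path/to/a/binary/gitaly/depends/on"]
    for path in missing_binaries(candidates):
        print(f"missing or not executable: {path}")
```

A non-empty result between package install and Gitaly restart would be consistent with the behaviour described above, although it would not by itself confirm the root cause.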
Original Issue: gitlab-com/gl-infra/delivery#20115 (closed)
Investigation that led to Gitaly as Root Cause: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4450#note_1839796911
Documentation specific to Gitaly with Zero Downtime Upgrades: https://docs.gitlab.com/ee/update/zero_downtime.html#gitaly
This appears to be similar to a Production Incident back in version 14 of GitLab: gitlab-com/gl-infra/production#4810 (closed)
Related Epic: &6155