Zero-downtime update of a Gitaly cluster on a secondary Geo site causes read-only repositories
Summary
When a Gitaly cluster on a secondary Geo site is updated using an automated implementation of the process in https://docs.gitlab.com/omnibus/update/README.html#gitaly-cluster, the result is read-only repositories. The repositories on the secondary Geo site become read-only because a primary node switchover sometimes occurs during the upgrade.
Steps to reproduce
As far as I can tell, we are following the process as described. We have a primary and a secondary Geo site (MAIN and EU-MAIN respectively). Each site has three Praefect nodes (*-Praefect-A, *-Praefect-B and *-Praefect-C) and three Gitaly nodes (*-Gitaly-A, *-Gitaly-B and *-Gitaly-C), where * is MAIN or EU-MAIN respectively. The nodes are configured as follows (a verification sketch follows this list):

- All Praefect and Gitaly servers have the /etc/gitlab/skip-autoreconfigure file in place.
- The Praefect-A servers have `praefect['auto_migrate'] = true`; the Praefect-B and Praefect-C servers have `praefect['auto_migrate'] = false`.
- All three Gitaly nodes have `gitlab_rails['auto_migrate'] = false` and `gitlab_rails['rake_cache_clear'] = false`.
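For reference, a minimal sketch of how these per-role settings could be verified on each node before an update; the commands and the checked file path are assumptions based on the description above, not part of the original report.

```shell
# Hypothetical verification sketch; expected values mirror the node roles described above.

# On MAIN-Praefect-A and EU-MAIN-Praefect-A (expect: true):
sudo grep "praefect\['auto_migrate'\]" /etc/gitlab/gitlab.rb

# On the *-Praefect-B and *-Praefect-C nodes (expect: false):
sudo grep "praefect\['auto_migrate'\]" /etc/gitlab/gitlab.rb

# On all Gitaly nodes (expect: false for both settings):
sudo grep -E "gitlab_rails\['(auto_migrate|rake_cache_clear)'\]" /etc/gitlab/gitlab.rb

# On every Praefect and Gitaly node, the skip file should be present:
ls -l /etc/gitlab/skip-auto*
```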
We update each of the Gitaly and Praefect servers one at a time, in order. An update consists of running `sudo yum install gitlab-ee-13.9.1-ee.0.el7.x86_64` (changing the version accordingly), followed by `sudo gitlab-ctl reconfigure`. These steps are automated with Ansible, which runs in serial mode to ensure only one server is updated at a time. Because we are using Ansible, the server order is not guaranteed except that Gitaly servers are updated before Praefect servers, so Gitaly-B might be updated before Gitaly-A, Praefect-C before Praefect-A, and so on. A sketch of the per-node step is shown below.
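For clarity, the per-node update step that the Ansible play runs serially looks roughly like this; the package version is just the example quoted above.

```shell
# Run on one Gitaly or Praefect node at a time (Ansible serial mode enforces this).
sudo yum install gitlab-ee-13.9.1-ee.0.el7.x86_64   # substitute the target version
sudo gitlab-ctl reconfigure
```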
What is the current bug behavior?
Updates cause a failover, which results in read-only repositories.
What is the expected correct behavior?
Updates should not cause a failover, and there should be no downtime.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
For installations with the omnibus-gitlab package, run and paste the output of `sudo gitlab-rake gitlab:env:info`. For installations from source, run and paste the output of `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`.
Results of GitLab application Check
For installations with the omnibus-gitlab package, run and paste the output of `sudo gitlab-rake gitlab:check SANITIZE=true`. For installations from source, run and paste the output of `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`. (We will only investigate if the tests are passing.)
Possible fixes
The current workaround is to pause Geo replication via `sudo gitlab-rake geo:replication:pause`, but there is a related issue because DB replication is not paused for external databases; see #324172. A sketch of the workaround sequence follows.
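As a rough sketch, the sequence on the secondary Geo site might look like the following. The `geo:replication:resume` task is assumed here to be the counterpart of the pause task, and wrapping the Gitaly cluster update between the two is our interpretation rather than a documented procedure.

```shell
# On the secondary Geo site, before starting the Gitaly cluster update:
sudo gitlab-rake geo:replication:pause

# ... update the Gitaly and Praefect nodes one at a time as described above ...

# Once all nodes on the secondary site have been updated and reconfigured:
sudo gitlab-rake geo:replication:resume
```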