Zero-downtime update of a Gitaly cluster on a secondary Geo site causes read-only repositories
Summary
When a Gitaly cluster on a secondary Geo site is updated using an automated implementation of the process in https://docs.gitlab.com/omnibus/update/README.html#gitaly-cluster, the result is read-only repositories. The repositories on the secondary Geo site become read-only because a primary node switchover sometimes occurs during the upgrade.
Steps to reproduce
As far as I can tell, we are following the process as described. We have a primary and a secondary Geo site (MAIN and EU-MAIN respectively). Each site has three Praefect nodes (*-Praefect-A, *-Praefect-B and *-Praefect-C) and three Gitaly nodes (*-Gitaly-A, *-Gitaly-B and *-Gitaly-C), where * is MAIN or EU-MAIN respectively. The nodes are configured as follows (a verification sketch follows this list):

- All Praefect and Gitaly servers have the /etc/gitlab/skip-autoreconfigure file in place.
- The Praefect-A servers have `praefect['auto_migrate'] = true`; the Praefect-B and Praefect-C servers have `praefect['auto_migrate'] = false`.
- All three Gitaly nodes have `gitlab_rails['auto_migrate'] = false` and `gitlab_rails['rake_cache_clear'] = false`.
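For reference, a minimal sketch of how these per-role settings could be verified on each node before an update; the commands and the checked file path are assumptions based on the description above, not part of the original report.

```shell
# Hypothetical verification sketch; expected values mirror the node roles described above.

# On MAIN-Praefect-A and EU-MAIN-Praefect-A (expect: true):
sudo grep "praefect\['auto_migrate'\]" /etc/gitlab/gitlab.rb

# On the *-Praefect-B and *-Praefect-C nodes (expect: false):
sudo grep "praefect\['auto_migrate'\]" /etc/gitlab/gitlab.rb

# On all Gitaly nodes (expect: false for both settings):
sudo grep -E "gitlab_rails\['(auto_migrate|rake_cache_clear)'\]" /etc/gitlab/gitlab.rb

# On every Praefect and Gitaly node, the skip file should be present:
ls -l /etc/gitlab/skip-auto*
```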
We update each of the Gitaly and Praefect servers one at a time, in order. An update consists of running `sudo yum install gitlab-ee-13.9.1-ee.0.el7.x86_64` (changing the version accordingly), followed by `sudo gitlab-ctl reconfigure`. These steps are automated with Ansible, which runs in serial mode to ensure only one server is updated at a time. Because we are using Ansible, the server order is not guaranteed except that Gitaly servers are updated before Praefect servers, so Gitaly-B might be updated before Gitaly-A, Praefect-C before Praefect-A, and so on. A sketch of the per-node step is shown below.
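For clarity, the per-node update step that the Ansible play runs serially looks roughly like this; the package version is just the example quoted above.

```shell
# Run on one Gitaly or Praefect node at a time (Ansible serial mode enforces this).
sudo yum install gitlab-ee-13.9.1-ee.0.el7.x86_64   # substitute the target version
sudo gitlab-ctl reconfigure
```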
What is the current bug behavior?
Updates cause a failover, which results in read-only repositories.
What is the expected correct behavior?
Updates should not cause a failover, and there should be no downtime.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
For installations with the omnibus-gitlab package, run and paste the output of `sudo gitlab-rake gitlab:env:info`. For installations from source, run and paste the output of `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`.
Results of GitLab application Check
For installations with the omnibus-gitlab package, run and paste the output of `sudo gitlab-rake gitlab:check SANITIZE=true`. For installations from source, run and paste the output of `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`. (We will only investigate if the tests are passing.)
Possible fixes
The current workaround is to pause Geo replication via `sudo gitlab-rake geo:replication:pause`, but there is a related issue because DB replication is not paused for external databases; see #324172. A sketch of the workaround sequence follows.
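As a rough sketch, the sequence on the secondary Geo site might look like the following. The `geo:replication:resume` task is assumed here to be the counterpart of the pause task, and wrapping the Gitaly cluster update between the two is our interpretation rather than a documented procedure.

```shell
# On the secondary Geo site, before starting the Gitaly cluster update:
sudo gitlab-rake geo:replication:pause

# ... update the Gitaly and Praefect nodes one at a time as described above ...

# Once all nodes on the secondary site have been updated and reconfigured:
sudo gitlab-rake geo:replication:resume
```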