recovery of gitaly cluster owing to chart upgrade storage changes
summary
Upgrading an existing Gitaly Cluster in Kubernetes to v4.8 of the Helm charts (GitLab 13.8) renames the pods, persistent volumes, and persistent volume claims.
A customer raised a ticket for this (GitLab team members can find out more in Zendesk and SF), and an issue has also been raised about it.
The rename is likely to cause all repositories to become read-only and inaccessible.
The procedure to recover access to the volumes works, but further steps are required to get each repository working again via Praefect.
What's the correct way to recover the cluster in this situation?
steps to reproduce
- Deploy Gitaly cluster using the v4.7 (13.7) Helm chart - postgres, two praefect pods, three Gitaly pods.
  - At this point, only `default` is supported.
  - Praefect will be aware of three storages: `gitlab-gitaly-0`, `gitlab-gitaly-1`, `gitlab-gitaly-2`, and there are pods with the same names.
- Upgrade to v4.8 (13.8)
  - This chart will deploy `default`, but it also supports additional shards. So, the pods and storage names are all modified so this scales. The new pods and storages are called: `gitlab-gitaly-default-0`, `gitlab-gitaly-default-1`, `gitlab-gitaly-default-2`
  - New persistent volumes are created for these. The original PVs for `gitlab-gitaly-0`, `gitlab-gitaly-1`, `gitlab-gitaly-2` remain in the cluster.
  - The new PVs are empty, so what praefect sees is that all the cluster members are 100% out of sync. All repos are made read only.
- The persistent volume claims are manually 'fixed' to reattach the original persistent volumes.
- Assume a complete restart of praefect and Gitaly.
The state will remain as shown in the screenshot above, with every repository reporting:
    $ praefect -config /etc/gitaly/config.toml dataloss
    Virtual storage: default
      Outdated repositories
        @hashed/7a/61/7a61b53701befdae0eeeffaecc73f14e20b537bb0f8b91ad7c2936dc63562b25.git (read-only):
          Primary: gitlab-gitaly-default-X
          In-Sync Storages:
            gitlab-gitaly-0
            gitlab-gitaly-1
            gitlab-gitaly-2
          Outdated Storages:
            gitlab-gitaly-default-0 is behind by Y changes or less, assigned host
            gitlab-gitaly-default-1 is behind by Y changes or less, assigned host
            gitlab-gitaly-default-2 is behind by Y changes or less, assigned host
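To gauge how many repositories are affected, the read-only entries can be pulled out of that report. This is a minimal sketch, assuming the output keeps the shape shown here, with each affected repository on a line of the form `<relative path> (read-only):`. The function name `list_read_only_repos` is hypothetical and not part of praefect.

```shell
# Print the relative path of every repository that the dataloss report
# marks as read-only. Reads the report on stdin, one path per output line.
# Assumption: affected repositories appear as "<relative path> (read-only):".
list_read_only_repos() {
  # $1 is the first whitespace-separated field, so leading indentation
  # in the report does not matter.
  awk '/\(read-only\):/ { print $1 }'
}

# Example usage, with the config path used elsewhere in this issue:
#   praefect -config /etc/gitaly/config.toml dataloss | list_read_only_repos
```

Counting the lines of output (for example with `wc -l`) gives a rough size for the recovery job before any repository is changed.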
workaround
All primary repositories, wikis, snippets, design repositories, and so on will be read-only.
They can be fixed one by one this way:
praefect -config /etc/gitaly/config.toml accept-dataloss -virtual-storage default -authoritative-storage gitlab-gitaly-default-0 -repository aabb....git
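With many repositories, running that command by hand does not scale. The sketch below only prints one `accept-dataloss` command per repository so the list can be reviewed before anything is run. It reuses the flags and config path from the command above. The function name, the default authoritative storage, and the input file name in the usage comment are assumptions for illustration, and the right authoritative storage depends on which volume actually holds the current data.

```shell
# Print one `praefect accept-dataloss` command per relative path read from
# stdin. Nothing is executed here: review the output first.
# $1: authoritative storage (assumed default: gitlab-gitaly-default-0).
# Assumption: relative paths contain no whitespace, as with hashed storage.
print_accept_dataloss_commands() {
  authoritative="${1:-gitlab-gitaly-default-0}"
  while IFS= read -r repo; do
    [ -n "$repo" ] || continue   # skip blank lines
    printf 'praefect -config /etc/gitaly/config.toml accept-dataloss -virtual-storage default -authoritative-storage %s -repository %s\n' \
      "$authoritative" "$repo"
  done
}

# Example usage, with a hypothetical file holding one relative path per line:
#   print_accept_dataloss_commands gitlab-gitaly-default-0 < repos.txt
```

Printing rather than executing is deliberate: `accept-dataloss` overwrites the other replicas with the authoritative copy, so a dry run that can be inspected first is the safer default.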
Is `accept-dataloss` OK in this situation? Update: yes.
Is there a better way? Update: a direct change in the database would also work.
The deployed configuration for Praefect and Gitaly has no reference to the old storages `gitlab-gitaly-0`, `gitlab-gitaly-1`, `gitlab-gitaly-2`. So, the references to them in the `dataloss` output must be coming from the database.
If there's no way with existing userspace tools, is there a supportable way to tackle this in the database? Update: see this comment
Also, a closely related issue is autoscaling: this situation is a bit like scaling the cluster up by three nodes and then down by three nodes. Could manual reconciliation handle this?