recovery of gitaly cluster owing to chart upgrade storage changes

summary

Upgrading an existing Gitaly Cluster in Kubernetes to v4.8 of the Helm chart (GitLab 13.8) renames the pods, persistent volumes, and persistent volume claims.

A customer raised a ticket for this (GitLab team members can find out more in Zendesk and SF), and an issue has also been raised about it.

The upgrade is likely to cause all repositories to become read-only and inaccessible.

The procedure to recover access to the volumes works, but further steps are required to get each repository working again via Praefect.

What's the correct way to recover the cluster in this situation?

steps to reproduce

Deploy Gitaly Cluster using the v4.7 (13.7) Helm chart: PostgreSQL, two Praefect pods, three Gitaly pods.
At this point, only the default virtual storage is supported.
Praefect will be aware of three storages: gitlab-gitaly-0, gitlab-gitaly-1, and gitlab-gitaly-2. There are pods with the same names.
Upgrade to v4.8 (13.8).
This chart still deploys default, but it also supports additional shards, so the pod and storage names are all changed to make this scale. The new pods and storages are called gitlab-gitaly-default-0, gitlab-gitaly-default-1, and gitlab-gitaly-default-2.
New persistent volumes are created for these. The original PVs for gitlab-gitaly-0, gitlab-gitaly-1, and gitlab-gitaly-2 remain in the cluster.
The new PVs are empty, so Praefect sees all the cluster members as 100% out of sync. All repositories are made read-only.
The persistent volume claims are manually 'fixed' to reattach the original persistent volumes.
Assume a complete restart of Praefect and Gitaly.
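
The rename at the heart of this problem can be sketched as a simple name mapping. This is only an illustration of the pattern described in the steps above, assuming the release is named gitlab and only the default storage exists; new_storage_name is a hypothetical helper, not something the chart provides.

```shell
# new_storage_name: map a v4.7-style storage/pod name (gitlab-gitaly-N) to the
# v4.8-style name (gitlab-gitaly-default-N). Names that do not match the old
# pattern are printed unchanged. Hypothetical helper for illustration only.
new_storage_name() {
  printf '%s\n' "$1" | sed 's/^gitlab-gitaly-\([0-9][0-9]*\)$/gitlab-gitaly-default-\1/'
}

# Print the old -> new mapping for the three Gitaly pods in this scenario.
for i in 0 1 2; do
  printf '%s -> %s\n' "gitlab-gitaly-$i" "$(new_storage_name "gitlab-gitaly-$i")"
done
```

Praefect's database still records the names on the left as the in-sync storages, while the running configuration only knows the names on the right.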

The state will remain as shown in the screenshot above, with every repository reporting:

$ praefect -config /etc/gitaly/config.toml dataloss
Virtual storage: default
  Outdated repositories
    @hashed/7a/61/7a61b53701befdae0eeeffaecc73f14e20b537bb0f8b91ad7c2936dc63562b25.git (read-only):
      Primary: gitlab-gitaly-default-X
      In-Sync Storages:
        gitlab-gitaly-0
        gitlab-gitaly-1
        gitlab-gitaly-2
      Outdated Storages:
        gitlab-gitaly-default-0 is behind by Y changes or less, assigned host
        gitlab-gitaly-default-1 is behind by Y changes or less, assigned host
        gitlab-gitaly-default-2 is behind by Y changes or less, assigned host
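
To work through the affected repositories in bulk, their relative paths can be pulled out of the dataloss output. The following is a minimal sketch, assuming the output format shown above (it may differ between Praefect versions); list_read_only_repos is a hypothetical helper name, not part of Praefect.

```shell
# list_read_only_repos: read `praefect ... dataloss` output on stdin and print
# the relative path of every repository flagged "(read-only)", one per line.
# Assumes each repository header line looks like:
#     <relative-path>.git (read-only):
list_read_only_repos() {
  sed -n 's/^[[:space:]]*\(.*\.git\) (read-only):[[:space:]]*$/\1/p'
}
```

It could be used as `praefect -config /etc/gitaly/config.toml dataloss | list_read_only_repos`, with the result saved to a file for review before anything is changed.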

workaround

All primary repositories, wikis, snippets, design repositories, and so on will be read-only.

They can be fixed one by one this way:

praefect -config /etc/gitaly/config.toml accept-dataloss -virtual-storage default -authoritative-storage gitlab-gitaly-default-0 -repository aabb....git
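
With many repositories, running this by hand does not scale. The following is a minimal sketch that generates one accept-dataloss command per repository from a list of relative paths, using only the flags shown above. The function name accept_dataloss_cmds and the default authoritative storage are assumptions for illustration. It only prints the commands, so they can be reviewed before being run.

```shell
# accept_dataloss_cmds: read repository relative paths on stdin (one per line)
# and print one accept-dataloss command per path. Nothing is executed here.
# $1 = authoritative storage; the default below is an assumption.
accept_dataloss_cmds() {
  authoritative=${1:-gitlab-gitaly-default-0}
  while IFS= read -r repo; do
    [ -n "$repo" ] || continue   # skip blank lines
    printf 'praefect -config /etc/gitaly/config.toml accept-dataloss -virtual-storage default -authoritative-storage %s -repository %s\n' "$authoritative" "$repo"
  done
}
```

For example, `accept_dataloss_cmds gitlab-gitaly-default-0 < repos.txt > fix.sh` writes the commands to a file, where repos.txt is a hypothetical list of relative paths. Choose the authoritative storage carefully: it should be one whose volume actually holds the original data.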

Is accept-dataloss OK in this situation? Update: yes.

Is there a better way? Update: a direct change in the database would also work

The deployed configuration for Praefect and Gitaly has no reference to the old storages gitlab-gitaly-0, gitlab-gitaly-1, and gitlab-gitaly-2, so these references must be coming from the database.

If there's no way with existing userspace tools, is there a supportable way to tackle this in the database? Update: see this comment

Also, a closely related issue is autoscaling: this situation is a bit like scaling the cluster up by three nodes and then down by three nodes. Could manual reconciliation handle this?

references

13.8 release notes

Edited Mar 01, 2021 by Ben Prescott