15.9 onwards: Praefect metadata loss and inaccessible repositories if customer misconfigures Gitaly
Summary
GitLab / Gitaly 15.9 introduce two changes:
- Metadata deletion is enabled by default
- Significant changes to how Gitaly (and Praefect) are configured in Omnibus
  - Of note, data paths on Gitaly require the addition of a directory while also translating the configuration:
# git_data_dirs[<name>]['path']. Use the value from git_data_dirs[<name>]['path'] and append '/repositories' to it.
#
# For example, if the path in 'git_data_dirs' was '/var/opt/gitlab/git-data', use
# '/var/opt/gitlab/git-data/repositories'. The '/repositories' extension was automatically
# appended to the path configured in `git_data_dirs`.
path: ...,
If a mistake is made in the configuration change, for example omitting `repositories`, then when metadata verification runs, Gitaly will report the repositories as missing, resulting in the metadata being deleted.
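The path translation described above can be sketched in Ruby. This is only an illustration: `storages_from_git_data_dirs` is a hypothetical helper, not something shipped with Omnibus, and the storage name `default` is an assumption. The example path is the one quoted in the configuration comment.

```ruby
# Hypothetical helper (not part of Omnibus GitLab): builds the new-style
# storage list from a legacy git_data_dirs hash, appending 'repositories'
# to each path as the configuration comment instructs.
def storages_from_git_data_dirs(git_data_dirs)
  git_data_dirs.map do |name, settings|
    { name: name, path: File.join(settings['path'], 'repositories') }
  end
end

# Legacy setting; the storage name 'default' is assumed for illustration.
legacy = { 'default' => { 'path' => '/var/opt/gitlab/git-data' } }

storages = storages_from_git_data_dirs(legacy)
puts storages.inspect
```

The resulting list is the shape expected under `gitaly['configuration'] = { storage: [...] }`. Copying the old path across without the suffix is exactly the mistake that leads to the metadata loss.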
It is unlikely, but possible, for metadata verification to run when Praefect initialises, leaving zero opportunity to correct the mistake. I triggered this in testing by setting the `verification_interval` low.
In reality, it's then down to luck how quickly the verifier will run down the clock from the last time it ran. Customers might have days to notice the mistake, or just minutes.
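To make the timing concrete, here is a simplified model of the "run down the clock" behaviour. It is a sketch under assumptions, not Praefect's actual implementation: it assumes a replica becomes due once the configured interval has elapsed since it was last verified, and the timestamps are made up for illustration. The 168 hour and 2 hour values are the ones used in the reproduction steps.

```ruby
# Simplified model (assumption, not Praefect's code): a replica is due for
# verification once at least `interval` seconds have passed since it was
# last verified.
def verification_due?(last_verified_at, interval, now)
  now - last_verified_at >= interval
end

hour = 3600
now = Time.utc(2023, 6, 1, 12, 0, 0)   # hypothetical "current" time
last_verified_at = now - 3 * hour      # hypothetical: verified 3 hours ago

# With a 168 hour interval the replica is not due yet.
default_due = verification_due?(last_verified_at, 168 * hour, now)

# With the interval lowered to 2 hours it is due immediately.
lowered_due = verification_due?(last_verified_at, 2 * hour, now)

puts "168h interval due: #{default_due}, 2h interval due: #{lowered_due}"
```

Under this model, the window a customer has to notice the mistake is simply whatever remains of the interval for each replica, which is why it can range from days to minutes.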
If the affected repositories are using `@cluster` paths, then data recovery becomes exceedingly difficult:
- Praefect's track-repository/track-repositories ... (#5402 - closed) prevents automatic recovery of the data via `praefect track-repositories`
- Cluster paths do not map to what's in the Rails database, so this provides no way to remap from what's on disk
This is likely to coincide with the 16.0 major upgrade, since that was the cut-off for implementing the configuration changes.
Workaround
Disable verification.
Customers would need to know to set this in advance on their Praefect nodes:
praefect['configuration'] = {
  background_verification: {
    verification_interval: '0',
  },
}
This would need to be brought very clearly to customers' attention.
Steps to reproduce
- 15.9 - 15.11: Reconfigure Praefect and Gitaly to use the new configuration structures
- Test
- Back up your Praefect database. If using `pg_dump`, include `-c` (docs ref) so the restore drops and recreates the database verbatim.
  - Test the backup works (you will need the backup, and if you don't test it, you don't have one)
- Reconfigure Gitaly storage paths, making a mistake in `gitaly['configuration'] = { storage: [ { path : } ] }` such as omitting `repositories`
- Reduce the `verification_interval` and apply with `gitlab-ctl reconfigure` (restarting praefect)
  - optional step; you can just wait up to 168 hours for the next scheduled run

  praefect['configuration'] = {
    background_verification: {
      verification_interval: '2h',
    },
  }
- Unless verification happened to trigger within the 2 hours prior to performing this test, the verifier will trigger immediately and metadata gets deleted.
What is the current bug behavior?
Changes in GitLab 15.9 onwards, mandated from 16.0, combine with human error to result in almost unrecoverable data loss.
- The repos are there, but reuniting them with the RPCs from Rails is difficult.
- Recovery of `@cluster` paths cannot take advantage of the automation that is available, and these would be the hardest to fix manually.
What is the expected correct behavior?
These features and changes need to account better for the possibility of human error.
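One possible shape for such a safeguard, purely as an illustration: a pre-flight check that a configured storage path actually looks like a populated storage before it is trusted. `storage_looks_populated?` is a hypothetical function, not part of GitLab, Gitaly or Praefect, and it assumes that a populated storage has an `@hashed` or `@cluster` directory directly beneath it.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical pre-flight check (not part of GitLab, Gitaly or Praefect).
# Assumption: a populated storage has '@hashed' or '@cluster' directly
# beneath the configured path.
def storage_looks_populated?(path)
  %w[@hashed @cluster].any? { |dir| Dir.exist?(File.join(path, dir)) }
end

# Demonstration against a throwaway directory tree.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'repositories', '@hashed'))

  # Path with '/repositories' appended, as the new configuration requires.
  puts storage_looks_populated?(File.join(root, 'repositories'))

  # Path with '/repositories' omitted, the mistake described in this issue.
  puts storage_looks_populated?(root)
end
```

A check along these lines would flag the omitted `repositories` suffix on an existing installation, although it would also flag a genuinely empty new storage, so it is better suited to a warning than a hard failure.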
`gitlab-backup` does not back up the Praefect metadata, and we do not make any mention of backing up the database in the documentation.
In light of this, I think this should be the default:
praefect['configuration'] = {
  background_verification: {
    verification_interval: '0',
  },
}
Relevant logs and/or screenshots
see #5529 (comment 1519680603)