Move GitLab data integrity checks into default installation
Release notes
Problem to solve
Long-running installations of GitLab accumulate inconsistencies such as missing or orphaned files.
Orphaned files
Orphaned files are files on storage that have no corresponding Rails DB record. Such a file consumes storage unnecessarily, since without a DB record it can never be accessed via GitLab.
Missing files
Missing files are instances where a file has been deleted from storage but the Rails DB record was left behind. This can happen because of a bug in a feature's implementation, because the associated storage failed or was removed, or because files that were no longer needed were deleted without removing the associated DB record. The result is a broken user experience in which the files cannot be accessed via the GitLab interface.
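The two categories above amount to set differences between the paths recorded in the database and the paths present on storage. A minimal sketch, assuming both sides can be enumerated as sets of relative paths; the function name `classify_inconsistencies` and the sample paths are hypothetical and not part of GitLab:

```python
def classify_inconsistencies(db_paths, storage_paths):
    """Return (orphaned, missing) for paths known to the DB and paths on storage."""
    db_paths = set(db_paths)
    storage_paths = set(storage_paths)
    # On storage but without a DB record: orphaned.
    orphaned = storage_paths - db_paths
    # Recorded in the DB but absent from storage: missing.
    missing = db_paths - storage_paths
    return sorted(orphaned), sorted(missing)


# Hypothetical sample data for illustration only.
orphaned, missing = classify_inconsistencies(
    db_paths={"uploads/a.png", "uploads/b.png"},
    storage_paths={"uploads/b.png", "uploads/c.png"},
)
print(orphaned)  # ['uploads/c.png']
print(missing)   # ['uploads/a.png']
```

A real check would also need to compare content (for example, checksums), since a file can exist on both sides and still be corrupted.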
Impact on backups
These issues can cause backups to fail or to consume more storage than needed. They also erode trust in backed-up data, because it may not be clear whether the backup failed to back up the data or whether the inconsistency already existed when the backup was taken.
Compromises the Geo first-time experience
The first-time Geo experience is compromised when these issues are flagged soon after Geo is set up, because Geo runs a consistency check as part of its verification logic. By this time, it may be too late to recover any lost data, and resolving the issues prolongs the time needed to reach a reliable and stable Geo deployment.
Delays infrastructure migrations
Some customers use Geo to migrate to new infrastructure. Typically this is performed on a tight timeline with limited headroom for troubleshooting, since the customer is unaware of existing orphaned and missing files. When Geo identifies these issues, they can take days and sometimes weeks to resolve, causing the customer to miss their migration deadline. Until the root cause of the errors is identified, the customer runs the risk of losing data during the migration.
In general, the earlier these problems are caught, the higher the chance that the systems administrator is able to recover the data. Each check will surface fewer errors, making them less overwhelming to tackle. Early detection will also help catch a subset of malicious or unintentional deletions.
Intended users
User experience goal
- The integrity checks should run periodically and automatically in the background.
- The systems administrator will be able to view the results via a dashboard in the admin area of the UI.
- The summary of the results will also be available via the console (similar to Geo).
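As an illustration of the console summary in the last goal, the sketch below tallies objects per component and verification state. The component names, the `summarize` helper, and the output layout are assumptions made for illustration; they do not reflect an existing GitLab command or its output.

```python
from collections import Counter

# Verification states named in this proposal.
STATES = ("verified", "pending", "failed")


def summarize(objects):
    """Build one summary line per component from (component, state) pairs."""
    counts = Counter(objects)
    components = sorted({component for component, _ in counts})
    lines = []
    for component in components:
        # Counter returns 0 for states that never occurred.
        parts = [f"{state}={counts[(component, state)]}" for state in STATES]
        lines.append(f"{component}: " + " ".join(parts))
    return lines


# Hypothetical sample data.
sample = [
    ("uploads", "verified"),
    ("uploads", "failed"),
    ("lfs_objects", "verified"),
    ("lfs_objects", "pending"),
    ("lfs_objects", "verified"),
]
for line in summarize(sample):
    print(line)
# lfs_objects: verified=2 pending=1 failed=0
# uploads: verified=1 pending=0 failed=1
```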
Proposal
Geo can already verify the integrity of GitLab data through its verification logic on the primary Geo site. It typically flags missing and orphaned files and other inconsistencies in the data as a verification failure.
The proposal is to move this functionality to the default non-Geo installation:
- Have it as a configurable setting, on by default.
- It will have a UI similar to that on the Geo dashboard for the primary site, with the additional ability to drill down into each component (similar to the Geo replica view).
- It will be possible to filter objects according to their verification state: verified, pending, or failed.
- It will show the reason for any failures, guiding the systems administrator toward resolving the problem.
- It will show the last time an object was verified.
- The verification interval will be configurable (similar to Geo).
- Since verification can consume resources, the concurrency of verification will be configurable (similar to Geo).
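To make the interval, concurrency, and state-filter items above more concrete, here is a rough sketch of how one verification pass could select objects. The `VerificationSettings` fields, their default values, `TrackedObject`, `select_batch`, and `filter_by_state` are hypothetical names chosen for illustration; they are not existing GitLab or Geo settings or APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class VerificationSettings:
    enabled: bool = True                     # on by default, per the proposal
    interval: timedelta = timedelta(days=7)  # assumed default, not from the proposal
    concurrency: int = 10                    # assumed default, not from the proposal


@dataclass
class TrackedObject:
    name: str
    state: str                               # "verified", "pending" or "failed"
    last_verified_at: Optional[datetime] = None
    failure_reason: Optional[str] = None


def select_batch(objects, settings, now):
    """Pick at most `concurrency` objects that were never verified or are overdue."""
    if not settings.enabled:
        return []
    due = [
        obj for obj in objects
        if obj.last_verified_at is None
        or now - obj.last_verified_at >= settings.interval
    ]
    # Never-verified objects first, then the ones verified longest ago.
    due.sort(key=lambda obj: obj.last_verified_at or datetime.min)
    return due[: settings.concurrency]


def filter_by_state(objects, state):
    """Return the objects in the given verification state."""
    return [obj for obj in objects if obj.state == state]


# Hypothetical sample data.
now = datetime(2024, 1, 15)
objects = [
    TrackedObject("a", "verified", datetime(2024, 1, 14)),
    TrackedObject("b", "failed", datetime(2024, 1, 1), "file not found"),
    TrackedObject("c", "pending"),
]
settings = VerificationSettings(concurrency=2)
print([obj.name for obj in select_batch(objects, settings, now)])  # ['c', 'b']
print([obj.name for obj in filter_by_state(objects, "failed")])    # ['b']
```

Capping each pass at the concurrency limit is one simple way to bound resource use; an actual implementation might instead limit the number of parallel workers.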
Permissions and Security
The dashboard for this feature will be in the Admin Area and accessible to anyone with GitLab administrator permissions.
Documentation
Document the feature as a new page under the menu Administer --> Maintain your installation.
Availability & Testing
Available Tier
- Premium
- Ultimate
Feature Usage Metrics
TBD
What does success look like, and how can we measure that?
We will see fewer support cases related to orphaned and missing files.
We will see fewer Geo-assisted migration escalations.
What is the type of buyer?
Is this a cross-stage feature?
TBD