Implement Ability to Regenerate All MR Diffs
<!-- The first section "Release notes" is required if you want to have your release post blog MR auto generated. Currently in BETA, details on the **release post item generator** can be found in the handbook: https://about.gitlab.com/handbook/marketing/blog/release-posts/#release-post-item-generator and this video: https://www.youtube.com/watch?v=rfn9ebgTwKg. The next four sections: "Problem to solve", "Intended users", "User experience goal", and "Proposal", are strongly recommended in your first draft, while the rest of the sections can be filled out during the problem validation or breakdown phase. However, keep in mind that providing complete and relevant information early helps our product team validate the problem and start working on a solution. -->
### Release notes
<!-- What is the problem and solution you're proposing? This content sets the overall vision for the feature and serves as the release notes that will populate in various places, including the [release post blog](https://about.gitlab.com/releases/categories/releases/) and [Gitlab project releases](https://gitlab.com/gitlab-org/gitlab/-/releases). " -->
### Problem to solve
As an Administrator of a GitLab instance, I want to be able to regenerate MR diff data, so that I can omit this information from my backups and rest easy. I would also like to be able to regenerate this data if I notice an operational problem - for example, if the storage holding the external_diffs data has been cleared out erroneously.
In the GitLab backup solution, it is possible to specify `--skip external_diffs` to [skip](https://docs.gitlab.com/charts/architecture/backup-restore.html) the backup of MR diffs. This is particularly valuable for self-hosted GitLab installations when MR diffs are [configured to use Object Storage](https://docs.gitlab.com/ee/administration/merge_request_diffs.html) (e.g. S3). Skipping the backup of external diffs makes the backup more efficient, because it removes the need to fetch and tar every diff from object storage, just to (probably) upload the entire backup to object storage again at the end of the process. In our testing, this reduced backup times from around an hour to a few minutes.
This does mean that external diffs need to be treated slightly differently, as they won't be backed up or restored as part of the overall backup process. We've safeguarded our external diff data by enabling S3 Versioning on the bucket containing it, and we also intend to configure AWS Backup rules for this bucket.
However, in the strictest sense, it should not be necessary to back this data up at all. Merge request diff information should be _regeneratable_, as it contains no information beyond what is already held within the application.
It would be nice to have a rake task that could be run to say: "regenerate all of the merge request diff data, as we think something's wrong with it". This should be possible, as the repository data, merge request data and everything else should already be present.
It might be possible to add this to the current restore process - simply regenerate these diffs as part of a GitLab backup restore operation, rather than them needing to be a distinct part of the backup restore process (i.e. assume that this should be regenerated rather than restored in all restore instances). This would save customers a lot of disk space in their backups in large installations.
<!-- What problem do we solve? Try to define the who/what/why of the opportunity as a user story. For example, "As a (who), I want (what), so I can (why/value)." -->
### Intended users
Administrators, sysadmins, DevOps / platform / SRE engineers and developers responsible for the operation of self-hosted GitLab instances.
<!-- Who will use this feature? If known, include any of the following: types of users (e.g. Developer), personas, or specific company roles (e.g. Release Manager). It's okay to write "Unknown" and fill this field in later.
Personas are described at https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/
* [Cameron (Compliance Manager)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#cameron-compliance-manager)
* [Parker (Product Manager)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#parker-product-manager)
* [Delaney (Development Team Lead)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#delaney-development-team-lead)
* [Presley (Product Designer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#presley-product-designer)
* [Sasha (Software Developer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sasha-software-developer)
* [Priyanka (Platform Engineer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#priyanka-platform-engineer)
* [Sidney (Systems Administrator)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)
* [Sam (Security Analyst)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sam-security-analyst)
* [Rachel (Release Manager)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#rachel-release-manager)
* [Alex (Security Operations Engineer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#alex-security-operations-engineer)
* [Simone (Software Engineer in Test)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#simone-software-engineer-in-test)
* [Allison (Application Ops)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#allison-application-ops)
* [Ingrid (Infrastructure Operator)](https://about.gitlab.com/handbook/product/personas/#ingrid-infrastructure-operator)
* [Dakota (Application Development Director)](https://about.gitlab.com/handbook/product/personas/#dakota-application-development-director)
* [Dana (Data Analyst)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#dana-data-analyst)
* [Eddie (Content Editor)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#eddie-content-editor)
-->
### User experience goal
As an administrator, I should be able to run a rake task that kicks off a regeneration of all merge request diff information, recreating it in external object storage if necessary. If I have a broken MR (i.e. with no external files available), it should be fixed after running this.
<!-- What is the single user experience workflow this problem addresses?
For example, "The user should be able to use the UI/API/.gitlab-ci.yml with GitLab to <perform a specific task>"
https://about.gitlab.com/handbook/engineering/ux/ux-research-training/user-story-mapping/ -->
### Proposal
A rake task that can be executed by the administrator would be the ideal solution. It should regenerate all merge request diff information, recreating it in external object storage if necessary, so that a broken MR (i.e. one with no external files available) is fixed once the task has run.
In this way, the rake task could be considered a part of "disaster recovery" should something happen to the storage of diffs.
<!-- How are we going to solve the problem? Try to include the user journey! https://about.gitlab.com/handbook/journeys/#user-journey -->
### Further details
It is possible to do this on a "per-repository" basis already - see the note on this support issue: https://gitlab.com/gitlab-org/gitlab/-/issues/214356#note_737579031
This feature proposal involves potentially using the above solution but making it applicable to an entire instance - i.e. a task that goes through each MR, checks that valid diffs are present, and recreates them if they are not.
Also see GitLab Support #289327 where I (under my organization's email) confirm that regenerating this data is possible.
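
To illustrate the instance-wide pass described above, here is a minimal Python sketch of the control flow only. It is not GitLab code: `MergeRequest`, `diff_key`, `diff_exists` and `regenerate` are hypothetical stand-ins for whatever the real rake task would use to look up an MR's external diff and rebuild it from repository data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class MergeRequest:
    """Hypothetical stand-in for a merge request record."""
    iid: int
    diff_key: str  # assumed object-storage key of the external diff


@dataclass
class RegenerationReport:
    """Summary of one pass over all merge requests."""
    checked: int = 0
    regenerated: List[int] = field(default_factory=list)
    failed: Dict[int, str] = field(default_factory=dict)


def regenerate_missing_diffs(
    merge_requests: Iterable[MergeRequest],
    diff_exists: Callable[[str], bool],
    regenerate: Callable[[MergeRequest], None],
) -> RegenerationReport:
    """Check every MR and rebuild its diff only when it is missing."""
    report = RegenerationReport()
    for mr in merge_requests:
        report.checked += 1
        if diff_exists(mr.diff_key):
            continue  # diff is present, nothing to do
        try:
            regenerate(mr)
            report.regenerated.append(mr.iid)
        except Exception as exc:
            # Record the failure and keep going so that one broken MR
            # does not stop the rest of the instance being repaired.
            report.failed[mr.iid] = str(exc)
    return report
```

Because existing diffs are skipped, a pass like this could be re-run safely after a partial failure, and the report gives the administrator a list of MRs that still need attention.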
<!-- Include use cases, benefits, goals, or any other details that will help us understand the problem better. -->
### Permissions and Security
Administrators with permission to take and restore backups are the primary audience, so the roles that already apply to those tasks are suitable.
<!-- What permissions are required to perform the described actions? Are they consistent with the existing permissions as documented for users, groups, and projects as appropriate? Is the proposed behavior consistent between the UI, API, and other access methods (e.g. email replies)?
Consider adding checkboxes and expectations of users with certain levels of membership https://docs.gitlab.com/ee/user/permissions.html
* [ ] Add expected impact to members with no access (0)
* [ ] Add expected impact to Guest (10) members
* [ ] Add expected impact to Reporter (20) members
* [ ] Add expected impact to Developer (30) members
* [ ] Add expected impact to Maintainer (40) members
* [ ] Add expected impact to Owner (50) members
Please consider performing a threat model for the code changes that are introduced as part of this feature. To get started, refer to our Threat Modeling handbook page https://about.gitlab.com/handbook/security/threat_modeling/#threat-modeling.
Don't hesitate to reach out to the Application Security Team (`@gitlab-com/gl-security/appsec`) to discuss any security concerns.
-->
### Documentation
Update the [Backup and restore](https://docs.gitlab.com/charts/architecture/backup-restore.html) architectural documentation to describe this new feature, along with any other relevant backup/restore pages.
<!-- See the Feature Change Documentation Workflow https://docs.gitlab.com/ee/development/documentation/workflow.html#for-a-product-change
* Add all known Documentation Requirements in this section. See https://docs.gitlab.com/ee/development/documentation/workflow.html
* If this feature requires changing permissions, update the permissions document. See https://docs.gitlab.com/ee/user/permissions.html -->
### Availability & Testing
None; this feature request should make backups smaller and more reliable as it would be possible to regenerate the latest state of this data rather than relying on a "snapshot" from a previous backup which could be old.
<!-- This section needs to be retained and filled in during the workflow planning breakdown phase of this feature proposal, if not earlier.
What risks does this change pose to our availability? How might it affect the quality of the product? What additional test coverage or changes to tests will be needed? Will it require cross-browser testing?
Please list the test areas (unit, integration and end-to-end) that needs to be added or updated to ensure that this feature will work as intended. Please use the list below as guidance.
* Unit test changes
* Integration test changes
* End-to-end test change
See the test engineering planning process and reach out to your counterpart Software Engineer in Test for assistance: https://about.gitlab.com/handbook/engineering/quality/test-engineering/#test-planning -->
### Available Tier
I think this should be available in all tiers that can use the backup functionality, but happy to defer to you.
<!-- This section should be used for setting the appropriate tier that this feature will belong to. Pricing can be found here: https://about.gitlab.com/pricing/
* Free
* Premium/Silver
* Ultimate/Gold
-->
### Feature Usage Metrics
Happy to leave this to GitLab to determine if appropriate.
<!-- How are you going to track usage of this feature? Think about user behavior and their interaction with the product. What indicates someone is getting value from it?
Create tracking issue using the Snowplow event tracking template. See https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Snowplow%20event%20tracking.md
-->
### What does success look like, and how can we measure that?
Customers can configure their backups to ignore external_diff data. Reporting on the size of this data, and on how much has been "saved" from backups, might be a valid success metric. This in turn reduces cost to customers through lower object storage bills for backups (e.g. we store backups in S3) and less space used in on-premises backup solutions (SAN/NAS space, etc.).
<!--
Define both the success metrics and acceptance criteria. Note that success metrics indicate the desired business outcomes, while acceptance criteria indicate when the solution is working correctly. If there is no way to measure success, link to an issue that will implement a way to measure this.
Create tracking issue using the Snowplow event tracking template. See https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/Snowplow%20event%20tracking.md
-->
### What is the type of buyer?
Those who are responsible for maintenance, backup, restore, availability and operation of GitLab. They will be reassured knowing that features like this are present in backup solutions.
<!-- What is the buyer persona for this feature? See https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/buyer-persona/
In which enterprise tier should this feature go? See https://about.gitlab.com/handbook/product/pricing/#three-tiers -->
### Is this a cross-stage feature?
Unknown - happy to take GitLab's steer here.
<!-- Communicate if this change will affect multiple Stage Groups or product areas. We recommend always start with the assumption that a feature request will have an impact into another Group. Loop in the most relevant PM and Product Designer from that Group to provide strategic support to help align the Group's broader plan and vision, as well as to avoid UX and technical debt. https://about.gitlab.com/handbook/product/#cross-stage-features -->
### What is the competitive advantage or differentiation for this feature?
Straightforward disaster recovery. Faster, more consistent backups?
### Links / references
See GitLab support case #289327 for the original enquiry.
<!-- Label reminders - you should have one of each of the following labels.
Use the following resources to find the appropriate labels:
- https://gitlab.com/gitlab-org/gitlab/-/labels
- https://about.gitlab.com/handbook/product/categories/features/
-->