Where should special diffs go?

Problem Statement
Alternative: Git External Driver
Alternative: Async Worker
Alternative: Shadow Repos
Alternative: Special Diff Service

Problem Statement

While implementing better diffs for Jupyter Notebooks, one question kept coming up: where should special diffs happen? Suppose we want to preprocess a file before diffing, that is, compute diff(transform(f1), transform(f2)). Where would the transform be defined and executed?
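As a minimal sketch of the diff(transform(f1), transform(f2)) idea, assuming the transform is a pure text-to-text function: the hypothetical helper `transformed_diff` below applies a transform to both sides and diffs the results with Python's standard `difflib`. The JSON-normalizing transform is only a stand-in for a real notebook transformation.

```python
import difflib
import json
from typing import Callable


def normalize_json(raw: str) -> str:
    """Stand-in transform: re-serialize JSON with sorted keys and stable indentation."""
    return json.dumps(json.loads(raw), indent=1, sort_keys=True) + "\n"


def transformed_diff(transform: Callable[[str], str], old: str, new: str,
                     path: str = "file") -> str:
    """Compute diff(transform(old), transform(new)) as a unified diff string."""
    old_lines = transform(old).splitlines(keepends=True)
    new_lines = transform(new).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=f"a/{path}", tofile=f"b/{path}"))


if __name__ == "__main__":
    # Key order differs between the two inputs, but only the value change is reported.
    print(transformed_diff(normalize_json, '{"b": 1, "a": 2}', '{"a": 2, "b": 3}',
                           "data.json"))
```

Because both sides go through the same transform, differences that the transform erases (key order in this stand-in) never reach the diff. The open question of this issue is where such a function should live and run.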

Git provides the External Diff Driver interface, which allows one to develop custom differs (example), but Gitaly isn't prepared for this use case. As an alternative, we can add this logic to the application, but as described by @kerrizor here, this means:

Fetch diffs from Gitaly
Parse the file list, looking for non-binary *.ipynb files
Fetch the raw data for the target and source SHA
Convert to markdown
Diff the 2 resulting markdown files
Build a diff response the existing backend understands how to display
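The steps above could be sketched roughly as follows. Everything here is an illustrative assumption rather than GitLab's or Gitaly's actual API: `DiffFile`, `fetch_raw`, `apply_special_diffs` and `notebook_to_markdown` are hypothetical names, the markdown conversion is far simpler than a real one, and files added or deleted on one side are not handled.

```python
import difflib
import json
from dataclasses import dataclass
from typing import Callable, List

FENCE = "`" * 3  # markdown code fence marker


@dataclass
class DiffFile:
    """Hypothetical stand-in for one entry of the diff list fetched from Gitaly."""
    path: str
    binary: bool
    diff: str


def notebook_to_markdown(raw: str) -> str:
    """Rough .ipynb -> markdown conversion; assumes a top-level 'cells' list."""
    parts = []
    for cell in json.loads(raw).get("cells", []):
        text = "".join(cell.get("source", ""))  # string or list of strings
        if cell.get("cell_type") == "code":
            text = f"{FENCE}\n{text}\n{FENCE}"
        parts.append(text)
    return "\n\n".join(parts) + "\n"


def apply_special_diffs(files: List[DiffFile],
                        fetch_raw: Callable[[str, str], str],
                        target_sha: str, source_sha: str) -> List[DiffFile]:
    """Replace the diff of every non-binary notebook with a diff of its markdown form."""
    result = []
    for entry in files:
        if entry.binary or not entry.path.endswith(".ipynb"):
            result.append(entry)  # keep the regular diff untouched
            continue
        # Fetch the raw data for both SHAs and convert each side to markdown.
        old_md = notebook_to_markdown(fetch_raw(target_sha, entry.path))
        new_md = notebook_to_markdown(fetch_raw(source_sha, entry.path))
        diff = "".join(difflib.unified_diff(
            old_md.splitlines(keepends=True), new_md.splitlines(keepends=True),
            fromfile=f"a/{entry.path}", tofile=f"b/{entry.path}"))
        # Build a response in the same shape as the regular diff entries.
        result.append(DiffFile(entry.path, False, diff))
    return result
```

Note that the diff originally fetched for the notebook is thrown away and recomputed, which is part of the cost of doing this in the application.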

Another issue here is that only the Rails app can use the special diff. If we ever implement a mobile app or use a CLI, those clients will still see regular diffs.

I believe we all agree this logic needs to be added before the diff is generated, but not in Gitaly.

This issue is a spinoff of gitaly#3787 (closed); we recommend reading that discussion if further context is necessary.

Alternative: Git External Driver

Suggested initially by @chriscool here, Git provides an interface for external drivers that can be used to implement custom differs. These are simple to implement, but completely replace the diff command. Also, as @sluongng points out here, we need to be really careful with performance: diff scripts should be minimal and, for example, make no network requests.

Here we show how to use ipynbdiff specifically as a driver, and @stanhu gives some ideas here on how we could try it out on Gitaly.

Advantages: simple, minimal changes necessary

Concerns: cannot be too complex, fully replaces the diff command, we need to manually implement the diff options
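For illustration, a minimal external driver could look like the sketch below. It relies on Git's documented convention of calling the configured command with seven arguments (path old-file old-hex old-mode new-file new-hex new-mode). The `transform` function is a placeholder assumption, not ipynbdiff, and the setup shown in the comments uses placeholder paths.

```python
#!/usr/bin/env python3
# Assumed setup (paths are placeholders):
#   .gitattributes:  *.ipynb diff=ipynb
#   git config diff.ipynb.command /path/to/this_script.py
import difflib
import json
import sys
from typing import List


def transform(raw: str) -> str:
    """Placeholder transform: join notebook cell sources; fall back to the raw text."""
    try:
        cells = json.loads(raw)["cells"]
        return "\n".join("".join(cell.get("source", "")) for cell in cells) + "\n"
    except (ValueError, KeyError, TypeError, AttributeError):
        return raw


def read_side(path: str) -> str:
    """Read one side of the diff; Git passes /dev/null for added or deleted files."""
    if path == "/dev/null":
        return ""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def run_driver(args: List[str]) -> str:
    """Handle the arguments: path old-file old-hex old-mode new-file new-hex new-mode."""
    path, old_file, _old_hex, _old_mode, new_file, _new_hex, _new_mode = args[:7]
    old_lines = transform(read_side(old_file)).splitlines(keepends=True)
    new_lines = transform(read_side(new_file)).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=f"a/{path}", tofile=f"b/{path}"))


if __name__ == "__main__" and len(sys.argv) >= 8:
    sys.stdout.write(run_driver(sys.argv[1:]))
```

The sketch also makes the concern above concrete: the output format, context size and any other diff option are whatever the script hard-codes, since the script's output is shown in place of Git's own.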

Alternative: Async Worker

An async worker generates the transformation and diff, which can be cached for later use, either for diffing or for performance tuning.

Advantages:

No changes necessary on Gitaly
We are already working towards this direction
Files generated can be used for purposes other than diffing (performance tuning, for example)

Concerns:

Still discards the original diff from Gitaly
Still requires the raw files to be requested
Still requires the diff algorithm to be reimplemented
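A rough sketch of the worker-plus-cache idea, under the assumption that results can be keyed by blob id. A thread pool stands in here for whatever background job system would actually be used, and `AsyncTransformCache` is a hypothetical name.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class AsyncTransformCache:
    """Runs transformations in the background and keeps results keyed by blob id."""

    def __init__(self, transform: Callable[[str], str], max_workers: int = 2) -> None:
        self._transform = transform
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def schedule(self, blob_id: str, raw: str) -> None:
        """Queue a transformation once per blob (call this from a single thread)."""
        if blob_id not in self._jobs:
            self._jobs[blob_id] = self._pool.submit(self._transform, raw)

    def get(self, blob_id: str, timeout: float = 5.0) -> str:
        """Return the cached result, waiting up to `timeout` seconds if still running."""
        return self._jobs[blob_id].result(timeout=timeout)

    def close(self) -> None:
        """Wait for pending jobs and release the worker threads."""
        self._pool.shutdown(wait=True)
```

The transformation would be scheduled when a commit arrives, so that by the time a diff or file view is requested the transformed text is usually already available. The raw file still has to be fetched to feed `schedule`, which matches the concerns listed above.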

Alternative: Shadow Repos

One idea suggested by @jcaigitlab, which I discussed a bit with @kerrizor, is shadow repos: for every repo we have, we create a shadow repo to which we add transformed files. This is more of an evolution of the async worker than an alternative, since the transformations will likely happen asynchronously and be ready before Gitaly needs them to return a diff:

graph TD
    A(Raw Commit) -.->|Async Transform Commit| B(Shadow Repo)
    A --> C(Original Repo)
    B -.-> D(Gitaly)
    C --> D(Gitaly)
    D --> E(Application)

Advantages:

This works for any file type we want to transform
Can be used for performance tuning: for example, in the Jupyter case, when opening the file in the GitLab file viewer, rendering the markdown version would be a lot faster than rendering the raw Jupyter file, especially if we extract embedded images
No need to reimplement the diff, only the transformation. We could still use git diff and all its options

Concerns:

Mapping: We need to keep a map between commits on the original repo and the shadow one. This might not be a 1:1 relationship, since some commits on the original might not have transformations to be done. This could be solved by adding a file on the shadow repo that is updated with a hash of the commit on the original repo, forcing a 1:1 relationship.
Consistency: Suppose we transform file A with a script at version .1 and generate commit A1 on the shadow repo. A new change to A happens and we generate a new transformation, but the script is now at version .2, generating A2. The diff between A1 and A2 is no longer consistent. When a new version of the transformation function is created, should we regenerate the tree?
Storage: This will of course require more space, potentially doubling the storage requirements
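One possible way to think about the mapping and consistency concerns together is to key the map by both the original commit and the transform version. The sketch below is only an illustration of that policy, not a decision made in this issue, and `ShadowCommitMap` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ShadowCommitMap:
    """Maps (original commit, transform version) to a commit in the shadow repo."""
    current_version: str
    _entries: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def record(self, original_sha: str, shadow_sha: str, version: str) -> None:
        """Remember which shadow commit a given transform version produced."""
        self._entries[(original_sha, version)] = shadow_sha

    def lookup(self, original_sha: str) -> Optional[str]:
        """Shadow commit produced by the current transform version, if any."""
        return self._entries.get((original_sha, self.current_version))

    def stale(self, *original_shas: str) -> List[str]:
        """Original commits that need (re)transforming before a consistent diff."""
        return [sha for sha in original_shas if self.lookup(sha) is None]
```

Under this policy, bumping `current_version` makes every older shadow commit invisible to `lookup`, so a diff between A1 and A2 is only served once both sides were produced by the same script version. The cost is regenerating transformations on demand or in bulk, which adds to the storage concern.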

Alternative: Special Diff Service

Another evolution of the worker solution. Here, we assume that the files don't work well with git at all, and require their own algorithm. In this case, shadow repos don't bring a lot of benefit. PDFs, docx, &165, etc. are examples of what would fit in here. The diagram is similar to the one in the previous alternative:

graph TD
    A(Raw Commit) --> B(Original Repo)
    B -.-> |Async Transform| C(Special Diff Service)
    C --> D(Application)
    B --> E(Gitaly)
    E --> D(Application)

Note that the solutions are not mutually exclusive: We could have a setup with drivers, shadow repos AND a Special Diff Service.
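To illustrate how the approaches could coexist, here is a small sketch of a router that sends each file to a special differ when one is registered for its extension and falls back to a regular text diff otherwise. `DiffRouter`, `text_diff` and `size_summary` are hypothetical names, and the PDF differ is a deliberately trivial placeholder.

```python
import difflib
import os
from typing import Callable, Dict

Differ = Callable[[bytes, bytes], str]


def text_diff(old: bytes, new: bytes) -> str:
    """Default path: a line-based unified diff, standing in for the regular git diff."""
    old_lines = old.decode("utf-8", errors="replace").splitlines(keepends=True)
    new_lines = new.decode("utf-8", errors="replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines))


def size_summary(old: bytes, new: bytes) -> str:
    """Placeholder special differ for opaque formats: only reports the size change."""
    return f"binary content changed: {len(old)} -> {len(new)} bytes\n"


class DiffRouter:
    """Chooses a differ by file extension, with a default for everything else."""

    def __init__(self, default: Differ = text_diff) -> None:
        self._default = default
        self._by_extension: Dict[str, Differ] = {}

    def register(self, extension: str, differ: Differ) -> None:
        self._by_extension[extension.lower()] = differ

    def diff(self, path: str, old: bytes, new: bytes) -> str:
        extension = os.path.splitext(path)[1].lower()
        return self._by_extension.get(extension, self._default)(old, new)
```

For example, `router.register(".pdf", size_summary)` would route PDFs to the special differ while every other file keeps the default behavior.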

Edited Oct 13, 2021 by Eduardo Bonet