Where should special diffs go?
- Problem Statement
- Alternative: Git External Driver
- Alternative: Async Worker
- Alternative: Shadow Repos
- Alternative: Special Diff Service
Problem Statement
When working on implementing better diffs for Jupyter Notebooks, a question that kept coming up was: 'Where should special diffs happen?' Suppose that we want to preprocess a file before diffing, i.e. compute diff(transform(f1), transform(f2)): where would the transform be defined and executed?
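As a minimal sketch of that composition, assuming a hypothetical `transform` that flattens a notebook's cells into plain text lines (this is an illustration, not ipynbdiff's actual output format), the idea can be written with Python's standard `difflib`:

```python
import difflib
import json


def transform(raw_notebook: str) -> list[str]:
    """Hypothetical transform: flatten notebook cells into plain text lines."""
    notebook = json.loads(raw_notebook)
    lines = []
    for cell in notebook.get("cells", []):
        # Mark the start of each cell so cell boundaries show up in the diff.
        lines.append(f"%% {cell.get('cell_type', 'unknown')}")
        source = cell.get("source", [])
        # The notebook format allows "source" to be a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        lines.extend(source.splitlines())
    return lines


def special_diff(f1: str, f2: str) -> list[str]:
    """Compute diff(transform(f1), transform(f2)) as unified diff lines."""
    return list(
        difflib.unified_diff(
            transform(f1), transform(f2), fromfile="a", tofile="b", lineterm=""
        )
    )
```

The open question in this issue is not the function itself but where something like `transform` would live and run: in Git, in Gitaly, or in the application.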
Git provides the External Diff Driver interface, which allows one to develop custom differs (example), but Gitaly isn't prepared for this use case. As an alternative, we can add this logic to the application, but as described by @kerrizor here, this means:
- Fetch diffs from Gitaly
- Parse the file list, looking for non-binary *.ipynb files
- Fetch the raw data for the target and source SHA
- Convert to markdown
- Diff the 2 resulting markdown files
- Build a diff response the existing backend understands how to display
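The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: `ChangedFile`, `fetch_blob` and `to_markdown` are hypothetical stand-ins, not existing Gitaly or Rails APIs, and the target SHA is treated as the old side of the diff. The last step (building a response the backend understands) is left out.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChangedFile:
    """Hypothetical entry of the file list that comes with the regular diffs."""
    path: str
    binary: bool


def notebook_diffs(
    files: list[ChangedFile],
    target_sha: str,
    source_sha: str,
    fetch_blob: Callable[[str, str], str],    # hypothetical: (sha, path) -> raw content
    to_markdown: Callable[[str], list[str]],  # hypothetical: raw content -> markdown lines
) -> dict[str, list[str]]:
    """Return a unified diff of the converted notebooks, keyed by path."""
    result = {}
    for changed in files:
        # Only non-binary *.ipynb files get the special treatment.
        if changed.binary or not changed.path.endswith(".ipynb"):
            continue
        # Fetch the raw data for both SHAs.
        old_raw = fetch_blob(target_sha, changed.path)
        new_raw = fetch_blob(source_sha, changed.path)
        # Convert both sides, then diff the two converted versions.
        result[changed.path] = list(
            difflib.unified_diff(
                to_markdown(old_raw),
                to_markdown(new_raw),
                fromfile=f"a/{changed.path}",
                tofile=f"b/{changed.path}",
                lineterm="",
            )
        )
    return result
```

Note that the diff originally fetched from Gitaly is thrown away for every notebook, and two extra blob fetches are needed per file.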
Another issue here is that only the Rails app can use the special diff. If we ever implement a mobile app, or use a CLI, those clients will still see regular diffs.
I believe we all agree this logic needs to be added before the diff is generated, but not on Gitaly.
This issue is a spinoff of gitaly#3787 (closed); we recommend reading that discussion if further context is necessary.
Alternative: Git External Driver
Suggested initially by @chriscool here, Git provides the interface for external drivers that can be used to implement custom differs. These are simple to implement, but they completely replace the diff command. Also, as @sluongng points out here, we need to be really careful with performance: diff scripts should be very minimal, with no network requests, for example.
Here we show how to use ipynbdiff specifically as a driver, and @stanhu gives some ideas on how we could try it out on Gitaly here.
Advantages: simple, minimal changes necessary
Concerns: the driver cannot be too complex, it fully replaces the diff command, and we need to manually implement the diff options
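To make the driver interface concrete, here is a minimal sketch of what such a script could look like. It relies on the fact that, for a regular modification, Git calls an external diff command with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode). The `to_lines` transform is a hypothetical placeholder and is not how ipynbdiff works.

```python
#!/usr/bin/env python3
import difflib
import json
import sys


def to_lines(path: str) -> list[str]:
    """Hypothetical transform: read a notebook file and return its cell sources as lines."""
    with open(path, encoding="utf-8") as handle:
        raw = handle.read()
    if not raw.strip():
        # Added or deleted files show up as an empty side (for example /dev/null).
        return []
    lines = []
    for cell in json.loads(raw).get("cells", []):
        source = cell.get("source", [])
        if isinstance(source, list):
            source = "".join(source)
        lines.extend(source.splitlines())
    return lines


def run_driver(argv: list[str]) -> str:
    """Build the diff text from the seven arguments Git passes to the driver."""
    path, old_file, _old_hex, _old_mode, new_file, _new_hex, _new_mode = argv
    diff = difflib.unified_diff(
        to_lines(old_file),
        to_lines(new_file),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",
    )
    return "\n".join(diff)


# Only act when called with exactly the seven driver arguments.
if __name__ == "__main__" and len(sys.argv) == 8:
    print(run_driver(sys.argv[1:]))
```

Such a script would be wired up through a `.gitattributes` entry like `*.ipynb diff=notebook` plus a `diff.notebook.command` setting pointing at the script, where the driver name `notebook` is an arbitrary choice. The sketch also illustrates the concern above: options such as context size or whitespace handling are not honored unless the script implements them itself.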
Alternative: Async Worker
An async worker generates the transformation and the diff, which can be cached for later use, either for diffing or for performance tuning.
Advantages:
- No changes necessary on Gitaly
- We are already working towards this direction
- The generated files can be used for purposes other than diffing (performance tuning, for example)
Concerns:
- Still discards the original diff from Gitaly
- Still requires the raw files to be requested
- Still requires the diff algorithm to be reimplemented
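A minimal sketch of the caching side of this alternative is shown below. It assumes transformed output is keyed by blob SHA and by a version tag of the transform script; `TransformCache`, `worker_job` and `TRANSFORM_VERSION` are hypothetical names, and the in-memory dict stands in for whatever cache or object store would actually be used.

```python
from typing import Callable, Optional

TRANSFORM_VERSION = "1"  # hypothetical version tag of the transform script


class TransformCache:
    """In-memory stand-in for a real cache or object store."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], str] = {}

    def get(self, blob_sha: str, version: str = TRANSFORM_VERSION) -> Optional[str]:
        return self._store.get((blob_sha, version))

    def put(self, blob_sha: str, transformed: str, version: str = TRANSFORM_VERSION) -> None:
        self._store[(blob_sha, version)] = transformed


def worker_job(
    cache: TransformCache,
    blob_sha: str,
    raw: str,
    transform: Callable[[str], str],
) -> None:
    """What the async worker could run after a push: transform once, then cache."""
    if cache.get(blob_sha) is None:
        cache.put(blob_sha, transform(raw))
```

Keying on the blob SHA means an unchanged file is never transformed twice, and including the version in the key keeps output from an older transform from being served after the script changes.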
Alternative: Shadow Repos
One idea suggested by @jcaigitlab, and that I discussed a bit with @kerrizor, is that of shadow repos: for every repo we have, we create a shadow repo to which we add the transformed files. This is more of an evolution of the async worker than an alternative, since the transformations will likely happen asynchronously and be ready by the time Gitaly needs them to return a diff:
graph TD
A(Raw Commit) -.->|Async Transform Commit| B(Shadow Repo)
A --> C(Original Repo)
B -.-> D(Gitaly)
C --> D(Gitaly)
D --> E(Application)
Advantages:
- This works for any file type we want to transform
- Can be used for performance tuning: for example, in the Jupyter case, when opening the file in the GitLab file viewer, rendering the markdown version would be a lot faster than rendering the raw Jupyter file, especially if we extract embedded images
- No need to reimplement the diff, only the transformation. We could still use git diff and all its options
Concerns:
- Mapping: We need to keep a map between commits on the original repo and the shadow one. This might not be a 1:1 relationship, since some commits on the original might not have transformations to be done. This could be solved by adding a file on the shadow repo that is updated with a hash of the commit on the original repo, forcing a 1:1 relationship.
- Consistency: Suppose we transform file A with a script at version .1 and generate commit A1 on the shadow repo. A new change on A happens and we generate a new transformation, but the script is now at version .2, generating A2. The diff between A1 and A2 is not consistent anymore. When a new version of the transformation function is created, should we regenerate the tree?
- Storage: This will of course require more space, potentially doubling the storage requirements
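The mapping and consistency concerns can be illustrated with a small sketch. It assumes a 1:1 map from original commits to shadow commits and records the transform version used for each shadow commit; `ShadowMap` and `ShadowEntry` are hypothetical names, and refusing to diff across versions is only one possible policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShadowEntry:
    shadow_sha: str
    transform_version: str


class ShadowMap:
    """Hypothetical 1:1 map from original commit SHAs to shadow commits."""

    def __init__(self) -> None:
        self._entries: dict[str, ShadowEntry] = {}

    def record(self, original_sha: str, shadow_sha: str, transform_version: str) -> None:
        self._entries[original_sha] = ShadowEntry(shadow_sha, transform_version)

    def shadow_pair(self, old_sha: str, new_sha: str) -> Optional[tuple[str, str]]:
        """Return the shadow SHAs to diff, or None when no consistent pair exists."""
        old = self._entries.get(old_sha)
        new = self._entries.get(new_sha)
        if old is None or new is None:
            # A transformation is missing, for example because it has not run yet.
            return None
        if old.transform_version != new.transform_version:
            # A1 and A2 came from different script versions, so their diff
            # would mix content changes with transform changes.
            return None
        return (old.shadow_sha, new.shadow_sha)
```

When `shadow_pair` returns `None`, the caller would have to fall back to the regular diff or regenerate the older shadow commit with the current transform version, which is the open question raised above.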
Alternative: Special Diff Service
Another evolution of the worker solution. Here, we assume that the files don't work well with git at all and require their own diff algorithm. In this case, shadow repos don't bring a lot of benefit. PDF, docx, &165, etc. are examples of what would fit in here. The diagram is similar to the one in the previous alternative:
graph TD
A(Raw Commit) --> B(Original Repo)
B -.-> |Async Transform| C(Special Diff Service)
C --> D(Application)
B --> E(Gitaly)
E --> D(Application)
Note that the solutions are not mutually exclusive: We could have a setup with drivers, shadow repos AND a Special Diff Service.
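One way to picture such a combined setup is a per-file-type routing decision. The sketch below is purely illustrative: the routing table, the strategy names and the assignment of file types to strategies are all assumptions, not a proposal made in this issue.

```python
from pathlib import PurePosixPath

# Hypothetical routing table: which mechanism produces the diff for a file type.
STRATEGY_BY_SUFFIX = {
    ".ipynb": "shadow_repo",          # text-like after a transformation
    ".pdf": "special_diff_service",   # needs its own diff algorithm
    ".docx": "special_diff_service",
}


def diff_strategy(path: str) -> str:
    """Pick a diff strategy from the file extension, defaulting to plain git diff."""
    suffix = PurePosixPath(path).suffix.lower()
    return STRATEGY_BY_SUFFIX.get(suffix, "git_diff")
```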