Skip to content

Identify Full Diff Files

Thomas Randolph requested to merge feature/file-reviews/diff-file-ids into master

What does this MR do?

For #17531 (closed).

We need to be able to store "reviews" at a frozen point in time for a given file in a given MR.

See the designs on the linked issue, plus this initial review, plus this UX thread.

A "point in time" in git terms is essentially at a commit, which is why we're hashing both the file_identifier_hash and the content_sha to get the ID for each diff file.

New Behavior

Full diff files (that is: diff files loaded for display via the diffs endpoint diffs_batch.json) are assigned a unique identifier.

Branching Logic

Our central diff file processor (prepareDiffData) is used to prepare both the diff files and the diff metadata. Unfortunately, the "files" present in the diff metadata are not really complete "files" and so can't actually be uniquely identified.

To avoid errors or assigning non-unique IDs, if prepareDiffData is called with metadata, it skips assigning IDs.
Notably, the files processed when the metadata is loaded are never set to state.
Files processed during a normal diff load (e.g., not metadata), are set to state, so we can conveniently assume that diff files in state have unique identifiers.

This is just a flag passed by the caller action, so other potential branches can use this flag in the future.

Known Issues

Because we have to bail out on identifying files when they're metadata files, we can't make the blanket statement that all diff files have these IDs.
Similarly, because we can't guarantee every file has an ID assigned (e.g., if they don't have a content_sha), we also need to double-check that it exists before depending on it.

On another similar note, because we don't currently have a way of identifying a file based on the last time it changed, these IDs change with every commit and become slightly less useful.

Further detail

If we had a way to identify a file based on which commits it changes in, the ID would be consistent until the file was modified which is probably more useful. Example:

  1. File A is modified in commit 1
  2. File A is assigned ID 123abc
  3. The branch has more commits added, commits 2, 3, 4, 5, but none touch File A
  4. In each of the MR versions (whether they're pushed individually or all at once), File A continues to have ID 123abc, because none of the later commits modified it.
  5. Commit 6 - which modifies File A - is pushed to the branch
  6. File A is assigned a new ID: 234def

In this example, an unchanged file retains a consistent ID until it's modified. For most use-cases, this is ideal.
For situations where the ID must be truly unique, the file ID can be combined with the head_sha to create something unique in every commit in the MR.


None of these known issues break this feature, but they will be something that we have to address in the future.

Most notably:

we don't currently have a way of identifying a file based on the last time it changed

If we implement a more robust identifying algorithm, we may invalidate many of the old-style IDs and - as one example - inadvertently clear any in-progress reviews.
Alternatively, we could implement an "upgrade migration" which checks the old style and upgrades to the new style dynamically, but this is a lot of extra work.

Review Recommendations and Overview

I recommend taking a look at the commits from oldest to newest, but not reviewing there.
The work is broken up into 4 commits:

  • The oldest commit implements the basics of the new behavior: it creates an ID for diff files.
  • The second commit updates the preparation code to skip that ID in the case of metadata, and - in the bulk of the changes for this MR - updates the call signature to use an object instead of ordered parameters.
  • The third commit updates the test suite for this utility file to:
    • Test the newly added ID and - critically - make sure it's not added for metadata files
    • Use a getter for the local test diff files since the previous version maintained mutated state between test runs.
  • The most recent commit adds the additional check to exclude the ID if the file doesn't have a content_sha
    • The additional test for this feature is also included in this commit

Timeline

Merge Request Description
!49173 (merged)
!49174 (merged)
!49190 (merged)
!49208 (merged)
Various cleanup merge requests in preparation for this feature.
We're here 👉🏻 Identify diff files using the content_sha
Switch identification to use blob.id
Add tooling to add/remove reviews from LocalStorage to accommodate the file review feature
Add UI to review files

Screenshots (strongly suggested)

N/A, this is all developer-facing work

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Thomas Randolph

Merge request reports