Refactor blob/diff handling to prevent loading of blob data until absolutely necessary

libgit2 loads an entire blob's data into memory the moment you retrieve the blob. This makes pages with lots of blobs (e.g. a commit with lots of diffs) very slow, with easily over 50% of the time being spent just loading Git data. Part of the problem is that we load blobs into memory just so we can perform certain checks, such as:

Is the blob an LFS pointer?
Is the blob a text or binary blob?
How large is the diff?
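
To make the checks above concrete, here is a rough sketch of how they could be computed once from raw blob data. The `BlobChecks` module and its constants are hypothetical names, not existing GitLab code. The LFS check relies on the pointer file format (a small file starting with the spec version line) and the binary check mimics Git's heuristic of looking for a NUL byte near the start of the data. Blob size stands in for the diff size here, which is an assumption.

```ruby
# Hypothetical helper: computes the per-blob checks in a single pass so the
# result can be stored and reused without loading the blob again.
module BlobChecks
  # First line of a Git LFS pointer file; pointer files are under 1024 bytes.
  LFS_POINTER_PREFIX = 'version https://git-lfs.github.com/spec/v1'
  LFS_POINTER_MAX_SIZE = 1024

  # Like Git, only inspect the first few thousand bytes for a NUL byte.
  BINARY_SNIFF_BYTES = 8000

  def self.compute(data)
    bytes = data.b # binary copy, so invalid UTF-8 cannot raise

    {
      'lfs_pointer' => bytes.bytesize < LFS_POINTER_MAX_SIZE &&
        bytes.start_with?(LFS_POINTER_PREFIX),
      'binary' => bytes.byteslice(0, BINARY_SNIFF_BYTES).include?("\0".b),
      'size' => bytes.bytesize
    }
  end
end
```

The result is a plain hash with string keys, so it can be serialized as-is.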

These checks always return the same result for a given blob, so their results can be cached. This in turn means we can defer loading blob data until we actually need it.
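
As an illustration of deferring the load, here is a minimal sketch of a blob wrapper that only fetches its data on first access. `LazyBlob` and the loader block are hypothetical; the real Blob class and the way it talks to libgit2 will differ.

```ruby
# Hypothetical sketch: the blob is identified by its id only, and the data is
# fetched through the given loader the first time #data is called.
class LazyBlob
  attr_reader :id, :load_count

  # The block receives the blob id and must return the raw blob data.
  def initialize(id, &loader)
    @id = id
    @loader = loader
    @load_count = 0
  end

  # Loads the data on first access and memoizes it afterwards.
  def data
    return @data if loaded?

    @load_count += 1
    @data = @loader.call(@id)
  end

  def loaded?
    instance_variable_defined?(:@data)
  end
end
```

Creating a `LazyBlob` is cheap, so a page can build one per diff entry and only pay for the blobs whose `data` is actually rendered.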

To support this we'll need to make sure that the Blob class is used everywhere and only loads the actual blob data when needed. Furthermore, it should cache all the data needed for the various checks. Since Git objects are ephemeral, I think using Redis is best. This way we don't have to worry about manually pruning data, as a key can expire automatically after a certain time period. At the same time we can extend the TTL on access so frequently used keys don't expire as quickly.
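
A rough sketch of the caching idea follows. `BlobCheckCache`, the key format and the one hour TTL are all assumptions made for illustration. The store only needs `get`, `setex` and `expire`, which a redis-rb client provides; `MemoryStore` is an in-memory stand-in so the sketch runs without a Redis server.

```ruby
require 'json'

# In-memory stand-in exposing only get/setex/expire with expiry semantics.
class MemoryStore
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @values = {}
    @deadlines = {}
  end

  def get(key)
    purge(key)
    @values[key]
  end

  def setex(key, ttl, value)
    @values[key] = value
    @deadlines[key] = @clock.call + ttl
    'OK'
  end

  def expire(key, ttl)
    purge(key)
    return false unless @values.key?(key)

    @deadlines[key] = @clock.call + ttl
    true
  end

  private

  # Drops the key if its deadline has passed.
  def purge(key)
    deadline = @deadlines[key]
    return unless deadline && deadline <= @clock.call

    @values.delete(key)
    @deadlines.delete(key)
  end
end

# Hypothetical cache for the per-blob check results.
class BlobCheckCache
  TTL = 3600 # seconds; arbitrary value for illustration

  def initialize(store)
    @store = store
  end

  # Returns the cached checks for blob_id, or runs the block (which is
  # expected to load the blob and compute the checks) and stores its result.
  def fetch(blob_id)
    key = "blob-checks:#{blob_id}"
    cached = @store.get(key)

    if cached
      @store.expire(key, TTL) # reset the TTL so hot keys stay around longer
      return JSON.parse(cached)
    end

    payload = JSON.generate(yield)
    @store.setex(key, TTL, payload)
    JSON.parse(payload)
  end
end
```

The block passed to `fetch` is the only place that touches blob data, so a cache hit never loads the blob.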

For diffs we may also need to cache extra data to prevent loading data from disk. Data would include the old/new mode, the file name, etc.

The first step in this process would be to find out which checks are performed for a blob when displaying a diff. Those checks should then be listed in this issue so we have a clear overview.

Edited Jun 13, 2025 by 🤖 GitLab Bot 🤖