Consider detecting binaryness of files in gitaly
Problem
In https://gitlab.com/gitlab-org/gitlab/-/issues/326316 we identified that one source of significant performance drag when rendering commit diffs is a function that probes Git blobs to detect whether they represent a binary file or something else. It does so via https://github.com/brianmario/charlock_holmes, a Ruby gem with a C extension that reads up to 8 kB into the byte string to make this decision.
Since we do this for every blob we fetch, and since we sometimes need to fetch a blob twice in different revisions (e.g. for diff rendering), this may happen thousands of times when rendering commit diffs. For one test commit, we found that more than 50% of all CPU time spent rendering a diff of 3000 files went to this function (EncodingHelper::detect_libgit2_binary?).
Proposal
We are already looking to speed up this logic by caching the results of this call in !60128 (merged). However, this will only speed up subsequent calls and does not reduce the actual work required initially.
I think we should consider moving this check down the stack, e.g. into Gitaly, to make it more efficient, and return a boolean or, even better, a MIME type along with the Gitaly RPC response. I also suspect that binary files are much less common in version control systems than e.g. text files, so I wonder if there are opportunities to also perform this check less frequently or in a more targeted fashion.
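
To make the idea more concrete, here is a minimal sketch in Go (the language Gitaly is written in) of what such a check could look like. Everything in it is an assumption for illustration: `BlobInfo`, `looksBinary`, `classifyBlob` and `probeLimit` are hypothetical names, not existing Gitaly code, and the NUL-byte probe is a simple stand-in heuristic rather than what charlock_holmes does today. The MIME type comes from the standard library's `http.DetectContentType`, which only considers the first 512 bytes.

```go
package main

import (
	"bytes"
	"net/http"
)

// probeLimit caps how many leading bytes are inspected.
// Assumption: chosen to roughly match the ~8 kB window mentioned above.
const probeLimit = 8000

// BlobInfo is a hypothetical shape for the extra data an RPC response
// could carry; it is not an existing Gitaly message.
type BlobInfo struct {
	Binary   bool
	MIMEType string
}

// looksBinary reports whether a NUL byte occurs within the first
// probeLimit bytes of data. This is a stand-in heuristic only.
func looksBinary(data []byte) bool {
	if len(data) > probeLimit {
		data = data[:probeLimit]
	}
	return bytes.IndexByte(data, 0) >= 0
}

// classifyBlob combines the binary probe with MIME sniffing from the
// Go standard library (which looks at no more than 512 bytes).
func classifyBlob(data []byte) BlobInfo {
	return BlobInfo{
		Binary:   looksBinary(data),
		MIMEType: http.DetectContentType(data),
	}
}
```

With something like this, a caller would run `classifyBlob` once on the blob bytes it has already read and attach the result to the response, so the Rails side would not need to probe the blob again.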