Resolve Incident 19090 - allow for special characters in commit diffs
What does this MR do and why?
Fixes Incident 19090.
Reviewer note: this MR looks large, but it mostly reintroduces the changes from the original MR, with additional logic to account for possible special characters.
In that incident, Ultimate customers with secret push protection (SPP) enabled had their pushes rejected with a 500 error when the commit included a special character, such as the em-dash (—) or the trademark symbol (™).
The error was an Encoding::UndefinedConversionError, raised when the code attempted to create an instance of Gitlab::SecretDetection::GRPC::ScanRequest::Payload from DiffBlobs data containing the special character. The data was encoded as ASCII-8BIT, and the gRPC libraries were attempting to convert it to UTF-8 and failing. This did not seem to occur with SPP disabled, thus parsing the whole file.
In this MR we call force_encoding('UTF-8') on the diff data if it isn't already encoded as UTF-8.
force_encoding simply marks the string as having that encoding; it does no validation and does not change the bytes. String#encode attempts to transcode the string into the desired encoding, but in this situation that fails exactly as we saw in production from the gRPC object creation. Since the data appears to come from Gitaly as Unicode but is interpreted as ASCII-8BIT, using force_encoding appears to work.
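To illustrate the difference between the two calls, here is a minimal standalone sketch. It is not the code in this MR: the helper name mark_as_utf8 and the sample string are hypothetical, and the gRPC payload is left out entirely. It only shows that String#encode raises on ASCII-8BIT data containing multi-byte characters, while force_encoding relabels the same bytes as UTF-8.

```ruby
# Hypothetical helper, not the MR's actual code: relabel a string as UTF-8
# without touching its bytes, unless it is already tagged as UTF-8.
def mark_as_utf8(data)
  return data if data.encoding == Encoding::UTF_8

  # dup so the caller's string keeps its original encoding tag
  data.dup.force_encoding(Encoding::UTF_8)
end

# UTF-8 bytes for an em-dash (E2 80 94) and a trademark sign (E2 84 A2),
# tagged as ASCII-8BIT to mimic the diff data described above.
raw = "key \xE2\x80\x94 value\xE2\x84\xA2".b

# Transcoding fails: bytes >= 0x80 have no defined mapping from ASCII-8BIT.
transcode_error =
  begin
    raw.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

fixed = mark_as_utf8(raw)

puts transcode_error.class     # Encoding::UndefinedConversionError
puts fixed.encoding            # UTF-8
puts fixed.valid_encoding?     # true
puts fixed.bytes == raw.bytes  # true
```

Because force_encoding performs no validation, a caller that cannot assume the bytes are well-formed UTF-8 could check valid_encoding? afterwards; whether that is needed here depends on what Gitaly can return.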
References
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
See the Steps to reproduce on the related issue.
Related to #512315 (closed)