Redesign content headers recognition
At the moment, we detect the content type with the DetectContentType function from Go's net/http package, but the set of MIME types it recognizes is quite small. One example is https://gitlab.com/gitlab-org/gitlab-ce/issues/57041, where docx files are reported as plain application/zip instead of their proper MIME type.
Another problem is the size of the header we use to detect the content type. For most file types 512 bytes is enough, but for text-based types like SVG, the way the file is laid out can determine whether it gets the right content type. See https://gitlab.com/gitlab-org/gitlab-ce/issues/56701 for an example.
We have tested other libraries like filetype, magicmime or mimetype. In all of them, there are edge cases and problems depending on the header size we use to detect the content.
That's why we should redesign the way we actually perform this recognition. Because Workhorse doesn't have all the information about the file, like the filename when we deal with blobs, we should start this process in Rails.
Rails will determine and set the content headers based on the file extension. Those headers will go to Workhorse, which will try to detect the content type from the content itself. If there is a severe disagreement between the headers coming from Rails and the ones Workhorse has detected, we will use the Workhorse ones. For example, if we have an SWF file named foo.txt, Rails will set the content type to text/plain. Workhorse will then inspect the content and detect it as application/octet-stream. In this case, because there is a big difference between text/* and application/*, Workhorse will rewrite the content type and set it to application/octet-stream.
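One possible shape for the Workhorse side, as a rough sketch. It assumes that "severe disagreement" means the top-level types differ (text vs application), which is only one reading of the example above. The function names reconcileContentType and topLevelType are hypothetical.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// topLevelType returns the part before the slash, for example "text" for
// "text/plain; charset=utf-8". It returns "" if the value cannot be parsed.
func topLevelType(contentType string) string {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.SplitN(mediaType, "/", 2)[0]
}

// reconcileContentType keeps the type Rails derived from the file extension
// unless its top-level type differs from what Workhorse detected.
// Assumption: a different top-level type counts as a severe disagreement.
func reconcileContentType(fromRails, detected string) string {
	if topLevelType(fromRails) != topLevelType(detected) {
		return detected
	}
	return fromRails
}

func main() {
	// SWF content uploaded as foo.txt: text/* vs application/*.
	fmt.Println(reconcileContentType("text/plain", "application/octet-stream"))

	// docx: both sides agree on application/*, so the Rails value is kept.
	docx := "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
	fmt.Println(reconcileContentType(docx, "application/zip"))
}
```

With this rule the more specific Rails value survives whenever the sniffer merely returns a coarser type from the same family. Where exactly to draw the line for "severe" is left open here.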
/cc @stanhu