Redesign content headers recognition (#325074) · Issues · GitLab.org / GitLab

Redesign content headers recognition

At the moment, we detect the content type with `DetectContentType` from Go's `net/http` package, but the set of MIME types it recognizes is quite small. One example is https://gitlab.com/gitlab-org/gitlab-ce/issues/57041, where `docx` files are reported as plain `application/zip` instead of their proper MIME type.

Another problem is the size of the header we use to detect the content type. For most file types 512 bytes is enough, but for text-based types like SVG, how the file is built can determine whether it gets the right content type. You can see an example in https://gitlab.com/gitlab-org/gitlab-ce/issues/56701.

We have tested other libraries like [filetype](https://github.com/h2non/filetype), [magicmime](https://github.com/rakyll/magicmime) or [mimetype](https://github.com/gabriel-vasile/mimetype). All of them have edge cases and problems that depend on the header size we use to detect the content. That's why we should redesign the way we perform this recognition.

Because Workhorse doesn't have all the information about the file (for example, the filename when we deal with blobs), we should start this process in Rails. Rails will determine and set the content headers based on the file extension. Those headers will go to Workhorse, which will try to detect the content type based on the content itself. If there is a severe disagreement between the headers coming from Rails and the ones Workhorse has detected, we will use the Workhorse ones.

For example, if we have an SWF file named `foo.txt`, Rails will set the content type to `text/plain`. Workhorse will then inspect the content and detect it as `application/octet-stream`. In this case, because there is a big difference between `text/*` and `application/*`, Workhorse will rewrite the content type and set it to `application/octet-stream`.

/cc @stanhu
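To make the proposed flow more concrete, here is a minimal Go sketch of how the Workhorse side could reconcile the two content types. It is only an illustration under stated assumptions: `resolveContentType` and `topLevel` are hypothetical names, and "severe disagreement" is assumed here to mean that the top-level media type (`text`, `application`, `image`, ...) differs. This issue does not define that rule, and a real rule would probably need more nuance, since sniffing alone can classify an SVG as a text type.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// topLevel returns the top-level media type, e.g. "text" for
// "text/plain; charset=utf-8". It returns "" if the value cannot be parsed.
func topLevel(contentType string) string {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.SplitN(mediaType, "/", 2)[0]
}

// resolveContentType is a hypothetical helper, not existing Workhorse code.
// railsType is the header Rails derived from the file extension; body is the
// start of the file content as seen by Workhorse.
func resolveContentType(railsType string, body []byte) string {
	// http.DetectContentType considers at most the first 512 bytes.
	detected := http.DetectContentType(body)

	// Assumed rule: a different top-level type is a "severe" disagreement,
	// so the sniffed type wins. Otherwise keep the more specific Rails type.
	if topLevel(railsType) != topLevel(detected) {
		return detected
	}
	return railsType
}

func main() {
	// Content starting with an SWF signature followed by binary data,
	// uploaded under the name foo.txt, so Rails says text/plain.
	swf := append([]byte("FWS"), make([]byte, 64)...)
	fmt.Println(resolveContentType("text/plain", swf))

	// A docx file is a ZIP container; sniffing reports application/zip,
	// which agrees with Rails at the top level, so the Rails type is kept.
	docx := append([]byte("PK\x03\x04"), make([]byte, 64)...)
	docxType := "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
	fmt.Println(resolveContentType(docxType, docx))
}
```

With this rule the `foo.txt` example is rewritten to `application/octet-stream`, while the `docx` case keeps the more specific type set by Rails instead of falling back to `application/zip`.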
