Improve Content-Type detection for CI artifacts and Git blobs

Stan Hu requested to merge sh-relax-content-type into master

What does this MR do and why?

In GitLab 11.6, via gitlab-workhorse!335 (merged), Workhorse began detecting a blob's Content-Type by proxying the download and examining the first 512 bytes. This was done to thwart a security issue in which certain types of files could trick the browser into rendering a file inline and executing it (https://gitlab.com/gitlab-org/gitlab-foss/-/issues/36103).

However, the detection mechanism relies on Go's http.DetectContentType([]byte) function, which implements the algorithm described in https://mimesniff.spec.whatwg.org/. This algorithm can only detect a small number of types, so files can end up labeled with the wrong Content-Type.

For example, this Workhorse change caused Microsoft Word .docx files to be served with a Content-Type of application/zip instead of application/vnd.openxmlformats-officedocument.wordprocessingml.document. A similar problem exists for other document formats.
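
As a minimal sketch of why this happens (not Workhorse's actual code): a .docx file is a ZIP container, so its first bytes are the ZIP magic number, and that is all http.DetectContentType has to go on. The helper name sniff and the sample bytes are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff is a hypothetical helper that wraps Go's built-in sniffer.
// http.DetectContentType itself considers at most the first 512 bytes.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	// Illustrative prefix only: a .docx starts with the ZIP local file
	// header signature "PK\x03\x04", like any other ZIP archive.
	docxPrefix := []byte("PK\x03\x04 remaining archive bytes")

	fmt.Println(sniff(docxPrefix)) // application/zip
}
```

Nothing in the sniffed prefix distinguishes a Word document from a generic ZIP archive, which is why the more specific type is lost.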

To fix this, we have a few options:

  1. Improve Workhorse's Content-Type detection.
  2. Use the filename extension as a hint about which specific type is used. This was attempted in gitlab-workhorse!478 (closed), which was closed due to complexity.

For the first option, libmagic, which we already use via ruby-magic, seems ideal. However, libmagic relies on cgo bindings, and we'd have to make Workhorse build this archive properly. In addition, the detection mechanism currently uses only the first 512 bytes, which is a bit arbitrary. We may need to increase this to 1024 bytes, or even to a megabyte, if we want libmagic to work.

For now, the simplest solution is to use the extension as a hint. If both the detected Content-Type and the extension-based hint are application/*, we go with the hint, since it's more likely to carry the application-level knowledge of what's correct.
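
The rule can be sketched roughly as follows. This is an illustration in Go under stated assumptions, not the code in this MR: the extensionHints table, the chooseContentType function, and the file names are all hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extensionHints is a hypothetical, deliberately tiny lookup table.
// A real implementation would use a complete MIME type database.
var extensionHints = map[string]string{
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

func isApplication(contentType string) bool {
	return strings.HasPrefix(contentType, "application/")
}

// chooseContentType prefers the extension hint only when both the detected
// type and the hint are application/*. Otherwise the detected type wins, so
// a file sniffed as text/html cannot be relabeled by renaming it.
func chooseContentType(detected, filename string) string {
	hint, ok := extensionHints[strings.ToLower(filepath.Ext(filename))]
	if ok && isApplication(detected) && isApplication(hint) {
		return hint
	}
	return detected
}

func main() {
	fmt.Println(chooseContentType("application/zip", "report.docx"))
	fmt.Println(chooseContentType("text/html", "report.docx"))
	fmt.Println(chooseContentType("application/zip", "archive.zip"))
}
```

Restricting the override to the application/* case keeps the sniffed result authoritative for types such as text/html, which is the case the original security fix was concerned with.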

Relates to #26448 (closed)

How to set up and validate locally

  1. Open a project and upload a .docx file.
  2. In the repository tree, click on the .docx file. The browser should report a Content-Type of application/zip.
  3. Change to this branch.
  4. Enable the feature flag in the Rails console: Feature.enable(:use_content_type_from_filename).
  5. The browser should show the right Content-Type:

[Screenshot: browser showing the corrected Content-Type]

Repeat for CI job artifacts, snippets, etc.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Stan Hu