Non-ASCII characters corrupted in filenames for wikis and blobs

When filenames for wikis and blobs contain non-ASCII characters, the names are corrupted in the Elasticsearch index for those blobs, which breaks searching by title.

I created a wiki page with the title Møle and the body Testing Møle in the title for Elasticsearch. In the GitLab Search UI, the title of this wiki page is displayed as Møle. I queried the Elasticsearch result in the Rails console:

# Search the wiki blobs as the admin user for a word from the page body.
user = User.find_by_username('root')
s = SearchService.new(user, search: 'testing', scope: 'wiki_blobs')
pp s.search_objects.to_a

[#<Elasticsearch::Model::Response::Result:0x00007f2757c47a90
  @result=
   {"_index"=>"gitlab-production",
    "_type"=>"doc",
    "_id"=>"Møle",
    "_score"=>4.088106,
    "_routing"=>"project_188",
    "_source"=>
     {"blob"=>
       {"type"=>"wiki_blob",
        "oid"=>"9ea83eccaa30d8b5e084191be0af93e76aad53ee",
        "rid"=>"wiki_188",
        "commit_sha"=>"0ac80d94b560905f7dd17036eedac27b44f04d25",
        "content"=>
         "Testing Møle in the title for Elasticsearch",
        "path"=>"Møle.md",
        "file_name"=>"Møle.md",
        "language"=>"Markdown"},
      "join_field"=>{"name"=>"wiki_blob", "parent"=>"project_188"},
      "project_id"=>188,
      "type"=>"wiki_blob"},
    "highlight"=>
     {"blob.content"=>
       ["gitlabelasticsearch→Testing←gitlabelasticsearch Møle in the title for Elasticsearch"]}}>]

The body of the wiki blob is not affected, but its path and filename are corrupted. I'm not sure whether this is caused by the path handling on the GitLab side, the encoding step on the indexer side, or something else. I am using hashed storage for this project:

project = Project.find 188
repository = project.wiki.repository
repository.disk_path
=> "@hashed/d6/06/d6061bbee6cf13bd73765faaea7cdd0af1323e4b125342ac346047f7c4bda1fc.wiki"
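The corrupted name looks like what you get when the UTF-8 bytes of the filename are read as ISO-8859-1 and then converted to UTF-8 a second time. The snippet below is a minimal reproduction in plain Ruby, outside GitLab and the indexer. The variable names are made up for illustration, and the snippet only shows that this kind of double encoding yields the same string. It does not show where in the pipeline it happens.

```ruby
# Hypothetical reproduction; none of this is taken from GitLab or the indexer.
original = "M\u00F8le.md" # "Møle.md" as a UTF-8 string

# Mislabel the UTF-8 bytes as ISO-8859-1, then transcode them to UTF-8.
corrupted = original.dup
                    .force_encoding(Encoding::ISO_8859_1)
                    .encode(Encoding::UTF_8)

# Undo it: map each character back to a single byte, then read the bytes as UTF-8.
repaired = corrupted.encode(Encoding::ISO_8859_1)
                    .force_encoding(Encoding::UTF_8)

puts corrupted # Møle.md
puts repaired  # Møle.md
```

Because the round trip restores the original name, the indexed value appears to be double encoded rather than truncated or lossy.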

The indexer uses the following method to encode the filename: https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/blob/master/indexer/encoding.go
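One way to avoid this class of corruption is to check whether the bytes are already valid UTF-8 before attempting any conversion. This is only a sketch of that idea in Ruby. It is not the indexer's implementation (which is written in Go), the method name `to_utf8` is hypothetical, and the ISO-8859-1 fallback is an assumption.

```ruby
# Hypothetical helper; the name and the fallback encoding are assumptions.
def to_utf8(bytes, fallback: Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  # Already valid UTF-8: keep the bytes untouched instead of transcoding again.
  return candidate if candidate.valid_encoding?

  # Otherwise treat the bytes as the fallback encoding and convert them.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

utf8_name   = "M\xC3\xB8le.md".b # UTF-8 bytes of "Møle.md"
latin1_name = "M\xF8le.md".b     # ISO-8859-1 bytes of "Møle.md"

puts to_utf8(utf8_name)   # Møle.md
puts to_utf8(latin1_name) # Møle.md
```

With this ordering, a filename that is already UTF-8 is never passed through a second conversion, which is the step that would produce Møle.md.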

Customer ticket that initially reported this issue: https://gitlab.zendesk.com/agent/tickets/133684 (internal use only)

Let me know if I can provide more information.

Edited by Dylan Griffith