Non-ASCII characters corrupted in filenames for wikis and blobs
When wiki or blob filenames contain non-ASCII characters, the filenames are corrupted in the Elasticsearch index for the blobs, which affects searching by title.
I created a wiki page with the title Møle and the body "Testing Møle in the title for Elasticsearch". In the GitLab Search UI, the page title is displayed as Møle. I queried the Elasticsearch result in the Rails console:
user = User.find_by_username('root')
s = SearchService.new(user, {:search => 'testing', :scope => 'wiki_blobs'})
pp s.search_objects.to_a
[#<Elasticsearch::Model::Response::Result:0x00007f2757c47a90
@result=
{"_index"=>"gitlab-production",
"_type"=>"doc",
"_id"=>"Møle",
"_score"=>4.088106,
"_routing"=>"project_188",
"_source"=>
{"blob"=>
{"type"=>"wiki_blob",
"oid"=>"9ea83eccaa30d8b5e084191be0af93e76aad53ee",
"rid"=>"wiki_188",
"commit_sha"=>"0ac80d94b560905f7dd17036eedac27b44f04d25",
"content"=>
"Testing Møle in the title for Elasticsearch",
"path"=>"Møle.md",
"file_name"=>"Møle.md",
"language"=>"Markdown"},
"join_field"=>{"name"=>"wiki_blob", "parent"=>"project_188"},
"project_id"=>188,
"type"=>"wiki_blob"},
"highlight"=>
{"blob.content"=>
["gitlabelasticsearch→Testing←gitlabelasticsearch Møle in the title for Elasticsearch"]}}>]
The body of the wiki blob is not affected, but its path and filename are corrupted. I'm not sure whether this is caused by the path on the GitLab side, the encoding on the indexer side, or something else. I am using hashed storage for this project:
project = Project.find 188
repository = project.wiki.repository
repository.disk_path
=> "@hashed/d6/06/d6061bbee6cf13bd73765faaea7cdd0af1323e4b125342ac346047f7c4bda1fc.wiki"
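To help narrow down which side introduces the corruption, it may be useful to compare the raw bytes of the filename at each stage (as stored in the repository, and as returned from the index). The helper below is only a sketch: `describe_string` is a hypothetical name, not part of GitLab, and the sample string stands in for whatever value is being inspected.

```ruby
# Hypothetical helper (not a GitLab API): summarize how Ruby sees a string.
def describe_string(str)
  {
    encoding: str.encoding.name,                              # encoding label attached to the string
    valid: str.valid_encoding?,                               # are the bytes valid for that label?
    hex: str.bytes.map { |b| format('%02X', b) }.join(' ')    # raw bytes in hex
  }
end

# Sample value standing in for a filename; "\u00F8" is the letter ø.
pp describe_string("M\u00F8le.md")
```

A correctly encoded UTF-8 filename should show ø as the two bytes C3 B8. If the same character shows up as four bytes (C3 83 C2 B8), the string has probably been decoded with the wrong charset and re-encoded somewhere along the way.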
The indexer uses the following method to encode the filename: https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer/blob/master/indexer/encoding.go
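The corrupted form seen in the index is consistent with the UTF-8 bytes of the filename being interpreted as a single-byte charset such as ISO-8859-1 and then re-encoded as UTF-8. This is only a hypothesis about the mechanism, not a confirmed root cause, and the indexer itself is written in Go; the Ruby sketch below just reproduces the symptom so it can be compared against the indexed value. The method names `mojibake` and `repair` are made up for illustration.

```ruby
# Assumption: the corruption is a UTF-8 -> ISO-8859-1 misinterpretation.
# Reinterpret the UTF-8 bytes as ISO-8859-1, then transcode to UTF-8.
def mojibake(utf8_string)
  utf8_string.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Reverse the process: transcode back to single bytes, then relabel as UTF-8.
def repair(corrupted_string)
  corrupted_string.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

original  = "M\u00F8le.md"        # the filename as created (ø is U+00F8)
corrupted = mojibake(original)

puts corrupted                    # the double-encoded form
puts repair(corrupted) == original
```

If the value produced by `mojibake` matches the `path` and `file_name` stored in the index byte for byte, that would point toward a charset misdetection during indexing rather than a problem with hashed storage.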
Customer ticket that initially reported this issue: https://gitlab.zendesk.com/agent/tickets/133684 (internal use only)
Let me know if I can provide more information.