Draft: Fix case insensitive search containing diacritics by changing FTS4 tokenizer (!1404) · Merge requests · F-Droid / Client

Tobias_Groza requested to merge Tobias_Groza/fdroidclient:fts4-search into master Jun 15, 2024

The default simple tokenizer used by Room/SQLite3 FTS4 folds case only for ASCII characters, so searches containing non-ASCII letters, such as letters with diacritics, are not case-insensitive. The SQLite FTS3/4 documentation says:

All uppercase characters within the ASCII range (Unicode codepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer.

[...]

The "unicode61" tokenizer [...] works very much like "simple" except that it does simple unicode case folding according to rules in Unicode Version 6.1 and it recognizes unicode space and punctuation characters and uses those to separate tokens. The simple tokenizer only does case folding of ASCII characters and only recognizes ASCII space and punctuation characters as token separators.

The remove_diacritics option may be set to "0", "1" or "2". The default value is "1". If it is set to "1" or "2", then diacritics are removed from Latin script characters as described above. However, if it is set to "1", then diacritics are not removed in the fairly uncommon case where a single unicode codepoint is used to represent a character with more than one diacritic. [...] This is technically a bug, but cannot be fixed without creating backwards compatibility problems. If this option is set to "2", then diacritics are correctly removed from all Latin characters.
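The behaviour described in the quoted documentation can be checked outside the app. The following is a minimal sketch using Python's bundled sqlite3 module, not code from this merge request. It assumes a SQLite build with FTS4 and the unicode61 tokenizer enabled, and version 3.27 or later for remove_diacritics=2. The table name demo, the helper matches and the sample words are made up for illustration.

```python
import sqlite3


def matches(tokenize: str, indexed_text: str, query: str) -> bool:
    """Index one string in a throwaway FTS4 table and report whether `query` finds it."""
    conn = sqlite3.connect(":memory:")
    try:
        # `tokenize` is interpolated directly into the DDL; acceptable only
        # because this demo passes fixed, trusted strings.
        conn.execute(
            f"CREATE VIRTUAL TABLE demo USING fts4(name, tokenize={tokenize})"
        )
        conn.execute("INSERT INTO demo(name) VALUES (?)", (indexed_text,))
        row = conn.execute(
            "SELECT count(*) FROM demo WHERE demo MATCH ?", (query,)
        ).fetchone()
        return row[0] > 0
    finally:
        conn.close()


# Escapes are used so the strings are guaranteed to be precomposed code points.
E_ACUTE_UPPER = "\u00c9clair"  # "Eclair" with capital E acute (U+00C9)
E_ACUTE_LOWER = "\u00e9clair"  # "eclair" with small e acute (U+00E9)
MOT = "M\u1ed9t"               # U+1ED9: o with circumflex and dot below

if __name__ == "__main__":
    cases = [
        ("simple", "Eclair", "ECLAIR"),
        ("simple", E_ACUTE_UPPER, E_ACUTE_LOWER),
        ("unicode61", E_ACUTE_UPPER, E_ACUTE_LOWER),
        ("unicode61", E_ACUTE_UPPER, "eclair"),
        ('unicode61 "remove_diacritics=1"', MOT, "mot"),
        ('unicode61 "remove_diacritics=2"', MOT, "mot"),
    ]
    for tokenize, text, query in cases:
        print(f"{tokenize!r}: {text!r} found by {query!r}? "
              f"{matches(tokenize, text, query)}")
```

Going by the quoted documentation, the simple tokenizer should only match the pure ASCII pair, unicode61 should match regardless of case and diacritics, and the word containing U+1ED9 (two diacritics on one code point) should be found by its plain ASCII spelling only with remove_diacritics=2.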

This change makes use of the intended behaviour by replacing the previously used simple tokenizer with the unicode61 tokenizer and its remove_diacritics=2 option, which keeps the behaviour otherwise similar to the current one. Because the tokenizer is part of the FTS table definition, a migration is necessary to recreate the AppMetadataFts table with the new tokenizer.
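What such a migration does at the SQL level can be sketched as follows. This is an illustration under assumptions, not the migration from this merge request: the columns name and summary, the external-content setup and the helper names are hypothetical, and the real AppMetadataFts schema may differ. In the app, the equivalent statements would presumably be issued from a Room migration rather than from Python.

```python
import sqlite3


def create_schema_v1(conn: sqlite3.Connection) -> None:
    """Old schema: FTS4 index over AppMetadata using the simple tokenizer."""
    # Hypothetical, simplified stand-ins for the real tables.
    conn.execute("CREATE TABLE AppMetadata (name TEXT, summary TEXT)")
    conn.execute(
        'CREATE VIRTUAL TABLE AppMetadataFts USING fts4('
        'name, summary, content="AppMetadata", tokenize=simple)'
    )


def migrate_fts_tokenizer(conn: sqlite3.Connection) -> None:
    """Recreate the FTS table with unicode61 and re-index the existing rows."""
    with conn:  # commit on success, roll back on error
        # The tokenizer cannot be altered in place, so drop the index table.
        # The content table AppMetadata is left untouched.
        conn.execute("DROP TABLE IF EXISTS AppMetadataFts")
        conn.execute(
            'CREATE VIRTUAL TABLE AppMetadataFts USING fts4('
            'name, summary, content="AppMetadata", '
            'tokenize=unicode61 "remove_diacritics=2")'
        )
        # FTS4 special command: rebuild the index from the content table.
        conn.execute(
            "INSERT INTO AppMetadataFts(AppMetadataFts) VALUES ('rebuild')"
        )


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return the docids of all rows matching the full-text query."""
    rows = conn.execute(
        "SELECT docid FROM AppMetadataFts WHERE AppMetadataFts MATCH ?",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

A test for the migration could follow the same shape: create the old schema, insert a row whose name contains an uppercase letter with a diacritic, run the migration, and check that a lowercase query now finds the row.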

Fixes #2636

TODO: add test for migration

Edited Jun 17, 2024 by Tobias_Groza
