detectLanguage: check the text's script for definitive language categorization for some languages (!635) · Merge requests · Soapbox / Ditto

Take advantage of the fact that some languages are written in a script that no other language uses. If we detect such a script, we can categorize the message without calling lande.

This only works for some languages, and only if the text is entirely in that language's script. For example, "こんにちは hello!" would fail the regex because it contains non-Japanese characters.

In the case of Japanese, at least one Hiragana or Katakana character must be present in order to distinguish it from Chinese.
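A minimal sketch of the idea, not the code in this merge request: the names `detectByScript` and `SCRIPT_PATTERNS` are hypothetical, the Korean entry is only an illustrative second language, and ignoring whitespace, punctuation, digits, and symbols before matching is an assumption. Apart from those ignored characters, every character must belong to the language's script, and Japanese additionally requires at least one kana character.

```typescript
// Hypothetical sketch: script-based detection using Unicode property escapes.
const KANA = "\\p{Script=Hiragana}\\p{Script=Katakana}";
const HAN = "\\p{Script=Han}";

// Assumption: whitespace, punctuation, digits, and symbols are ignored.
const IGNORED = new RegExp("[\\s\\p{P}\\p{N}\\p{S}]", "gu");

// Each pattern must match the whole remaining text.
const SCRIPT_PATTERNS: Array<{ lang: string; pattern: RegExp }> = [
  // Japanese: only kana and Han, with at least one kana to rule out Chinese.
  { lang: "ja", pattern: new RegExp(`^(?=.*[${KANA}])[${KANA}${HAN}]+$`, "u") },
  // Korean: illustrative assumption, Hangul only.
  { lang: "ko", pattern: new RegExp("^[\\p{Script=Hangul}]+$", "u") },
];

// Returns a language code when the script is conclusive, otherwise undefined
// so the caller can fall back to a statistical detector such as lande.
function detectByScript(text: string): string | undefined {
  const letters = text.replace(IGNORED, "");
  if (letters.length === 0) return undefined;
  for (const { lang, pattern } of SCRIPT_PATTERNS) {
    if (pattern.test(letters)) return lang;
  }
  return undefined;
}
```

In this sketch, text that mixes scripts or consists only of Han characters yields `undefined`, which signals that the regular detection path should run.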

Possible TODO for the future:

If the text contains only Han characters, pass ['ja', 'zh'] to lande as the only possible options.
Similarly, if the text contains only Cyrillic characters, we know it can only be one of several languages (including Russian), but that list is more extensive: ['bg', 'kk', 'ky', 'mk', 'mn', 'ru', 'sr', 'tg', 'tk', 'tt', 'uk', 'uz']
Basically, we can do a first pass over the text to determine which languages it could possibly be (and, by extension, which it couldn't possibly be), and use that to limit the options lande considers before processing the text.
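The first pass described in this TODO could look roughly like the following. This is a sketch under stated assumptions: `candidateLanguages` is a hypothetical helper, ignoring whitespace, punctuation, digits, and symbols is an assumption, and how the resulting list would be handed to lande is left open because that depends on lande's API.

```typescript
// Hypothetical sketch: narrow the possible languages by script before detection.
const HAN_LANGS = ["ja", "zh"];
const CYRILLIC_LANGS = [
  "bg", "kk", "ky", "mk", "mn", "ru", "sr", "tg", "tk", "tt", "uk", "uz",
];

// Assumption: whitespace, punctuation, digits, and symbols are ignored.
const NON_LETTER = new RegExp("[\\s\\p{P}\\p{N}\\p{S}]", "gu");
const HAN_ONLY = new RegExp("^[\\p{Script=Han}]+$", "u");
const CYRILLIC_ONLY = new RegExp("^[\\p{Script=Cyrillic}]+$", "u");

// Returns the languages the text could be, or undefined when the script
// alone does not narrow the field.
function candidateLanguages(text: string): string[] | undefined {
  const letters = text.replace(NON_LETTER, "");
  if (letters.length === 0) return undefined;
  if (HAN_ONLY.test(letters)) return HAN_LANGS;
  if (CYRILLIC_ONLY.test(letters)) return CYRILLIC_LANGS;
  return undefined;
}
```

A caller would use the returned list to restrict the detector's options, and run unrestricted detection when the result is `undefined`.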
