Feature request: Determine charset automatically based on language
I suppose you don't want to use an encoding detector library, such as uchardet, because of its large size. Such a library aims to determine the encoding of text without any additional information. But in MKVToolNix there is a language property, which provides enough additional information to implement a simple automatic charset detection. I think it would be far easier for a regular user to specify only the language and not worry about the charset, as most users are probably not even familiar with the concept of character sets.
The implementation could be quite simple: store, for each language,
- the range of Unicode code points its script uses
- its likely legacy charsets
Then mkvmerge could try each candidate charset for the language and determine which conversion to UTF-8 was the most successful, based on the number of decoded characters that fall into the language's code point range.
Examples:
- Arabic | range: U+0600 to U+06FF | charsets: WINDOWS-1256, ISO-8859-6
- Bulgarian | range: U+0410 to U+044F | charsets: WINDOWS-1251, ISO-8859-5
- Chinese | range: U+4E00 to U+9FFF | charsets: ISO-2022-CN, BIG5, EUC-TW, GB18030, HZ-GB-2312
- Greek | range: U+0391 to U+03C9 | charsets: WINDOWS-1253, ISO-8859-7
- Hebrew | range: U+05D0 to U+05EA | charsets: WINDOWS-1255, ISO-8859-8
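A minimal sketch of the proposed scoring, in Python. The table mirrors some of the examples above; the function name, table layout, and scoring formula are illustrative assumptions, not mkvmerge's actual code. Each candidate charset decodes the raw bytes, and the winner is the one whose decoded letters fall most often into the language's expected code point range:

```python
# Hypothetical sketch: pick the charset whose decoding puts the most
# letters into the language's expected Unicode code point range.
# Language table and names are illustrative, taken from this request.
LANGUAGE_HINTS = {
    "ara": (range(0x0600, 0x0700), ["windows-1256", "iso-8859-6"]),
    "bul": (range(0x0410, 0x0450), ["windows-1251", "iso-8859-5"]),
    "ell": (range(0x0391, 0x03CA), ["windows-1253", "iso-8859-7"]),
    "heb": (range(0x05D0, 0x05EB), ["windows-1255", "iso-8859-8"]),
}

def guess_charset(data: bytes, language: str):
    code_points, charsets = LANGUAGE_HINTS[language]
    best_charset, best_score = None, -1.0
    for charset in charsets:
        try:
            text = data.decode(charset)
        except UnicodeDecodeError:
            continue  # hard failure: charset cannot represent these bytes
        letters = [c for c in text if c.isalpha()]
        if not letters:
            continue
        # fraction of letters inside the language's code point range
        score = sum(ord(c) in code_points for c in letters) / len(letters)
        if score > best_score:
            best_charset, best_score = charset, score
    return best_charset
```

For example, `guess_charset("Привет".encode("windows-1251"), "bul")` should prefer WINDOWS-1251, because decoding the same bytes as ISO-8859-5 shifts some letters outside U+0410 to U+044F. Real subtitle files also contain ASCII digits and punctuation, which is why only alphabetic characters are scored.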
Algorithm suggestion: