Wrong UTF-8 detection in function make_utf8bytes
The function make_utf8bytes does not correctly identify UTF-8 encoding. Take the Polish special character 'ń', for example in the name Anna Kańtoch (a Polish author). make_bytestr encodes it correctly as: Anna Ka%C5%84toch. However, make_utf8bytes then fails to recognize it as UTF-8 in these lines:
if ((ch == '\xC2') | (ch == '\xC3')) & ((chx >= '\xA0') & (chx <= '\xFF')): return name, 'UTF-8'
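As a result, the name is interpreted as ISO-8859, which leads to a wrong string. The check only accepts the lead bytes 0xC2 and 0xC3 followed by a byte in the range 0xA0 to 0xFF, whereas 'ń' is encoded as 0xC5 0x84.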
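To make the failure concrete, here is a minimal sketch (Python 3, standard library only) that prints the UTF-8 bytes of 'ń' and re-evaluates the quoted condition on them. The helper name old_check is hypothetical: it only mirrors the condition quoted above, using integer byte values, and is not the project's actual code.

```python
def old_check(lead, trail):
    """Mirror of the quoted condition on two integer byte values (hypothetical helper)."""
    return lead in (0xC2, 0xC3) and 0xA0 <= trail <= 0xFF


encoded = "ń".encode("utf-8")
lead, trail = encoded[0], encoded[1]

print(encoded.hex())           # c584
print(old_check(lead, trail))  # False: 0xC5 is not an accepted lead byte
print(old_check(0xC3, 0xA9))   # True: 'é' (0xC3 0xA9) is one of the few cases that pass
```

Any character whose UTF-8 lead byte is not 0xC2 or 0xC3 falls through this check in the same way.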
Proposed solutions (one of):
- Use a dedicated library to detect the encoding, e.g. chardet:
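    import chardet

    def make_utf8bytes(txt):
        name = make_bytestr(txt)
        result = chardet.detect(name)
        # chardet reports None when it cannot guess; treat that as UTF-8
        detected_encoding = result['encoding'] or 'utf-8'
        # ASCII is a subset of UTF-8, so only other encodings need converting
        if detected_encoding.lower() not in ('utf-8', 'ascii'):
            name = name.decode(detected_encoding).encode('utf-8')
        return name, detected_encoding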
- OR - fix the UTF-8 recognition itself; the current check is NOT sufficient
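One possible way to fix the recognition without an extra dependency is to let Python's strict UTF-8 decoder decide, since it rejects any invalid byte sequence. The sketch below assumes Python 3 and bytes input. The names looks_like_utf8 and to_utf8_bytes are hypothetical, and the ISO-8859-1 fallback is an assumption based on the behaviour described above, not the project's actual logic.

```python
def looks_like_utf8(data):
    """Return True if `data` (bytes) decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # error handling is strict by default
        return True
    except UnicodeDecodeError:
        return False


def to_utf8_bytes(data, fallback="iso-8859-1"):
    """Return (utf8_bytes, encoding_name) for a bytes input.

    `fallback` is an assumed legacy encoding, used only when the
    input is not valid UTF-8.
    """
    if looks_like_utf8(data):
        return data, "UTF-8"
    return data.decode(fallback).encode("utf-8"), fallback


print(to_utf8_bytes(b"Anna Ka\xc5\x84toch"))  # already UTF-8, returned unchanged
print(to_utf8_bytes(b"caf\xe9"))              # not valid UTF-8, converted from the fallback
```

A short ISO-8859 string can still happen to be valid UTF-8, so this is a heuristic, but it accepts every well-formed UTF-8 sequence rather than only a few lead bytes.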
Edited by Stanisław Kozicki