Wrong UTF-8 detection in function make_utf8bytes

The function make_utf8bytes does not correctly identify UTF-8 encoding.

Example: the Polish special character 'ń', as in Anna Kańtoch (a Polish author). make_bytestr encodes the name correctly as UTF-8: Anna Ka%C5%84toch

make_utf8bytes then fails to recognise it as UTF-8 in these lines:

if ((ch == '\xC2') | (ch == '\xC3')) & ((chx >= '\xA0') & (chx <= '\xFF')):
    return name, 'UTF-8'

As a result, the name is interpreted as ISO-8859, which produces a wrong string.
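To illustrate why the check misses this name, here is a minimal sketch. The helper old_check_matches is hypothetical: it re-creates the quoted condition under the assumption that ch is one byte and chx is the byte that follows it, which may not match the real loop exactly.

```python
def old_check_matches(data: bytes) -> bool:
    """Hypothetical re-creation of the quoted test (assumption: ch/chx are adjacent bytes)."""
    for lead, following in zip(data, data[1:]):
        # Only lead bytes 0xC2 and 0xC3 are accepted, i.e. code points up to U+00FF
        if lead in (0xC2, 0xC3) and 0xA0 <= following <= 0xFF:
            return True
    return False


print("ń".encode("utf-8"))   # b'\xc5\x84' -> lead byte 0xC5, not 0xC2/0xC3
print("é".encode("utf-8"))   # b'\xc3\xa9' -> lead byte 0xC3, accepted

print(old_check_matches("Anna Kańtoch".encode("utf-8")))  # False
print(old_check_matches("Amélie".encode("utf-8")))        # True
```

'ń' is U+0144, so its UTF-8 lead byte is 0xC5. Any character above U+00FF has a lead byte outside 0xC2/0xC3 and would be missed in the same way.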

Proposed solutions (one of):

Use a dedicated library to detect the encoding, e.g. chardet:

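import chardet

def make_utf8bytes(txt):
    name = make_bytestr(txt)

    # chardet.detect returns a dict with 'encoding' and 'confidence' keys
    result = chardet.detect(name)
    detected_encoding = result['encoding']

    # 'encoding' can be None when chardet cannot decide, so guard before decoding
    if detected_encoding and detected_encoding.lower() != 'utf-8':
        name = name.decode(detected_encoding).encode('utf-8')

    return name, detected_encoding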

Or fix the UTF-8 recognition itself; the current check is NOT sufficient.
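One possible way to fix the recognition without an extra dependency is to attempt a strict UTF-8 decode and fall back only if it fails. This is a sketch, not the project's code: the names looks_like_utf8 and to_utf8_bytes are hypothetical, and ISO-8859-1 as the fallback is an assumption based on the behaviour described above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8_bytes(data: bytes, fallback: str = "iso-8859-1"):
    """Hypothetical helper: return (utf8_bytes, encoding_name)."""
    if looks_like_utf8(data):
        return data, "UTF-8"
    # Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1
    return data.decode(fallback).encode("utf-8"), fallback


print(to_utf8_bytes("Anna Kańtoch".encode("utf-8")))
print(to_utf8_bytes("Amélie".encode("iso-8859-1")))
```

A strict decode covers every valid multi-byte sequence, not just the 0xC2/0xC3 lead bytes. Its weakness is that some ISO-8859 byte strings happen to be valid UTF-8 as well, so it can still misclassify in rare cases.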

Edited Oct 22, 2023 by Stanisław Kozicki