Better Utf8ToUnicode.
`Utf8ToUnicode` does not null-terminate, but returns the length including the unwritten null terminator, i.e. `Utf8ToUnicode(dest='XYZWV', src='abc')` returns 4 but writes dest = `'abcWV'` instead of dest = `'abc'#0'V'`.
However, certain callers inside the RTL (wrongly?) assume that the destination length does not account for the null terminator, i.e. that `Utf8ToUnicode(destLen=3, src='abc')` will write `abc` rather than `ab#0`, so I null-terminate on a residual basis: only if there is room left after the main work. When doing things the Delphi-compatible way (write `ab#0` if 3 chars are allowed), my version fails ~6 new tests.
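To make the "residual" rule concrete, here is a minimal sketch of the final step only. `FinishResidual` is a hypothetical helper, not part of the RTL or of this patch, and it assumes the return value keeps counting the terminator the way the existing function is described above. The conversion itself is not modelled: the buffer is filled by hand.

```pascal
program ResidualTerminatorSketch;
{$mode objfpc}{$H+}

{ Hypothetical helper, not RTL code: models only what happens after
  the main loop has already written Written code units into Dest. }
function FinishResidual(Dest: PUnicodeChar;
  MaxDestChars, Written: SizeUInt): SizeUInt;
begin
  { Terminate only if the main work left at least one free slot. }
  if Written < MaxDestChars then
    Dest[Written] := #0;
  { Assumption: the result counts the terminator whether or not
    it was actually written. }
  Result := Written + 1;
end;

var
  Buf: array[0..4] of UnicodeChar;
  R: SizeUInt;
begin
  { Pretend 'abc' was already converted into Buf[0..2]. }
  Buf[0] := 'a'; Buf[1] := 'b'; Buf[2] := 'c';
  Buf[3] := 'W'; Buf[4] := 'V';

  { Room for 5 code units: the terminator fits, Buf becomes 'abc'#0'V'. }
  R := FinishResidual(@Buf[0], 5, 3);
  WriteLn(R, ' ', Ord(Buf[3]));   { prints: 4 0 }

  { Room for exactly 3: 'abc' stays intact and nothing is appended. }
  R := FinishResidual(@Buf[0], 3, 3);
  WriteLn(R);                     { prints: 4 }
end.
```

Under the Delphi-compatible rule, the second call would instead have to overwrite `Buf[2]` with `#0`, which is the case the RTL callers mentioned above do not expect.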
My version is also a lot faster with multi-byte characters and handles all cases of invalid UTF-8 in existence.
Benchmark against the existing implementation and `Windows.MultiByteToWideChar` (also used by Delphi’s `Utf8ToUnicode`): Utf8ToUnicodeBenchmark.pas.
My results:
Utf8ToUnicode_New(English): 164 ns/call
Utf8ToUnicode_New(Russian): 329 ns/call
Utf8ToUnicode_New(Japanese): 729 ns/call
Utf8ToUnicode_Existing(English): 540 ns/call
Wrong result/checksum: 264/44c37307dd567fbcd8ded57516ac5f5d3bd8b6da, expected: 264/e95b8f2254209e29a12d37c4202b9d0ab544acb2.
Utf8ToUnicode_Existing(Russian): 1.5 us/call
Wrong result/checksum: 256/364b0e70dc769476471b9e09ed18a210492f5a86, expected: 256/8bd04be2c3927654543c4f96316c4baa3fcc8f37.
Utf8ToUnicode_Existing(Japanese): 2.5 us/call
Wrong result/checksum: 259/ebf4dfdd4d2c4e9f953f0f4dec3ce44b7a876914, expected: 259/1267ff0b06b22a51fa310416f0161aeba91424d3.
Utf8ToUnicode_MultiByteToWideChar(English): 106 ns/call
Utf8ToUnicode_MultiByteToWideChar(Russian): 426 ns/call
Utf8ToUnicode_MultiByteToWideChar(Japanese): 818 ns/call
Note that the mismatches of `_Existing` are solely because of missing null terminators. `MultiByteToWideChar` can be faster (e.g. for me it is 4× faster for English on i386) because it purposely optimizes this scenario and because the FPC versions use indices instead of pointers (this code is shared with JVM I guess?).