Better Utf8ToUnicode. (!650) · Merge requests · FPC / FPC / FPC Source

Rika requested to merge runewalsh/source:uu into main Apr 13, 2024

The existing Utf8ToUnicode does not null-terminate its output, yet the length it returns counts the null terminator it never wrote. For example, Utf8ToUnicode(dest='XYZWV', src='abc') returns 4 but leaves dest = 'abcWV' instead of dest = 'abc'#0'V'.
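Because the return value counts the terminator, a caller has to subtract one to get the payload length. Below is a minimal sketch of such a caller, using the four-argument Utf8ToUnicode overload from the System unit. DecodeUtf8 is a hypothetical helper name, and the buffer sizing assumes that UTF-8 never decodes to more UTF-16 code units than it has bytes.

```pascal
program Utf8ToUnicodeCallerSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of the RTL: decodes S and trims the result
// to the payload length reported by Utf8ToUnicode.
function DecodeUtf8(const S: RawByteString): UnicodeString;
var
  N: SizeUInt;
begin
  Result := '';
  if S = '' then
    Exit;
  // Assumption: the output never needs more code units than the input has bytes.
  SetLength(Result, Length(S));
  // Pass Length + 1 because a UnicodeString owns one extra slot for its #0.
  N := Utf8ToUnicode(PUnicodeChar(Result), SizeUInt(Length(Result)) + 1,
    PAnsiChar(S), SizeUInt(Length(S)));
  // The return value counts the terminator, so the payload is N - 1 code units.
  if (N > 0) and (N - 1 <= SizeUInt(Length(S))) then
    SetLength(Result, N - 1)
  else
    Result := '';
end;

begin
  WriteLn(Length(DecodeUtf8('abc')));
end.
```

Since SetLength maintains the string's own terminator, this caller does not depend on whether Utf8ToUnicode wrote a #0 itself.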

However, certain callers inside the RTL (wrongly?) assume that the destination length does not account for the null terminator, i.e. that Utf8ToUnicode(destLen=3, src='abc') will write abc rather than ab#0. So I null-terminate on a residual basis: only if there is room left after the main work. When doing things the Delphi-compatible way (writing ab#0 if 3 chars are allowed), my version fails ~6 new tests.
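The residual policy can be illustrated with a toy, ASCII-only model. This is a sketch under stated assumptions, not the code from this merge request: CopyWithResidualTerminator, Buf, ResetBuf and Show are hypothetical names, and returning the written count plus one in the truncated case is an assumption.

```pascal
program ResidualTerminatorSketch;
{$mode objfpc}{$H+}

var
  Buf: array[0..4] of UnicodeChar;

// Hypothetical ASCII-only stand-in for the conversion loop; not the RTL code.
function CopyWithResidualTerminator(Dest: PUnicodeChar; MaxDestChars: SizeUInt;
  Source: PAnsiChar; SourceBytes: SizeUInt): SizeUInt;
var
  N: SizeUInt;
begin
  N := 0;
  // Main work: the payload may use every slot of the destination.
  while (N < MaxDestChars) and (N < SourceBytes) do
  begin
    Dest[N] := UnicodeChar(Ord(Source[N]));
    Inc(N);
  end;
  // Residual step: terminate only if a slot is left over.
  if N < MaxDestChars then
    Dest[N] := #0;
  // Assumption: the result counts the terminator whether or not it was written.
  Result := N + 1;
end;

// Refill the buffer with the 'XYZWV' pattern used in the description.
procedure ResetBuf;
const
  Init: UnicodeString = 'XYZWV';
var
  I: Integer;
begin
  for I := 0 to High(Buf) do
    Buf[I] := Init[I + 1];
end;

// Print the return value and the buffer, showing the terminator as #0.
procedure Show(Ret: SizeUInt);
var
  I: Integer;
begin
  Write('returned ', Ret, ', buffer:');
  for I := 0 to High(Buf) do
    if Buf[I] = #0 then
      Write(' #0')
    else
      Write(' ', AnsiChar(Ord(Buf[I])));
  WriteLn;
end;

begin
  ResetBuf;
  Show(CopyWithResidualTerminator(@Buf[0], 3, 'abc', 3)); // room for 3 chars
  ResetBuf;
  Show(CopyWithResidualTerminator(@Buf[0], 5, 'abc', 3)); // room for 5 chars
end.
```

With room for 3 chars the model writes all of abc and no terminator. With room for 5 it writes abc, then #0, and leaves the last slot untouched. A Delphi-compatible variant would instead reserve the last permitted slot for #0 and truncate the payload to fit.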

My version is also a lot faster on multi-byte characters and handles every form of invalid UTF-8 input.

Benchmark against the existing implementation and Windows.MultiByteToWideChar (also used by Delphi’s Utf8ToUnicode): Utf8ToUnicodeBenchmark.pas.

My results:

Utf8ToUnicode_New(English):                  164 ns/call
Utf8ToUnicode_New(Russian):                  329 ns/call
Utf8ToUnicode_New(Japanese):                 729 ns/call

Utf8ToUnicode_Existing(English):             540 ns/call
Wrong result/checksum: 264/44c37307dd567fbcd8ded57516ac5f5d3bd8b6da, expected: 264/e95b8f2254209e29a12d37c4202b9d0ab544acb2.
Utf8ToUnicode_Existing(Russian):             1.5 us/call
Wrong result/checksum: 256/364b0e70dc769476471b9e09ed18a210492f5a86, expected: 256/8bd04be2c3927654543c4f96316c4baa3fcc8f37.
Utf8ToUnicode_Existing(Japanese):            2.5 us/call
Wrong result/checksum: 259/ebf4dfdd4d2c4e9f953f0f4dec3ce44b7a876914, expected: 259/1267ff0b06b22a51fa310416f0161aeba91424d3.

Utf8ToUnicode_MultiByteToWideChar(English):  106 ns/call
Utf8ToUnicode_MultiByteToWideChar(Russian):  426 ns/call
Utf8ToUnicode_MultiByteToWideChar(Japanese): 818 ns/call

Note that the _Existing mismatches are solely due to the missing null terminators. MultiByteToWideChar can be faster (e.g. for me it is 4× faster for English on i386) because it deliberately optimizes this scenario and because the FPC versions use indices instead of pointers (I guess this code is shared with JVM?).
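To show what "indices instead of pointers" means, here is a hedged sketch of the same loop written both ways. It counts UTF-8 code points by skipping continuation bytes. It is neither the RTL code nor the code from this merge request: CountIndexed and CountPointer are hypothetical names, and whether the two forms compile to measurably different code depends on the target and optimization settings.

```pascal
program IndexVsPointerSketch;
{$mode objfpc}{$H+}

// Index style: every access computes Source + I.
function CountIndexed(Source: PAnsiChar; SourceBytes: SizeUInt): SizeUInt;
var
  I: SizeUInt;
begin
  Result := 0;
  I := 0;
  while I < SourceBytes do
  begin
    // Bytes of the form 10xxxxxx are continuation bytes; count everything else.
    if (Ord(Source[I]) and $C0) <> $80 then
      Inc(Result);
    Inc(I);
  end;
end;

// Pointer style: one moving pointer compared against a precomputed end.
function CountPointer(Source: PAnsiChar; SourceBytes: SizeUInt): SizeUInt;
var
  SourceEnd: PAnsiChar;
begin
  Result := 0;
  SourceEnd := Source + SourceBytes;
  while Source < SourceEnd do
  begin
    if (Ord(Source^) and $C0) <> $80 then
      Inc(Result);
    Inc(Source);
  end;
end;

var
  Sample: AnsiString;
begin
  // 'a', one two-byte sequence (bytes $D0 $BF), then 'b'.
  Sample := 'a'#$D0#$BF'b';
  WriteLn(CountIndexed(PAnsiChar(Sample), Length(Sample)));
  WriteLn(CountPointer(PAnsiChar(Sample), Length(Sample)));
end.
```

Both functions return the same count. The only difference is how the current position is tracked, and that is the kind of difference the remark above refers to.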

Edited Apr 23, 2024 by Rika
