Completely (up to the Unicode version) fix Utf8CodepointLen.

Rika requested to merge runewalsh/source:ucpl2 into main

Alternative version of !245 (closed) (read it for details) that uses a lookup table. Combining characters are scattered across the Unicode space, and the current implementation, which assumes they are limited to five specific ranges, would cover only 10% of all combining characters even if it checked those ranges correctly in the first place...

The downside is that the table has to be regenerated for future Unicode versions. On the other hand, the current solution already behaves as an extremely incomplete table anyway. As a form of future-proofing, my table treats those five ranges as consisting entirely of combining characters (just as the current Utf8CodepointLen intended to), even though not all of their codepoints are assigned so far.
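To illustrate the idea (not the actual table from this merge request), here is a minimal sketch of a range table with binary search. The names `TCodepointRange`, `CombiningRanges` and `IsCombining` are hypothetical, the table is a tiny hand-picked subset rather than generated data, and the five "whole" ranges are assumed to be the five Unicode combining-mark blocks.

```pascal
program CombiningTableSketch;
{$mode objfpc}

type
  TCodepointRange = record
    First, Last: Cardinal;
  end;

const
  // Illustrative subset only: sorted by First, non-overlapping.
  // A real table would be generated from the Unicode data files.
  CombiningRanges: array[0 .. 8] of TCodepointRange = (
    (First: $0300;  Last: $036F),   // whole block, including future assignments
    (First: $1AB0;  Last: $1AFF),   // whole block, including unassigned codepoints
    (First: $1B00;  Last: $1B04),   // Balinese signs right after the block above
    (First: $1DC0;  Last: $1DFF),   // whole block
    (First: $20D0;  Last: $20FF),   // whole block
    (First: $3099;  Last: $309A),   // kana voicing marks
    (First: $FE20;  Last: $FE2F),   // whole block
    (First: $1D167; Last: $1D169),  // musical tremolo marks
    (First: $E0100; Last: $E01EF)   // variation selectors 17..256
  );

// Binary search over the sorted ranges.
function IsCombining(cp: Cardinal): Boolean;
var
  left, right, mid: Integer;
begin
  left := Low(CombiningRanges);
  right := High(CombiningRanges);
  while left <= right do
  begin
    mid := (left + right) div 2;
    if cp < CombiningRanges[mid].First then
      right := mid - 1
    else if cp > CombiningRanges[mid].Last then
      left := mid + 1
    else
      exit(true);
  end;
  result := false;
end;

begin
  WriteLn('U+0410:  ', IsCombining($0410));   // Cyrillic A, a base letter
  WriteLn('U+033F:  ', IsCombining($033F));   // inside the first block
  WriteLn('U+1B00:  ', IsCombining($1B00));   // just past U+1AB0..U+1AFF
  WriteLn('U+3099:  ', IsCombining($3099));   // outside the five blocks
  WriteLn('U+E0100: ', IsCombining($E0100));  // 4-byte combining character
end.
```

A sorted range list keeps the data small and makes "treat the whole block as combining" a one-line entry; a production version may well prefer a denser multi-stage table to avoid the search.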

Tests and benchmarks of the current and proposed versions, modified from !245 (closed): ucpl2_demo.pas. The funny thing is that the original from !245 (closed) checked the character U+1B00, which sits right next to the combining range U+1AB0..U+1AFF, and expected it to be non-combining, but it happens to be combining too.

My results (x86-64/win64):

Testing Utf8CodePointLenV1.
Fail on Last allowed 4-byte + 1, U+110000 (withMarks=FALSE): got 4, expected -4
Fail on First 5-byte, U+200000 (withMarks=FALSE): got 5, expected -1
Fail on Overlong 2-byte U+7F (withMarks=FALSE): got 2, expected -1
Fail on Overlong 3-byte NULL (withMarks=FALSE): got 3, expected -3
Fail on Overlong 3-byte U+7FF (withMarks=FALSE): got 3, expected -3
Fail on Overlong 4-byte U+FFFF (withMarks=FALSE): got 4, expected -4
Fail on Cyrillic A + U+1AFF (last in the combining range 1AB0..1AFF) (withMarks=TRUE): got 2, expected 5
Fail on Cyrillic A + U+1B00 (character just to the right of the combining range 1AB0..1AFF that happens to be combining anyway :D) (withMarks=TRUE): got 2, expected 5
Fail on Cyrillic A + U+33F COMBINING DOUBLE OVERLINE (withMarks=TRUE): got 2, expected 4
Fail on Cyrillic A + U+1AC0 COMBINING LATIN SMALL LETTER TURNED W BELOW (withMarks=TRUE): got 2, expected 5
Fail on Cyrillic A + U+3099 (kana voice mark, 3-byte combining character outside of five ranges) (withMarks=TRUE): got 2, expected 5
Fail on Cyrillic A + U+1D167 (tremolo, 4-byte combining character outside of five ranges) (withMarks=TRUE): got 2, expected 6
Fail on Cyrillic A + U+E0100 (variation selector 17, 4-byte combining character outside of five ranges and special-cased in the lookup table) (withMarks=TRUE): got 2, expected 6
Done.

Testing Utf8CodePointLenV2B.
Done.

---
Benchmarking Utf8CodePointLenV1, code size = 592 b.

European-like Lorem Ipsum                    / 361 x 1b,  69 x 2b
Without diacritics: 3.4 ns/call.
With diacritics:    6.1 ns/call.

Classical Japanese literature (Harry Potter) /   8 x 1b, 207 x 3b
Without diacritics: 6.3 ns/call.
With diacritics:    8.6 ns/call.

Zalgo humoresque                             /  75 x 1b, 645 x 2b,   1 x 3b
Without diacritics: 5.6 ns/call.
With diacritics:    7.6 ns/call.

Done.

---
Benchmarking Utf8CodePointLenV2B, code size = 1040 b.

European-like Lorem Ipsum                    / 361 x 1b,  69 x 2b
Without diacritics: 2.7 ns/call.
With diacritics:    5.0 ns/call.

Classical Japanese literature (Harry Potter) /   8 x 1b, 207 x 3b
Without diacritics: 3.5 ns/call.
With diacritics:    7.0 ns/call.

Zalgo humoresque                             /  75 x 1b, 287 x 2b,   1 x 3b, 358 x 2b comb
Without diacritics: 3.2 ns/call.
With diacritics:    9.3 ns/call.

Done.

I also added all of the tests that the current implementation fails to tests/test/tutf8cpl.pp.
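For readers unfamiliar with the first group of failures (out-of-range, 5-byte and overlong sequences), the checks involved can be sketched roughly as below. `StrictSeqLen` is a hypothetical helper, not the code from this merge request: it returns 0 for any malformed sequence and deliberately does not reproduce the negative return values of Utf8CodepointLen.

```pascal
program StrictUtf8Sketch;
{$mode objfpc}

// Returns the length of a well-formed UTF-8 sequence at the start of
// "bytes", or 0 if the sequence is malformed or truncated.
function StrictSeqLen(const bytes: array of Byte): Integer;
const
  // Smallest codepoint that legitimately needs 2, 3 and 4 bytes.
  MinCp: array[2 .. 4] of Cardinal = ($80, $800, $10000);
var
  n, i: Integer;
  cp: Cardinal;
begin
  result := 0;
  if Length(bytes) = 0 then exit;
  if bytes[0] < $80 then exit(1);
  if (bytes[0] and $E0) = $C0 then
  begin n := 2; cp := bytes[0] and $1F; end
  else if (bytes[0] and $F0) = $E0 then
  begin n := 3; cp := bytes[0] and $0F; end
  else if (bytes[0] and $F8) = $F0 then
  begin n := 4; cp := bytes[0] and $07; end
  else
    exit;  // stray continuation byte, or a 5-byte and longer lead byte
  if Length(bytes) < n then exit;  // truncated
  for i := 1 to n - 1 do
  begin
    if (bytes[i] and $C0) <> $80 then exit;  // not a continuation byte
    cp := (cp shl 6) or (bytes[i] and $3F);
  end;
  if cp < MinCp[n] then exit;                    // overlong encoding
  if cp > $10FFFF then exit;                     // beyond the Unicode range
  if (cp >= $D800) and (cp <= $DFFF) then exit;  // UTF-16 surrogate
  result := n;
end;

begin
  WriteLn(StrictSeqLen([$D0, $90]));                // Cyrillic A, U+0410
  WriteLn(StrictSeqLen([$F4, $8F, $BF, $BF]));      // U+10FFFF, last valid
  WriteLn(StrictSeqLen([$F4, $90, $80, $80]));      // U+110000
  WriteLn(StrictSeqLen([$F8, $88, $80, $80, $80])); // 5-byte form of U+200000
  WriteLn(StrictSeqLen([$C1, $BF]));                // overlong 2-byte U+7F
  WriteLn(StrictSeqLen([$E0, $80, $80]));           // overlong 3-byte NULL
  WriteLn(StrictSeqLen([$F0, $8F, $BF, $BF]));      // overlong 4-byte U+FFFF
end.
```

The overlong check is a comparison of the decoded codepoint against the minimum for that sequence length, which is the check the "got 2/3/4" failures in the V1 log suggest is missing.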

Edited by Rika