
Mostly fix Utf8CodepointLen.

Rika requested to merge runewalsh/source:ucpl into main

Utf8CodepointLen has at least three bugs. Two of them are fixed in this merge request; fixing the third is less trivial (but once fixed, it could make the code even simpler than before, either by sharing the lookup between the main and diacritic parts or by offloading the branching on diacritics to the lookup table).

The first bug (fixed) is that the function accepts overlong UTF-8 sequences, which the Unicode standard prohibits, as well as sequences of more than four bytes, which are theoretically possible but explicitly prohibited by the same standard.

This matters when using Utf8CodepointLen to validate UTF-8 as in #39800.
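The fix boils down to per-length minimum/maximum checks on the lead and first continuation byte. A minimal C sketch of the idea (the MR itself is in Pascal, and the real function distinguishes error cases with different negative values, while this sketch collapses them all to -1):

```c
/* Hypothetical helper illustrating the validation idea: return the length
   1..4 of a well-formed UTF-8 sequence starting at p, or -1 for an invalid
   lead, an overlong encoding, or a sequence beyond U+10FFFF. */
static int utf8_len_checked(const unsigned char *p)
{
    unsigned b0 = p[0];
    if (b0 < 0x80) return 1;                        /* ASCII */
    if (b0 < 0xC2) return -1;                       /* bare continuation, or C0/C1: overlong 2-byte */
    if (b0 < 0xE0)                                  /* 2 bytes: U+0080..U+07FF */
        return (p[1] & 0xC0) == 0x80 ? 2 : -1;
    if (b0 < 0xF0) {                                /* 3 bytes: U+0800..U+FFFF */
        if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return -1;
        if (b0 == 0xE0 && p[1] < 0xA0) return -1;   /* overlong: would encode < U+0800 */
        if (b0 == 0xED && p[1] > 0x9F) return -1;   /* UTF-16 surrogates U+D800..U+DFFF */
        return 3;
    }
    if (b0 < 0xF5) {                                /* 4 bytes: U+10000..U+10FFFF */
        if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80
            || (p[3] & 0xC0) != 0x80) return -1;
        if (b0 == 0xF0 && p[1] < 0x90) return -1;   /* overlong: would encode < U+10000 */
        if (b0 == 0xF4 && p[1] > 0x8F) return -1;   /* above U+10FFFF */
        return 4;
    }
    return -1;                                      /* F5..FF: would need 5+ bytes or exceed U+10FFFF */
}
```

Note the surrogate check is included for completeness of the sketch; the MR text itself only discusses overlongs and over-long (5+ byte) leads.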

The second bug (fixed) is that its diacritic checks, in an attempt to be super-clever, were coded in the wrong way. For example, the range

Combining Diacritical Marks:
1) U+0300 - U+036F in UTF-8 = %11001100 %10000000 - %11001101 %10101111

is checked as

```pascal
((ord(p[result]) and %11001100=%11001100)) and
(ord(p[result+1])>=%10000000) and
(ord(p[result+1])<=%10101111)
```

Note how the second byte is checked independently of the first. This wrongly rejects U+033F COMBINING DOUBLE OVERLINE = %11001100 %10111111.
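The correct way to test such a range is to compare the (lead, continuation) byte pair as a single value rather than each byte independently. A hypothetical C sketch for this one range (the helper name is invented for illustration):

```c
/* True if p points at a 2-byte UTF-8 sequence in U+0300..U+036F.
   U+0300 encodes as CC 80 and U+036F as CD AF, so comparing both bytes
   together as a 16-bit value gives the exact range, while the continuation
   mask filters out the invalid gap CC C0..CC FF. */
static int in_combining_0300_036F(const unsigned char *p)
{
    unsigned v = ((unsigned)p[0] << 8) | p[1];  /* both bytes as one value */
    return v >= 0xCC80 && v <= 0xCDAF && (p[1] & 0xC0) == 0x80;
}
```

With this form, U+033F (CC BF = 0xCCBF) is correctly accepted and U+0370 (CD B0 = 0xCDB0) correctly rejected.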

The third bug (less trivial to fix) is that combining characters do not lie only in these five ranges; they are scattered everywhere as characters with General_Category = Nonspacing_Mark, Spacing_Mark, or Enclosing_Mark. An example is the Japanese combining voiced sound mark, which can be artistically appended to letters not designated for it (A゙); it falls under the category Nonspacing_Mark and has the code U+3099. Depending on the UnicodeData unit to query these categories would add a huge (100 Kb) overhead. I tried to adapt my utility from !179 to build an ad-hoc table containing bitpacked “is this a mark” flags; it had three levels and occupied 1.5 Kb, which is more than tolerable, but I don’t want to scare anyone more than required, so I left it out for now.

Benchmarks and tests of both implementations (of course partly designed specifically to stumble into the described bugs):
ucpl_test.pas

My results:

Testing Utf8CodePointLenV1.
Fail on Last allowed 4-byte + 1, U+110000 (withMarks=FALSE): got 4, expected -4
Fail on First 5-byte, U+200000 (withMarks=FALSE): got 5, expected -1
Fail on Overlong 2-byte U+7F (withMarks=FALSE): got 2, expected -1
Fail on Overlong 3-byte NULL (withMarks=FALSE): got 3, expected -3
Fail on Overlong 3-byte U+7FF (withMarks=FALSE): got 3, expected -3
Fail on Overlong 4-byte U+FFFF (withMarks=FALSE): got 0, expected -1
Fail on Cyrillic A + U+1AFF (last in the combining range 1AB0..1AFF) (withMarks=TRUE): got 2, expected 5
Fail on Cyrillic A + U+33F COMBINING DOUBLE OVERLINE (withMarks=TRUE): got 2, expected 4
Fail on Cyrillic A + U+1AC0 COMBINING LATIN SMALL LETTER TURNED W BELOW (withMarks=TRUE): got 2, expected 5
Done.

Testing Utf8CodePointLenV2.
Done.

---
Benchmarking Utf8CodePointLenV1, code size = 592 b.

European-like Lorem Ipsum                    / 361 x 1b,  69 x 2b
Without diacritics: 3.5 ns/call.
With diacritics:    6.7 ns/call.

Classical Japanese literature (Harry Potter) /   8 x 1b, 207 x 3b
Without diacritics: 6.3 ns/call.
With diacritics:    9.0 ns/call.

Zalgo humoresque                             /  75 x 1b, 645 x 2b,   1 x 3b
Without diacritics: 5.5 ns/call.
With diacritics:    7.9 ns/call.

Done.

---
Benchmarking Utf8CodePointLenV2, code size = 1024 b.

European-like Lorem Ipsum                    / 361 x 1b,  69 x 2b
Without diacritics: 3.6 ns/call.
With diacritics:    3.7 ns/call.

Classical Japanese literature (Harry Potter) /   8 x 1b, 207 x 3b
Without diacritics: 3.8 ns/call.
With diacritics:    4.4 ns/call.

Zalgo humoresque                             /  75 x 1b, 287 x 2b,   1 x 3b, 358 x 2b comb
Without diacritics: 3.9 ns/call.
With diacritics:    6.4 ns/call.

Done.
Edited by Rika
