UnicodeData Alternative [PoC]
I have an idea that you might like; it is ideally (though not necessarily) compatibility-breaking. I should have proposed this on the mailing list first, but I’m too lazy to get acquainted with this mailing list thing.
Presently, UnicodeData.pas uses a constant array UC_PROP_ARRAY: array[0 .. <4 thousands>] of TUC_Prop (TUC_Prop is 11 or 12 bytes) to store some character properties from the Unicode database file ucd/UnicodeData.txt, plus helper tables that map the sparser codepoints to its denser elements. Despite the densifying tricks, these arrays are somewhat big (45 Kb for UC_PROP_ARRAY + 40 Kb for the helper tables in unicodedata.inc), and the TUC_Prop form prompts the use of the endianness-dependent UInt24 and having big- and little-endian versions of UC_PROP_ARRAY.
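For reference, the current record is shaped roughly like this (a from-memory sketch; unicodedata.pas is authoritative):

    type
      // Approximate shape of the present per-codepoint record
      // (illustrative; field set and order may differ slightly).
      TUC_Prop = packed record
        CategoryData: Byte;       // general category plus flag bits
        CCC: Byte;                // canonical combining class
        NumericIndex: Byte;       // index into a numeric-value table
        SimpleUpperCase: UInt24;  // 3-byte, endianness-dependent
        SimpleLowerCase: UInt24;  // 3-byte, endianness-dependent
        DecompositionID: SmallInt;
      end; // 11 bytes packed, possibly padded to 12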
The alternative I propose is to store separate character properties in separate table hierarchies, as TUC_Prop fields are often different in nature, if not mutually exclusive. For example, all simple case mappings (Simple_Uppercase, Simple_Lowercase and Simple_Titlecase) can be implemented with only 2.5 Kb of tables plus 0.5 Kb of code, eight times smaller than the space occupied by the SimpleLowerCase and SimpleUpperCase fields of UC_PROP_ARRAY alone (4000 × (3 + 3) = 24 Kb). Simple_Titlecase is not even present in the current UnicodeData.pas, but it shares almost all of its data with Simple_Uppercase, allowing my tables to provide it for next to nothing.
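To give an idea of the shape such tables take, here is a minimal sketch of a two-stage lookup; the table names, element types and the cutoff constant are illustrative, not the actual generated code:

    // Sketch of a two-stage lookup over hypothetical generated tables.
    // UpperBlock maps each 256-codepoint block to a block index; since most
    // codepoints map to themselves, most blocks collapse into one shared
    // all-zero delta block, which is where the size saving comes from.
    const
      MAX_MAPPED_CP = $2FFFF; // hypothetical: last codepoint with any mapping
    var
      UpperBlock: array [0 .. MAX_MAPPED_CP shr 8] of Byte;
      UpperDelta: array of Int32; // (number of distinct blocks) * 256 deltas

    function GetSimpleUppercase(cp: UCS4Char): UCS4Char;
    begin
      if cp > MAX_MAPPED_CP then
        Exit(cp); // out of table range: maps to itself
      Result := UCS4Char(Int32(cp) +
        UpperDelta[UpperBlock[cp shr 8] * 256 + (cp and $FF)]);
    end;

Simple_Titlecase can then reuse almost the whole delta table, with a small exception list for the few codepoints (the Dž/Lj/Nj-style digraphs) where titlecase differs from uppercase.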
The opposite, worst case is my GeneralCategory tables: they occupy 11 Kb while the UC_PROP_ARRAY[].CategoryData fields total 4 Kb. This, however, is probably the only such case, is fully compensated by everything else, and my categories are more complete than FPC’s: among the first 100 000 codepoints, 51 are assigned wrongly (mainly because FPC’s UnicodeData.GetProps reports Georgian letters as Lo (Other_Letter) while UnicodeData.txt has them as Ll (Lowercase_Letter)), and 1200 are reported as Cn (Unassigned) despite being present in the latest UnicodeData.txt (Unassigned is the dedicated category for codepoints not present). In total, my tables save at least 50 Kb, and even more if some of them are unused, as the compiler can drop them one by one, unlike the monolithic UC_PROP_ARRAY.
This could introduce an incompatibility: GetProps(sym): PUC_Prop is not readily available by design and probably cannot be emulated sanely enough. Instead of GetProps(sym), separate functions for different properties are provided: GetSimpleUppercase(sym), etc. In practice, most usage scenarios (including all usages in the RTL itself) do things like GetProps(sym)^.CategoryData, that is, they read one specific property, which can be replaced with GetCategoryData(sym) with no drawbacks.
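For example, the migration of a typical call site would look like:

    // Before: fetch the whole property record, read one field.
    cat := GetProps(sym)^.CategoryData;
    // After: ask for exactly the property that is needed.
    cat := GetCategoryData(sym);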
Still, I have implemented a GetProps emulation based on disguising a codepoint as a pointer (i.e. in an in-sane way). I marked it deprecated; ideally it should be removed. I was also considering renaming it to UnpackProps to reduce the temptation to use it, as it must be slower than the dedicated functions and relies on a trick of questionable viability.
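For the curious, the trick can be sketched like this (purely illustrative: it assumes the record fields become properties whose getters recover the codepoint from the record’s own address; my real code may differ):

    {$modeswitch advancedrecords}
    type
      TUC_Prop = record
      strict private
        function DoGetCategoryData: Byte;
      public
        property CategoryData: Byte read DoGetCategoryData;
      end;
      PUC_Prop = ^TUC_Prop;

    // The returned "pointer" is the codepoint itself in disguise and must
    // never be dereferenced as real memory; only property getters are safe.
    function GetProps(cp: UCS4Char): PUC_Prop; deprecated;
    begin
      Result := PUC_Prop(PtrUInt(cp));
    end;

    function TUC_Prop.DoGetCategoryData: Byte;
    begin
      // Recover the codepoint smuggled in as our own address.
      Result := GetCategoryData(UCS4Char(PtrUInt(@Self)));
    end;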
Also, I have made a program that tests UnicodeData.NormalizeNFD and UnicodeData_Alt.NormalizeNFD (didn’t test anything else yet...) on the data from ucd/NormalizationTest.txt. Note that the test data is compiled into the executable, always occupying sizeof(TestsRaw) = 897.5 Kb of it.
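Each line of NormalizationTest.txt carries five columns (c1..c5 = source; NFC; NFD; NFKC; NFKD), and a conformant NFD must satisfy c3 = toNFD(c1) = toNFD(c2) = toNFD(c3). So the checking loop boils down to something like this sketch (identifiers illustrative; assumes the UnicodeString overload of NormalizeNFD):

    // Count test cases where NormalizeNFD disagrees with the expected
    // NFD column. Tests[] holds the parsed c1..c5 columns per line.
    // (The toNFD(c4)/toNFD(c5) half of the clause is omitted for brevity.)
    Failures := 0;
    for i := 0 to High(Tests) do
      with Tests[i] do
        if (NormalizeNFD(c1) <> c3) or
           (NormalizeNFD(c2) <> c3) or
           (NormalizeNFD(c3) <> c3) then
          Inc(Failures);
    WriteLn('NormalizeNFD failures: ', Failures, '/', Length(Tests));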
My results (-O4 -Xs, x86-64/win64) are:
— With only UnicodeData enabled ({$define test_orig} {-$define test_alt}), the executable size is 1 124 Kb = sizeof(TestsRaw) + 226.5 Kb, and:

    NormalizeNFD failures: 645/18992
    90.2 ns/NormalizeNFD call
    1124.0 ns/UnicodeToUpper call
— With only UnicodeData_Alt enabled ({-$define test_orig} {$define test_alt}), the executable size is 1 042.5 Kb = sizeof(TestsRaw) + 145 Kb, and:

    NormalizeNFD failures: 434/18992
    84.4 ns/NormalizeNFD call
    509.0 ns/UnicodeToUpper call
— With both versions enabled ({$define test_orig} {$define test_alt}), the executable size is 1 145 Kb = sizeof(TestsRaw) + 247.5 Kb. This means that UnicodeData added 247.5 − 145 = 102.5 Kb, and UnicodeData_Alt added 247.5 − 226.5 = 21 Kb.
Honestly, I didn’t expect a speedup at all and was only seeking the size reduction; I also vaguely remember that uppercasing with my tables is slower than with system functions like WinAPI’s CharUpperBuffW, so the credit for the 2× uppercasing speedup goes to the poor performance of the original UnicodeData.
Summarizing, my proposal makes UnicodeData three to five times smaller (always) and two times faster (sometimes; with a fine-printed footnote about shortcomings).
Lastly, I have generated all my data with a P*thon script that will require tidying up, reworking to generate turnkey tables for {$include}ing, and rewriting into Pascal for ideological and practical purity (to be able to update the tables to a new Unicode version without requiring a Python installation, haha). That is going to take a long time considering my laziness (think Jan 2023 or so), but I guess it’s too early to think about it anyway; right now I just wanted the opinion of the people in charge.