Skip to content

Refine all x86/x64 Index/Compare.

Rika requested to merge runewalsh/source:ic-refine into main

Redoes most all of the functions I did previously using GCC by hand because GCC output both lacks aesthetics and is underwhelming in comparison (pun not intended) to what can be done directly with certain ideas I didn’t know at that time or difficult to express in a high-level language.

For example, i386.inc:CompareByte_Plain reads head and tail uint32 unaligned instead of aligning with bytewise loops, which in the former worst case of 3+3 head and tail bytes can give a 7→2 ns speedup (if this function were called...), SSE versions use the approach similar to !397 (merged) (less extreme though), etc.

Brand-new idea: SSE comparisons align intermediate vectors on 16 bytes, search for the first differing byte with pcmbeqbs regardless of the word size in question, then recover the word position containing mismatching byte and compare two mismatching words. This allows using aligned vectors even if the input was misaligned on its natural boundary.

Benchmark: IndexCompareBenchmark.pas. My results:

i386
                              New plain   Existing plain  New SSE2    Existing SSE2

IndexDWord(#0 .. #1):        2.0 ns/call   2.3 ns/call   2.6 ns/call   2.8 ns/call
IndexDWord(#0 .. #2):        2.0 ns/call   2.2 ns/call   2.8 ns/call   3.1 ns/call
IndexDWord(#0 .. #3):        2.0 ns/call   2.4 ns/call   3.1 ns/call   2.8 ns/call
IndexDWord(#0 .. #4):        2.3 ns/call   2.7 ns/call   1.6 ns/call   2.9 ns/call
IndexDWord(#0 .. #14):       3.2 ns/call   3.9 ns/call   2.1 ns/call   3.3 ns/call
IndexDWord(#0 .. #15):       3.3 ns/call   3.9 ns/call   2.1 ns/call   3.4 ns/call
IndexDWord(#0 .. #16):       3.5 ns/call   4.2 ns/call   2.4 ns/call   3.3 ns/call
IndexDWord(#0 .. #49):        11 ns/call    10 ns/call   3.9 ns/call   4.5 ns/call
IndexDWord(#0 .. #99):        23 ns/call    26 ns/call    11 ns/call   8.3 ns/call
IndexDWord(#0 .. #999):      132 ns/call   135 ns/call    58 ns/call    60 ns/call

CompareByte(#0 / 1):         1.6 ns/call   2.0 ns/call
CompareByte(#1 / 2):         2.3 ns/call   2.9 ns/call
CompareByte(#2 / 3):         2.7 ns/call   3.4 ns/call
CompareByte(#6 / 7):         2.7 ns/call   4.3 ns/call
CompareByte(#14 / 15):       3.9 ns/call   5.7 ns/call
CompareByte(#30 / 31):       5.5 ns/call   7.5 ns/call
CompareByte(#1 / 100):       1.7 ns/call   2.2 ns/call
CompareByte(#50 / 100):      6.9 ns/call   7.4 ns/call
CompareByte(#99 / 100):       11 ns/call    11 ns/call
CompareByte(#100 / 200):      10 ns/call    11 ns/call
CompareByte(#199 / 200):      18 ns/call    18 ns/call
CompareByte(#999 / 1000):     82 ns/call    80 ns/call
CompareByte(#5000 / 10000):  343 ns/call   361 ns/call
CompareByte(#9999 / 10000):  667 ns/call   708 ns/call

CompareWord(#0 / 1):         2.0 ns/call   2.2 ns/call   2.3 ns/call   3.0 ns/call
CompareWord(#1 / 2):         2.7 ns/call   2.9 ns/call   2.7 ns/call   2.7 ns/call
CompareWord(#2 / 3):         2.9 ns/call   2.9 ns/call   2.6 ns/call   2.9 ns/call
CompareWord(#6 / 7):         3.1 ns/call   3.4 ns/call   2.7 ns/call   3.2 ns/call
CompareWord(#14 / 15):       4.5 ns/call   5.2 ns/call   2.4 ns/call   2.7 ns/call
CompareWord(#30 / 31):       7.6 ns/call   8.1 ns/call   2.7 ns/call   3.0 ns/call
CompareWord(#1 / 100):       1.7 ns/call   2.5 ns/call   2.2 ns/call   3.0 ns/call
CompareWord(#50 / 100):      4.6 ns/call   5.1 ns/call   3.0 ns/call   3.4 ns/call
CompareWord(#99 / 100):      6.0 ns/call   6.4 ns/call   3.6 ns/call   4.4 ns/call
CompareWord(#100 / 200):      18 ns/call    18 ns/call   4.1 ns/call   4.6 ns/call
CompareWord(#199 / 200):      25 ns/call    25 ns/call   5.5 ns/call   6.5 ns/call
CompareWord(#999 / 1000):    130 ns/call   139 ns/call    33 ns/call    42 ns/call
CompareWord(#5000 / 10000):  288 ns/call   298 ns/call   140 ns/call   192 ns/call
CompareWord(#9999 / 10000):  436 ns/call   436 ns/call   210 ns/call   293 ns/call

CompareDWord(#0 / 1):        1.8 ns/call   2.1 ns/call   2.7 ns/call   2.5 ns/call
CompareDWord(#1 / 2):        1.8 ns/call   2.4 ns/call   2.7 ns/call   2.5 ns/call
CompareDWord(#2 / 3):        2.0 ns/call   2.5 ns/call   2.7 ns/call   2.7 ns/call
CompareDWord(#6 / 7):        3.1 ns/call   4.0 ns/call   2.3 ns/call   2.6 ns/call
CompareDWord(#14 / 15):      4.9 ns/call   5.4 ns/call   2.9 ns/call   3.3 ns/call
CompareDWord(#30 / 31):      9.9 ns/call    11 ns/call   4.8 ns/call   5.2 ns/call
CompareDWord(#1 / 100):      1.8 ns/call   2.7 ns/call   2.3 ns/call   2.7 ns/call
CompareDWord(#50 / 100):     3.0 ns/call   3.8 ns/call   2.6 ns/call   3.2 ns/call
CompareDWord(#99 / 100):     3.4 ns/call   4.2 ns/call   2.7 ns/call   3.3 ns/call
CompareDWord(#100 / 200):     27 ns/call    30 ns/call    10 ns/call    13 ns/call
CompareDWord(#199 / 200):     28 ns/call    31 ns/call    10 ns/call    14 ns/call
CompareDWord(#999 / 1000):   264 ns/call   318 ns/call   110 ns/call   163 ns/call
CompareDWord(#5000 / 10000): 544 ns/call   630 ns/call   246 ns/call   343 ns/call
CompareDWord(#9999 / 10000): 561 ns/call   654 ns/call   260 ns/call   361 ns/call
x86-64
                                 New        Existing

IndexDWord(#0 .. #1):        2.7 ns/call   2.7 ns/call
IndexDWord(#0 .. #2):        2.8 ns/call   3.0 ns/call
IndexDWord(#0 .. #3):        3.2 ns/call   2.4 ns/call
IndexDWord(#0 .. #4):        2.1 ns/call   2.4 ns/call
IndexDWord(#0 .. #14):       2.5 ns/call   2.8 ns/call
IndexDWord(#0 .. #15):       2.5 ns/call   2.8 ns/call
IndexDWord(#0 .. #16):       2.6 ns/call   2.9 ns/call
IndexDWord(#0 .. #49):       3.9 ns/call   3.9 ns/call
IndexDWord(#0 .. #99):       8.1 ns/call   6.6 ns/call
IndexDWord(#0 .. #999):       59 ns/call    60 ns/call

IndexQWord(#0 .. #1):        2.1 ns/call   2.4 ns/call
IndexQWord(#0 .. #2):        2.2 ns/call   2.7 ns/call
IndexQWord(#0 .. #3):        2.3 ns/call   2.6 ns/call
IndexQWord(#0 .. #4):        2.5 ns/call   3.2 ns/call
IndexQWord(#0 .. #14):       3.7 ns/call   4.4 ns/call
IndexQWord(#0 .. #15):       3.8 ns/call   4.5 ns/call
IndexQWord(#0 .. #16):       3.9 ns/call   4.7 ns/call
IndexQWord(#0 .. #49):        11 ns/call    11 ns/call
IndexQWord(#0 .. #99):        23 ns/call    26 ns/call
IndexQWord(#0 .. #999):      132 ns/call   132 ns/call

CompareWord(#0 / 1):         2.4 ns/call   2.4 ns/call
CompareWord(#1 / 2):         2.8 ns/call   2.4 ns/call
CompareWord(#2 / 3):         2.8 ns/call   2.6 ns/call
CompareWord(#6 / 7):         2.8 ns/call   3.0 ns/call
CompareWord(#14 / 15):       2.4 ns/call   2.8 ns/call
CompareWord(#30 / 31):       2.8 ns/call   3.1 ns/call
CompareWord(#1 / 100):       2.3 ns/call   2.5 ns/call
CompareWord(#50 / 100):      3.4 ns/call   3.4 ns/call
CompareWord(#99 / 100):      3.9 ns/call   4.7 ns/call
CompareWord(#100 / 200):     4.3 ns/call   4.5 ns/call
CompareWord(#199 / 200):     5.6 ns/call   6.7 ns/call
CompareWord(#999 / 1000):     33 ns/call    47 ns/call
CompareWord(#5000 / 10000):  140 ns/call   206 ns/call
CompareWord(#9999 / 10000):  210 ns/call   306 ns/call

CompareDWord(#0 / 1):        2.4 ns/call   2.8 ns/call
CompareDWord(#1 / 2):        2.7 ns/call   3.2 ns/call
CompareDWord(#2 / 3):        2.7 ns/call   3.3 ns/call
CompareDWord(#6 / 7):        2.7 ns/call   2.9 ns/call
CompareDWord(#14 / 15):      3.4 ns/call   4.0 ns/call
CompareDWord(#30 / 31):      4.7 ns/call   6.3 ns/call
CompareDWord(#1 / 100):      2.6 ns/call   2.5 ns/call
CompareDWord(#50 / 100):     2.8 ns/call   2.9 ns/call
CompareDWord(#99 / 100):     2.9 ns/call   3.1 ns/call
CompareDWord(#100 / 200):     10 ns/call    15 ns/call
CompareDWord(#199 / 200):     11 ns/call    16 ns/call
CompareDWord(#999 / 1000):   111 ns/call   189 ns/call
CompareDWord(#5000 / 10000): 243 ns/call   378 ns/call
CompareDWord(#9999 / 10000): 258 ns/call   397 ns/call
Edited by Rika

Merge request reports