Refine all x86/x64 Index/Compare.
Redoes most all of the functions I did previously using GCC by hand because GCC output both lacks aesthetics and is underwhelming in comparison (pun not intended) to what can be done directly with certain ideas I didn’t know at that time or difficult to express in a high-level language.
For example, i386.inc:CompareByte_Plain
reads head and tail uint32
unaligned instead of aligning with bytewise loops, which in the former worst case of 3+3 head and tail bytes can give a 7→2 ns speedup (if this function were called...), SSE versions use the approach similar to !397 (merged) (less extreme though), etc.
Brand-new idea: SSE comparisons align intermediate vectors on 16 bytes, search for the first differing byte with pcmbeqb
s regardless of the word size in question, then recover the word position containing mismatching byte and compare two mismatching words. This allows using aligned vectors even if the input was misaligned on its natural boundary.
Benchmark: IndexCompareBenchmark.pas. My results:
i386
New plain Existing plain New SSE2 Existing SSE2
IndexDWord(#0 .. #1): 2.0 ns/call 2.3 ns/call 2.6 ns/call 2.8 ns/call
IndexDWord(#0 .. #2): 2.0 ns/call 2.2 ns/call 2.8 ns/call 3.1 ns/call
IndexDWord(#0 .. #3): 2.0 ns/call 2.4 ns/call 3.1 ns/call 2.8 ns/call
IndexDWord(#0 .. #4): 2.3 ns/call 2.7 ns/call 1.6 ns/call 2.9 ns/call
IndexDWord(#0 .. #14): 3.2 ns/call 3.9 ns/call 2.1 ns/call 3.3 ns/call
IndexDWord(#0 .. #15): 3.3 ns/call 3.9 ns/call 2.1 ns/call 3.4 ns/call
IndexDWord(#0 .. #16): 3.5 ns/call 4.2 ns/call 2.4 ns/call 3.3 ns/call
IndexDWord(#0 .. #49): 11 ns/call 10 ns/call 3.9 ns/call 4.5 ns/call
IndexDWord(#0 .. #99): 23 ns/call 26 ns/call 11 ns/call 8.3 ns/call
IndexDWord(#0 .. #999): 132 ns/call 135 ns/call 58 ns/call 60 ns/call
CompareByte(#0 / 1): 1.6 ns/call 2.0 ns/call
CompareByte(#1 / 2): 2.3 ns/call 2.9 ns/call
CompareByte(#2 / 3): 2.7 ns/call 3.4 ns/call
CompareByte(#6 / 7): 2.7 ns/call 4.3 ns/call
CompareByte(#14 / 15): 3.9 ns/call 5.7 ns/call
CompareByte(#30 / 31): 5.5 ns/call 7.5 ns/call
CompareByte(#1 / 100): 1.7 ns/call 2.2 ns/call
CompareByte(#50 / 100): 6.9 ns/call 7.4 ns/call
CompareByte(#99 / 100): 11 ns/call 11 ns/call
CompareByte(#100 / 200): 10 ns/call 11 ns/call
CompareByte(#199 / 200): 18 ns/call 18 ns/call
CompareByte(#999 / 1000): 82 ns/call 80 ns/call
CompareByte(#5000 / 10000): 343 ns/call 361 ns/call
CompareByte(#9999 / 10000): 667 ns/call 708 ns/call
CompareWord(#0 / 1): 2.0 ns/call 2.2 ns/call 2.3 ns/call 3.0 ns/call
CompareWord(#1 / 2): 2.7 ns/call 2.9 ns/call 2.7 ns/call 2.7 ns/call
CompareWord(#2 / 3): 2.9 ns/call 2.9 ns/call 2.6 ns/call 2.9 ns/call
CompareWord(#6 / 7): 3.1 ns/call 3.4 ns/call 2.7 ns/call 3.2 ns/call
CompareWord(#14 / 15): 4.5 ns/call 5.2 ns/call 2.4 ns/call 2.7 ns/call
CompareWord(#30 / 31): 7.6 ns/call 8.1 ns/call 2.7 ns/call 3.0 ns/call
CompareWord(#1 / 100): 1.7 ns/call 2.5 ns/call 2.2 ns/call 3.0 ns/call
CompareWord(#50 / 100): 4.6 ns/call 5.1 ns/call 3.0 ns/call 3.4 ns/call
CompareWord(#99 / 100): 6.0 ns/call 6.4 ns/call 3.6 ns/call 4.4 ns/call
CompareWord(#100 / 200): 18 ns/call 18 ns/call 4.1 ns/call 4.6 ns/call
CompareWord(#199 / 200): 25 ns/call 25 ns/call 5.5 ns/call 6.5 ns/call
CompareWord(#999 / 1000): 130 ns/call 139 ns/call 33 ns/call 42 ns/call
CompareWord(#5000 / 10000): 288 ns/call 298 ns/call 140 ns/call 192 ns/call
CompareWord(#9999 / 10000): 436 ns/call 436 ns/call 210 ns/call 293 ns/call
CompareDWord(#0 / 1): 1.8 ns/call 2.1 ns/call 2.7 ns/call 2.5 ns/call
CompareDWord(#1 / 2): 1.8 ns/call 2.4 ns/call 2.7 ns/call 2.5 ns/call
CompareDWord(#2 / 3): 2.0 ns/call 2.5 ns/call 2.7 ns/call 2.7 ns/call
CompareDWord(#6 / 7): 3.1 ns/call 4.0 ns/call 2.3 ns/call 2.6 ns/call
CompareDWord(#14 / 15): 4.9 ns/call 5.4 ns/call 2.9 ns/call 3.3 ns/call
CompareDWord(#30 / 31): 9.9 ns/call 11 ns/call 4.8 ns/call 5.2 ns/call
CompareDWord(#1 / 100): 1.8 ns/call 2.7 ns/call 2.3 ns/call 2.7 ns/call
CompareDWord(#50 / 100): 3.0 ns/call 3.8 ns/call 2.6 ns/call 3.2 ns/call
CompareDWord(#99 / 100): 3.4 ns/call 4.2 ns/call 2.7 ns/call 3.3 ns/call
CompareDWord(#100 / 200): 27 ns/call 30 ns/call 10 ns/call 13 ns/call
CompareDWord(#199 / 200): 28 ns/call 31 ns/call 10 ns/call 14 ns/call
CompareDWord(#999 / 1000): 264 ns/call 318 ns/call 110 ns/call 163 ns/call
CompareDWord(#5000 / 10000): 544 ns/call 630 ns/call 246 ns/call 343 ns/call
CompareDWord(#9999 / 10000): 561 ns/call 654 ns/call 260 ns/call 361 ns/call
x86-64
New Existing
IndexDWord(#0 .. #1): 2.7 ns/call 2.7 ns/call
IndexDWord(#0 .. #2): 2.8 ns/call 3.0 ns/call
IndexDWord(#0 .. #3): 3.2 ns/call 2.4 ns/call
IndexDWord(#0 .. #4): 2.1 ns/call 2.4 ns/call
IndexDWord(#0 .. #14): 2.5 ns/call 2.8 ns/call
IndexDWord(#0 .. #15): 2.5 ns/call 2.8 ns/call
IndexDWord(#0 .. #16): 2.6 ns/call 2.9 ns/call
IndexDWord(#0 .. #49): 3.9 ns/call 3.9 ns/call
IndexDWord(#0 .. #99): 8.1 ns/call 6.6 ns/call
IndexDWord(#0 .. #999): 59 ns/call 60 ns/call
IndexQWord(#0 .. #1): 2.1 ns/call 2.4 ns/call
IndexQWord(#0 .. #2): 2.2 ns/call 2.7 ns/call
IndexQWord(#0 .. #3): 2.3 ns/call 2.6 ns/call
IndexQWord(#0 .. #4): 2.5 ns/call 3.2 ns/call
IndexQWord(#0 .. #14): 3.7 ns/call 4.4 ns/call
IndexQWord(#0 .. #15): 3.8 ns/call 4.5 ns/call
IndexQWord(#0 .. #16): 3.9 ns/call 4.7 ns/call
IndexQWord(#0 .. #49): 11 ns/call 11 ns/call
IndexQWord(#0 .. #99): 23 ns/call 26 ns/call
IndexQWord(#0 .. #999): 132 ns/call 132 ns/call
CompareWord(#0 / 1): 2.4 ns/call 2.4 ns/call
CompareWord(#1 / 2): 2.8 ns/call 2.4 ns/call
CompareWord(#2 / 3): 2.8 ns/call 2.6 ns/call
CompareWord(#6 / 7): 2.8 ns/call 3.0 ns/call
CompareWord(#14 / 15): 2.4 ns/call 2.8 ns/call
CompareWord(#30 / 31): 2.8 ns/call 3.1 ns/call
CompareWord(#1 / 100): 2.3 ns/call 2.5 ns/call
CompareWord(#50 / 100): 3.4 ns/call 3.4 ns/call
CompareWord(#99 / 100): 3.9 ns/call 4.7 ns/call
CompareWord(#100 / 200): 4.3 ns/call 4.5 ns/call
CompareWord(#199 / 200): 5.6 ns/call 6.7 ns/call
CompareWord(#999 / 1000): 33 ns/call 47 ns/call
CompareWord(#5000 / 10000): 140 ns/call 206 ns/call
CompareWord(#9999 / 10000): 210 ns/call 306 ns/call
CompareDWord(#0 / 1): 2.4 ns/call 2.8 ns/call
CompareDWord(#1 / 2): 2.7 ns/call 3.2 ns/call
CompareDWord(#2 / 3): 2.7 ns/call 3.3 ns/call
CompareDWord(#6 / 7): 2.7 ns/call 2.9 ns/call
CompareDWord(#14 / 15): 3.4 ns/call 4.0 ns/call
CompareDWord(#30 / 31): 4.7 ns/call 6.3 ns/call
CompareDWord(#1 / 100): 2.6 ns/call 2.5 ns/call
CompareDWord(#50 / 100): 2.8 ns/call 2.9 ns/call
CompareDWord(#99 / 100): 2.9 ns/call 3.1 ns/call
CompareDWord(#100 / 200): 10 ns/call 15 ns/call
CompareDWord(#199 / 200): 11 ns/call 16 ns/call
CompareDWord(#999 / 1000): 111 ns/call 189 ns/call
CompareDWord(#5000 / 10000): 243 ns/call 378 ns/call
CompareDWord(#9999 / 10000): 258 ns/call 397 ns/call