Post-modern CompareByte for i386/SSE2.
I have redone SSE2 CompareByte
for i386
by hand, adapting most of the ideas from glibc
’s x86-64/AVX2 version that use more code for more speed. (For its own i386/SSE*
versions it uses something completely different and VERY scary, didn’t even try to understand and these things look unhealthy for your I-cache.)
Changes from before:
-
Faster test for page crossing (at the cost of a lot of false positives; but full test needs a second register), in the necessary conjunction with faster cross-page fallback that softens the impact of these false positives (and, of course, improves actual page crossings). (Btw, you can test its speed without crossing pages by changing the conditions from
cmp $4095, %eax; ja .CantOverRead
to, say,cmp $0, %eax; jae .CantOverRead
.) -
Up to 32 first bytes (two first XMMs) are tested early without entering the loop. This shortcuts many scenarios, for example, colliding hash table keys are often either equal or wildly different. If there are 64 bytes (four XMMs) or less in total, they are analyzed with up to 4 hardcoded comparisons without entering the loop at all.
-
Loop aligns
buf1
, is unrolled by 2×, and analyzes these pairs by looking first at the reducedpmovmskb(pcmpeqb(A, B) pand pcmpeqb(A + 16, B + 16))
, which is an optimization for a long sequence of equal vectors: first 32 bytes were equal if you got here, so this will likely continue.
Two downsides:
-
500 b of code, up from 260 b.
-
My version does not work if
len
is positive butbuf + len
overflowsPtrUint
. It affects only the aligned loop and can be fixed by adding something like “add len, buf; jc .LFixupLength; sub len, buf
” to its preparations, but do you really need that anyway?
Benchmark: CompareBytePostmodernBenchmark.pas.
My results ( ‾́ ◡ ‾́ )
SSE2 (modern) SSE2 (postmodern)
CompareByte(#0 / 1): 2.0 ns/call 1.4 ns/call
CompareByte(#6 / 7): 2.4 ns/call 1.8 ns/call
CompareByte(#14 / 15): 2.4 ns/call 1.8 ns/call
CompareByte(#30 / 31): 2.8 ns/call 2.8 ns/call
CompareByte(#1 / 100): 2.2 ns/call 1.8 ns/call
CompareByte(#50 / 100): 5.1 ns/call 3.2 ns/call
CompareByte(#99 / 100): 6.2 ns/call 5.3 ns/call
CompareByte(#100 / 200): 6.7 ns/call 5.0 ns/call
CompareByte(#199 / 200): 10 ns/call 7.1 ns/call
CompareByte(#500 / 1000): 22 ns/call 14 ns/call
CompareByte(#999 / 1000): 41 ns/call 25 ns/call
CompareByte(#5000 / 10000): 203 ns/call 132 ns/call
CompareByte(#9999 / 10000): 385 ns/call 253 ns/call