Post-modern CompareByte for i386/SSE2. (!397) · Merge requests · FPC / FPC / FPC Source

Rika requested to merge runewalsh/source:cb-pm-i386 into main Apr 09, 2023

I have redone SSE2 CompareByte for i386 by hand, adapting most of the ideas from glibc’s x86-64/AVX2 version that use more code for more speed. (For its own i386/SSE* versions it uses something completely different and VERY scary, didn’t even try to understand and these things look unhealthy for your I-cache.)

Changes from before:

Faster test for page crossing (at the cost of a lot of false positives; but full test needs a second register), in the necessary conjunction with faster cross-page fallback that softens the impact of these false positives (and, of course, improves actual page crossings). (Btw, you can test its speed without crossing pages by changing the conditions from cmp $4095, %eax; ja .CantOverRead to, say, cmp $0, %eax; jae .CantOverRead.)
Up to 32 first bytes (two first XMMs) are tested early without entering the loop. This shortcuts many scenarios, for example, colliding hash table keys are often either equal or wildly different. If there are 64 bytes (four XMMs) or less in total, they are analyzed with up to 4 hardcoded comparisons without entering the loop at all.
Loop aligns buf1, is unrolled by 2×, and analyzes these pairs by looking first at the reduced pmovmskb(pcmpeqb(A, B) pand pcmpeqb(A + 16, B + 16)), which is an optimization for a long sequence of equal vectors: first 32 bytes were equal if you got here, so this will likely continue.

Two downsides:

500 b of code, up from 260 b.
My version does not work if len is positive but buf + len overflows PtrUint. It affects only the aligned loop and can be fixed by adding something like “add len, buf; jc .LFixupLength; sub len, buf” to its preparations, but do you really need that anyway?

Benchmark: CompareBytePostmodernBenchmark.pas.

My results ( ‾́ ◡ ‾́ )

                             SSE2 (modern)   SSE2 (postmodern)

CompareByte(#0 / 1):           2.0 ns/call      1.4 ns/call
CompareByte(#6 / 7):           2.4 ns/call      1.8 ns/call
CompareByte(#14 / 15):         2.4 ns/call      1.8 ns/call
CompareByte(#30 / 31):         2.8 ns/call      2.8 ns/call
CompareByte(#1 / 100):         2.2 ns/call      1.8 ns/call
CompareByte(#50 / 100):        5.1 ns/call      3.2 ns/call
CompareByte(#99 / 100):        6.2 ns/call      5.3 ns/call
CompareByte(#100 / 200):       6.7 ns/call      5.0 ns/call
CompareByte(#199 / 200):        10 ns/call      7.1 ns/call
CompareByte(#500 / 1000):       22 ns/call       14 ns/call
CompareByte(#999 / 1000):       41 ns/call       25 ns/call
CompareByte(#5000 / 10000):    203 ns/call      132 ns/call
CompareByte(#9999 / 10000):    385 ns/call      253 ns/call

Edited Apr 12, 2023 by Rika

Post-modern CompareByte for i386/SSE2.

Merge request reports