Faster large set operations (i386, x86-64).
Platform-specific implementations of !469 (merged) for i386 and x86-64 / SSE2 (#AutoVectorizationCant).
Note two things:
- i386 implementations are half the binary size of !469 (merged).
- x86-64 implementations handle 32-byte sets without looping (the loop body is performed for the first 16 bytes but never jumps back), so there are zero taken branches in `set_a := set_b op set_c` and zero to one taken branches in `set_a <= set_b`.
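For reference, here is a minimal sketch of what these helpers compute. It is not the assembly from this MR. It assumes that `set of Byte` is stored as a 32-byte bitmap, which the program checks at startup. Under that assumption, `+` is a byte-wise `or`, `-` is a byte-wise `and not`, and `<=` tests that `a and not b` is zero in every byte. The names `TRaw256`, `RawUnion`, `RawDifference`, `RawSubset` and `ToRaw` are hypothetical and exist only for this illustration.

```pascal
program SetOpsSketch;
{$mode objfpc}

type
  TSet256 = set of Byte;
  TRaw256 = array[0..31] of Byte; // assumed in-memory layout of TSet256

// dest := a + b (union), one byte at a time.
procedure RawUnion(const a, b: TRaw256; out dest: TRaw256);
var
  i: Integer;
begin
  for i := Low(dest) to High(dest) do
    dest[i] := a[i] or b[i];
end;

// dest := a - b (difference), one byte at a time.
procedure RawDifference(const a, b: TRaw256; out dest: TRaw256);
var
  i: Integer;
begin
  for i := Low(dest) to High(dest) do
    dest[i] := a[i] and not b[i];
end;

// a <= b (subset): no bit may be set in a that is clear in b.
function RawSubset(const a, b: TRaw256): Boolean;
var
  i: Integer;
begin
  for i := Low(a) to High(a) do
    if (a[i] and not b[i]) <> 0 then
      Exit(False);
  Result := True;
end;

// Copies the raw bytes of a set into a byte array.
function ToRaw(const s: TSet256): TRaw256;
begin
  Move(s, Result, SizeOf(Result));
end;

var
  b, c, sub, viaSet: TSet256;
  viaRaw: TRaw256;
begin
  // The sketch only makes sense if the layout assumption holds.
  if SizeOf(TSet256) <> SizeOf(TRaw256) then
    Halt(2);

  b := [1, 7, 100, 200];
  c := [7, 8, 200, 255];
  sub := [7, 200];

  // Compare each byte-wise helper against the built-in set operator.
  RawUnion(ToRaw(b), ToRaw(c), viaRaw);
  viaSet := b + c;
  if CompareByte(viaRaw, viaSet, SizeOf(viaRaw)) <> 0 then
    Halt(1);

  RawDifference(ToRaw(b), ToRaw(c), viaRaw);
  viaSet := b - c;
  if CompareByte(viaRaw, viaSet, SizeOf(viaRaw)) <> 0 then
    Halt(1);

  if RawSubset(ToRaw(b), ToRaw(c)) <> (b <= c) then
    Halt(1);
  if RawSubset(ToRaw(sub), ToRaw(c)) <> (sub <= c) then
    Halt(1);

  WriteLn('byte-wise helpers agree with the built-in set operators');
end.
```

The program exits with a non-zero code if any byte-wise result differs from the built-in operator. With SSE2, each of these 32-iteration loops can be covered by two 16-byte vector operations, which is why a 32-byte set needs no backward jump.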
I used Intel syntax for x86-64 because I wanted to avoid referencing each parameter with `{$ifdef}` or using a `{$ifndef windows} mov %rdi, %rcx {$endif}`-style runtime adapter, but I don’t know how to dereference a pointer parameter by name in AT&T. ;_;
Benchmark: X86SetOpsBenchmark.pas compares both implementations with the present versions and with !469 (merged).
My results:
i386
set + set (orig): 18 ns/call
set + set (MR 469): 5.8 ns/call
set + set (i386): 4.2 ns/call
set - set (orig): 20 ns/call
set - set (MR 469): 5.9 ns/call
set - set (i386): 5.4 ns/call
set <= set (orig): 73 ns/call
set <= set (MR 469): 7.4 ns/call
set <= set (i386): 4.2 ns/call
x86-64
set + set (orig): 16 ns/call
set + set (MR 469): 2.7 ns/call
set + set (x86_64): 1.9 ns/call
set - set (orig): 19 ns/call
set - set (MR 469): 2.8 ns/call
set - set (x86_64): 1.7 ns/call
set <= set (orig): 22 ns/call
set <= set (MR 469): 2.3 ns/call
set <= set (x86_64): 1.8 ns/call
Edited by Rika