Faster large set operations (i386, x86-64). (!475)

Rika requested to merge runewalsh/source:setx86ops into main Aug 09, 2023

Platform-specific implementations of !469 (merged) for i386 and x86-64 / SSE2 (#AutoVectorizationCant).

Note two things:

- i386 implementations take half the binary size of those in !469 (merged).
- x86-64 implementations handle 32-byte sets without looping (the loop body is executed once for the first 16 bytes but never jumps back), so there are zero taken branches in set_a := set_b op set_c and zero ~~to one~~ taken branches in set_a <= set_b.
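
To illustrate the second point, here is a minimal C sketch of how a 32-byte set can be processed as two 16-byte SSE2 halves with no loop and no taken branches. It is not the code from this merge request (which is assembly in the RTL). The type `set256` and the functions `set256_add`, `set256_sub` and `set256_subset` are hypothetical names made up for the example, and the scalar fallback is only there so the sketch compiles on non-SSE2 targets.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical type: a 256-element set stored as 32 bytes. */
typedef struct { uint8_t bits[32]; } set256;

#if HAVE_SSE2
/* Unaligned 16-byte load/store of one half (half = 0 or 1) of a set. */
static inline __m128i load_half(const set256 *s, int half)
{
    return _mm_loadu_si128((const __m128i *)(s->bits + 16 * half));
}
static inline void store_half(set256 *s, int half, __m128i v)
{
    _mm_storeu_si128((__m128i *)(s->bits + 16 * half), v);
}
#endif

/* dest := a + b (union): OR both halves, straight-line code. */
void set256_add(set256 *dest, const set256 *a, const set256 *b)
{
#if HAVE_SSE2
    __m128i lo = _mm_or_si128(load_half(a, 0), load_half(b, 0));
    __m128i hi = _mm_or_si128(load_half(a, 1), load_half(b, 1));
    store_half(dest, 0, lo);
    store_half(dest, 1, hi);
#else
    for (int i = 0; i < 32; i++) dest->bits[i] = a->bits[i] | b->bits[i];
#endif
}

/* dest := a - b (difference): a AND NOT b.
   _mm_andnot_si128(x, y) computes (~x) & y, so b goes first. */
void set256_sub(set256 *dest, const set256 *a, const set256 *b)
{
#if HAVE_SSE2
    __m128i lo = _mm_andnot_si128(load_half(b, 0), load_half(a, 0));
    __m128i hi = _mm_andnot_si128(load_half(b, 1), load_half(a, 1));
    store_half(dest, 0, lo);
    store_half(dest, 1, hi);
#else
    for (int i = 0; i < 32; i++) dest->bits[i] = a->bits[i] & (uint8_t)~b->bits[i];
#endif
}

/* a <= b (subset test): true when a AND NOT b has no bits set.
   Both halves are combined first, so a single comparison decides. */
int set256_subset(const set256 *a, const set256 *b)
{
#if HAVE_SSE2
    __m128i lo = _mm_andnot_si128(load_half(b, 0), load_half(a, 0));
    __m128i hi = _mm_andnot_si128(load_half(b, 1), load_half(a, 1));
    __m128i extra = _mm_or_si128(lo, hi);
    __m128i is_zero = _mm_cmpeq_epi8(extra, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
#else
    uint8_t extra = 0;
    for (int i = 0; i < 32; i++) extra |= a->bits[i] & (uint8_t)~b->bits[i];
    return extra == 0;
#endif
}
```

The sketch simply unrolls the two halves by hand, while the description above says the actual x86-64 routines reuse the loop body for the first 16 bytes without jumping back. The observable effect for 32-byte sets should be similar, but the instruction sequence will differ.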

I used Intel syntax for x86-64 because I wanted to avoid referencing each parameter with {$ifdef} or using a {$ifndef windows} mov %rdi, %rcx {$endif}-style runtime adapter, and I don’t know how to dereference a pointer parameter by name in AT&T. ;_;

Benchmark: X86SetOpsBenchmark.pas compares both implementations with the present versions and with !469 (merged).

My results.

i386

set + set (orig):        18 ns/call
set + set (MR 469):      5.8 ns/call
set + set (i386):        4.2 ns/call

set - set (orig):        20 ns/call
set - set (MR 469):      5.9 ns/call
set - set (i386):        5.4 ns/call

set <= set (orig):       73 ns/call
set <= set (MR 469):     7.4 ns/call
set <= set (i386):       4.2 ns/call

x86-64

set + set (orig):        16 ns/call
set + set (MR 469):      2.7 ns/call
set + set (x86_64):      1.9 ns/call

set - set (orig):        19 ns/call
set - set (MR 469):      2.8 ns/call
set - set (x86_64):      1.7 ns/call

set <= set (orig):       22 ns/call
set <= set (MR 469):     2.3 ns/call
set <= set (x86_64):     1.8 ns/call

Edited Aug 11, 2023 by Rika
