Faster large set operations.
This complaint matches my experience. Set functions can and should be improved.
Let’s start with something simple, generic and endianness-agnostic. (As long as I don’t know if large sets work differently on big-endian platforms. Probably not, i.e. i
th bit is always pByte(@theset)[i div 8] shr (i mod 8) and 1
?)
This MR greatly improves generic implementations of +
, −
, *
, ><
, =
, <=
over large sets on platforms that allow unaligned accesses, in particular both on i386
and x86-64
. (I suspect that these functions 1) never get arguments not aligned on a PtrUint
boundary when size >= sizeof(PtrUint)
, 2) never have tails I handle, but without a guarantee I’m playing safe.)
Benchmark: SetOpsBenchmark.pas.
My results
i386
set + set (system): 20 ns/call
set + set (orig): 17 ns/call
set + set (new): 5.7 ns/call
set = set (system): 18 ns/call
set = set (orig): 18 ns/call
set = set (new): 3.9 ns/call
set <= set (system): 74 ns/call
set <= set (orig): 73 ns/call
set <= set (new): 4.3 ns/call
x86-64
set + set (system): 17 ns/call
set + set (orig): 16 ns/call
set + set (new): 3.6 ns/call
set = set (system): 13 ns/call
set = set (orig): 17 ns/call
set = set (new): 4.0 ns/call
set <= set (system): 24 ns/call
set <= set (orig): 24 ns/call
set <= set (new): 2.7 ns/call
Edited by Rika