Shorter IndexByte_Plain.
Make IndexByte_Plain smaller (150 b of code, down from 261 b), which is an unconditional benefit for computers with SSE2 that aren’t going to call it. I don’t know how it performs on authentic SSE2-incapable hardware, but on my computer it also appears to be slightly faster, even for large cases.
Changes are:
- 2× unroll, down from 4×. (I’d prefer 1×, but it becomes slower than trunk.)
- Trunk uses the formula ((x - $01010101) xor x) and not x and $80808080. But the xor x is pointless: the state of the art is simply (x - $01010101) and not x and $80808080. There is no tricky tradeoff involved; it can be tested exhaustively that these two formulas are completely equivalent :).
- The tail is handled by over-reading from an aligned address, somewhat similar to IndexByte_SSE2.
Benchmark: IndexBytePlainBenchmark.pas.
Possible output 1:

                         New           Trunk
IndexByte_Plain(5):      8.2 ns/call   10 ns/call
IndexByte_Plain(15):     13 ns/call    17 ns/call
IndexByte_Plain(50):     24 ns/call    30 ns/call
IndexByte_Plain(70):     33 ns/call    37 ns/call
IndexByte_Plain(150):    51 ns/call    57 ns/call
IndexByte_Plain(300):    77 ns/call    87 ns/call
IndexByte_Plain(1000):   200 ns/call   215 ns/call
IndexByte_Plain(10000):  1.7 us/call   1.9 us/call
Code size:               150 b         261 b
Possible output 2:

                         New           Trunk
IndexByte_Plain(5):      3.8 ns/call   4.1 ns/call
IndexByte_Plain(15):     3.8 ns/call   4.9 ns/call
IndexByte_Plain(50):     5.5 ns/call   6.9 ns/call
IndexByte_Plain(70):     6.6 ns/call   8.3 ns/call
IndexByte_Plain(150):    12 ns/call    14 ns/call
IndexByte_Plain(300):    22 ns/call    24 ns/call
IndexByte_Plain(1000):   68 ns/call    78 ns/call
IndexByte_Plain(10000):  597 ns/call   710 ns/call
Code size:               150 b         261 b