Shorter IndexByte_Plain.

Rika requested to merge runewalsh/source:ibplain into main

Make IndexByte_Plain smaller (150 b of code, down from 261 b), which is an unconditional benefit for computers with SSE2, since they never call it. I don’t know how it performs on real hardware without SSE2, but on my computer it also appears to be slightly faster, even for large inputs.

Changes are:

  • 2× unroll, down from 4×. (I’d prefer 1×, but it becomes slower than trunk.)

  • Trunk uses the formula ((x - $01010101) xor x) and not x and $80808080. But xor x is pointless: the state-of-the-art formula is simply (x - $01010101) and not x and $80808080. There is no tricky tradeoff involved: and not x keeps only the bits where x is 0, and xor x leaves exactly those bits unchanged, so the two formulas are completely equivalent, which can also be tested exhaustively :).

  • Tail is handled by over-reading from an aligned address, somewhat similar to IndexByte_SSE2.
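
For illustration only, here is a minimal sketch of the last two points. It is not the code from this merge request: MaskTrunk, MaskNew, IndexByteSketch and Step are hypothetical names, the scan is not unrolled, and it assumes a little-endian target with 32-bit words and FPC's BsfDWord. The program first compares the two formulas over a strided sample of all 32-bit values (set Step to 1 for the exhaustive check), then searches word by word, starting from the aligned address at or below the pointer and rejecting hits that land past the requested length.

```pascal
program ZeroByteMaskSketch;
{$mode objfpc}
{$pointermath on}
{$Q-}{$R-} // wraparound in x - Ones is intended

const
  Ones  = Cardinal($01010101);
  Highs = Cardinal($80808080);
  Step  = 251; // hypothetical sampling stride; 1 makes the check exhaustive

// Formula attributed to trunk in the description.
function MaskTrunk(x: Cardinal): Cardinal;
begin
  Result := Cardinal((Cardinal(x - Ones) xor x) and (not x) and Highs);
end;

// Shorter formula: nonzero iff x contains a zero byte;
// the lowest set bit marks the first zero byte.
function MaskNew(x: Cardinal): Cardinal;
begin
  Result := Cardinal(Cardinal(x - Ones) and (not x) and Highs);
end;

// Hypothetical helper (little-endian only): index of the first byte
// equal to b in p[0..len-1], or -1. Reads whole aligned words, so it may
// touch up to 3 bytes before p and up to 3 bytes after p[len-1].
function IndexByteSketch(p: PByte; len: SizeInt; b: Byte): SizeInt;
var
  wp: PCardinal;
  pattern, x: Cardinal;
  ofs, i: SizeInt;
begin
  Result := -1;
  if len <= 0 then exit;
  pattern := b * Ones;               // b replicated into every byte
  ofs := SizeInt(PtrUInt(p) and 3);  // distance from the aligned address to p
  wp := PCardinal(p - ofs);
  // Force the bytes before p to $FF so they can neither match nor
  // produce borrows that would fake a match in a later byte.
  x := (wp^ xor pattern) or Cardinal((Cardinal(1) shl (8 * ofs)) - 1);
  i := -ofs;                         // index of the word's first byte, relative to p
  repeat
    x := MaskNew(x);
    if x <> 0 then
    begin
      i := i + SizeInt(BsfDWord(x) shr 3);
      if i < len then Result := i;   // a hit past len comes from the over-read
      exit;
    end;
    i := i + 4;
    if i >= len then exit;
    inc(wp);
    x := wp^ xor pattern;
  until false;
end;

var
  buf: array[0..63] of Byte;
  v: QWord;
begin
  v := 0;
  while v <= High(Cardinal) do
  begin
    if MaskTrunk(Cardinal(v)) <> MaskNew(Cardinal(v)) then
    begin
      writeln('mismatch at ', v);
      halt(1);
    end;
    inc(v, Step);
  end;
  writeln('formulas agree on all sampled values');

  FillChar(buf, SizeOf(buf), 1);
  buf[2] := 7;   // sits before the searched range and must be ignored
  buf[20] := 7;
  writeln(IndexByteSketch(@buf[3], 40, 7)); // 17
  writeln(IndexByteSketch(@buf[3], 17, 7)); // -1
end.
```

The searched range is deliberately placed inside a larger array so that the over-reads in this demo stay within memory the program owns. In general an aligned word read cannot cross a page boundary, which is what makes this kind of over-read safe.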

Benchmark: IndexBytePlainBenchmark.pas.

Possible output 1:

                             New          Trunk
IndexByte_Plain(5):      8.2 ns/call    10 ns/call
IndexByte_Plain(15):      13 ns/call    17 ns/call
IndexByte_Plain(50):      24 ns/call    30 ns/call
IndexByte_Plain(70):      33 ns/call    37 ns/call
IndexByte_Plain(150):     51 ns/call    57 ns/call
IndexByte_Plain(300):     77 ns/call    87 ns/call
IndexByte_Plain(1000):   200 ns/call   215 ns/call
IndexByte_Plain(10000):  1.7 us/call   1.9 us/call
Code size:                  150 b         261 b

Possible output 2:

                             New          Trunk
IndexByte_Plain(5):      3.8 ns/call   4.1 ns/call
IndexByte_Plain(15):     3.8 ns/call   4.9 ns/call
IndexByte_Plain(50):     5.5 ns/call   6.9 ns/call
IndexByte_Plain(70):     6.6 ns/call   8.3 ns/call
IndexByte_Plain(150):     12 ns/call    14 ns/call
IndexByte_Plain(300):     22 ns/call    24 ns/call
IndexByte_Plain(1000):    68 ns/call    78 ns/call
IndexByte_Plain(10000):  597 ns/call   710 ns/call
Code size:                  150 b         261 b