Improve generic Index*.
I followed one of my relatively hot paths and noticed a signed division produced by subtracting typed pointers in generic IndexQWord
(x86-64
uses it, as well as IndexDWord
). Hence, I have four ideas about generic Index*
implementations, from least to most dubious.
- Force unsigned division.
Patch: index_v1.patch.
- Do the above plus force
(x >= CONST1) and (x <= CONST2)
→unsigned(x - CONST1) <= CONST2 - CONST1
optimization (unless #39306 (closed) fixed).
Patch: index_v2.patch, incremental from v1: index_v1to2.patch.
- Do everything of the above plus speculate
pend
calculation (which is also used in the condition).
Patch: index_v3.patch, incremental from v2: index_v2to3.patch.
- Do everything of the above plus tighten loops, at the cost of the extra comparison at the end.
Patch: index_v4.patch, incremental from v3: index_v3to4.patch.
I don’t have meaningful benchmarks though (or, rather, I see 2× speedup on x86-64 both for IndexDWord
and IndexQWord
mainly due to #4, but don’t want to believe it so easily). My results with index_benchmark.pas on
x86-32 (System.IndexByte
, System.IndexWord
and System.IndexDWord
are special)
System.IndexByte IndexByteV1 (80 b code) IndexByteV4 (96 b code)
n=1: 3.3 ns/call n=1: 2.9 ns/call n=1: 2.5 ns/call
n=10: 3.7 ns/call n=10: 7.7 ns/call n=10: 3.7 ns/call
n=100: 10.0 ns/call n=100: 53.6 ns/call n=100: 24.6 ns/call
System.IndexWord IndexWordV1 (96 b code) IndexWordV4 (80 b code)
n=1: 15.7 ns/call n=1: 3.2 ns/call n=1: 2.7 ns/call
n=10: 18.2 ns/call n=10: 5.9 ns/call n=10: 4.2 ns/call
n=100: 39.7 ns/call n=100: 32.3 ns/call n=100: 25.4 ns/call
System.IndexDWord IndexDWordV1 (112 b code) IndexDWordV4 (96 b code)
n=1: 15.7 ns/call n=1: 2.9 ns/call n=1: 2.8 ns/call
n=10: 17.9 ns/call n=10: 6.0 ns/call n=10: 4.9 ns/call
n=100: 39.5 ns/call n=100: 34.3 ns/call n=100: 33.4 ns/call
System.IndexQWord IndexQWordV1 (128 b code) IndexQWordV4 (112 b code)
n=1: 3.7 ns/call n=1: 3.5 ns/call n=1: 3.8 ns/call
n=10: 5.7 ns/call n=10: 5.4 ns/call n=10: 6.1 ns/call
n=100: 38.5 ns/call n=100: 38.4 ns/call n=100: 33.8 ns/call
x86-64 (System.IndexByte
and System.IndexWord
are special)
System.IndexByte IndexByteV1 (96 b code) IndexByteV4 (80 b code)
n=1: 3.3 ns/call n=1: 2.5 ns/call n=1: 2.4 ns/call
n=10: 2.8 ns/call n=10: 5.0 ns/call n=10: 3.3 ns/call
n=100: 4.6 ns/call n=100: 33.8 ns/call n=100: 23.0 ns/call
System.IndexWord IndexWordV1 (96 b code) IndexWordV4 (80 b code)
n=1: 3.7 ns/call n=1: 3.0 ns/call n=1: 2.3 ns/call
n=10: 3.6 ns/call n=10: 6.4 ns/call n=10: 3.6 ns/call
n=100: 6.7 ns/call n=100: 45.1 ns/call n=100: 23.0 ns/call
System.IndexDWord IndexDWordV1 (112 b code) IndexDWordV4 (96 b code)
n=1: 3.2 ns/call n=1: 3.2 ns/call n=1: 3.2 ns/call
n=10: 6.2 ns/call n=10: 6.5 ns/call n=10: 3.7 ns/call
n=100: 45.7 ns/call n=100: 44.2 ns/call n=100: 23.8 ns/call
System.IndexQWord IndexQWordV1 (112 b code) IndexQWordV4 (96 b code)
n=1: 2.9 ns/call n=1: 3.2 ns/call n=1: 2.5 ns/call
n=10: 6.2 ns/call n=10: 6.8 ns/call n=10: 4.0 ns/call
n=100: 44.8 ns/call n=100: 45.2 ns/call n=100: 25.7 ns/call