Use XMMs in x64 Move.
For me, my Move is better for both small and large cases:
- Small cases (≤32 bytes) are handled by selecting the appropriate branch and doing two unaligned reads + two writes. This is considerably faster, though people whose crucial structures cross page boundaries might disagree. (I also tried 4 reads + 4 writes for 32 < size ≤ 64; it looked much better in the sense of taking far fewer jumps, but the resulting speedup was a bit dubious, something like 3.0 → 2.0 ns.)
- Large cases use XMM transfers. Perhaps the original author avoided them not for irrational reasons but because MOVDQU and even MOVDQA were slower for him than the equivalent two 8-byte transfers, but logically, and on my computer, XMM transfers are better.
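
The two strategies above can be sketched in portable C. This is only an illustration of the idea, not the assembly from this change: `move_small`, `move_sketch` and `block16` are hypothetical names, and the 16-byte `memcpy` calls stand in for MOVDQU (x86-64 compilers typically lower a fixed 16-byte `memcpy` to one unaligned XMM load or store). The assumption is that the two reads in the small case overlap, so one branch covers a whole range of sizes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte value standing in for one XMM register. */
typedef struct { unsigned char b[16]; } block16;

/* n <= 32: choose a branch by size, then do two (possibly overlapping)
   unaligned reads followed by two writes. Both reads complete before
   either write, so overlapping src/dst is handled. */
static void move_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n >= 16) {                      /* 16..32 bytes */
        block16 head, tail;
        memcpy(&head, s, 16);
        memcpy(&tail, s + n - 16, 16);
        memcpy(d, &head, 16);
        memcpy(d + n - 16, &tail, 16);
    } else if (n >= 8) {                /* 8..15 bytes */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {                /* 4..7 bytes */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {                /* 2..3 bytes */
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        d[0] = s[0];
    }
}

/* n > 32: copy 16 bytes at a time. The partial block at the far end is
   loaded up front and stored last, so it survives overlapping moves. */
static void move_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    block16 tmp, edge;
    size_t i;

    if (n <= 32) { move_small(dst, src, n); return; }
    if (d == s) return;

    if ((uintptr_t)d - (uintptr_t)s >= n) {
        /* dst is below src or fully past it: copy forward. */
        memcpy(&edge, s + n - 16, 16);          /* preload the tail */
        for (i = 0; i + 16 <= n; i += 16) {
            memcpy(&tmp, s + i, 16);
            memcpy(d + i, &tmp, 16);
        }
        memcpy(d + n - 16, &edge, 16);
    } else {
        /* dst starts inside src: copy backward. */
        memcpy(&edge, s, 16);                   /* preload the head */
        for (i = n; i >= 16; i -= 16) {
            memcpy(&tmp, s + i - 16, 16);
            memcpy(d + i - 16, &tmp, 16);
        }
        memcpy(d, &edge, 16);
    }
}
```

The small path has no loop at all: a handful of size comparisons pick the branch, which is where the saving in jumps comes from. The page-boundary remark above applies here as well, since an unaligned 16-byte access can straddle two pages.
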
Benchmark: MoveBenchmark.pas.
My results:

| Case | New | Existing |
|---|---|---|
| Move(1~8) | 2.0 ns/call | 3.3 ns/call |
| Move(10~30) | 1.4 ns/call | 4.1 ns/call |
| Move(20~100) | 2.8 ns/call | 6.0 ns/call |
| Move(50~300) | 7.1 ns/call | 10 ns/call |
| Move(100~1000) | 18 ns/call | 25 ns/call |
| Move(1000~10000) | 167 ns/call | 231 ns/call |
| Move(10000~100000) | 1605 ns/call | 2307 ns/call |
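
To read the results as speedups, divide the existing time by the new time. A small C sketch of that arithmetic, using only the numbers reported above (`bench_row` and `speedup` are hypothetical names introduced for illustration):

```c
#include <stddef.h>

/* One row of the results: times in ns/call, copied from the table above. */
typedef struct {
    const char *label;
    double new_ns;
    double old_ns;
} bench_row;

static const bench_row rows[] = {
    { "Move(1~8)",            2.0,    3.3 },
    { "Move(10~30)",          1.4,    4.1 },
    { "Move(20~100)",         2.8,    6.0 },
    { "Move(50~300)",         7.1,   10.0 },
    { "Move(100~1000)",      18.0,   25.0 },
    { "Move(1000~10000)",   167.0,  231.0 },
    { "Move(10000~100000)", 1605.0, 2307.0 },
};

/* Speedup factor: how many times faster the new Move is for this row. */
static double speedup(const bench_row *r)
{
    return r->old_ns / r->new_ns;
}
```

By this measure the gain is largest for Move(10~30), at a little under 3x, and settles at roughly 1.4x for the large cases.
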