Replace fastmove.inc with something from this decade.
Hereby I proclaim the blasphemous assertion that the entire fastmove.inc thing is not very good, and that copying the x86_64.inc:Move + !551 approach is better (either faster, or the same speed with less code), at least with SSE on not-too-ancient CPUs, but potentially everywhere with everything.
- The entire crazy unrolling does not seem to do much, and should do even less on older CPUs that can load/store even fewer values per clock. 2× is enough; I couldn’t remember whether I saw any difference with 1× when doing !404 (merged), and a quick check says apparently not.
- Trunk fastmove.inc doesn’t consider it necessary to make sure that the Move distance is large enough before going non-temporal (NT), like x86_64.inc:Move does; as a consequence, the example from !551 runs 3× slower. (And as I recently noticed, this threshold must be not 4 KB but on the order of the NT threshold itself.)
- Trunk fastmove.inc boringly handles tails with ≤4-byte copies (through 2×400-byte jump tables) instead of one more unaligned write of 8 or 16 bytes with the same technology (16 bytes with SSE, 8 with MMX or the FPU). Maybe it was a conscious decision, I don’t know, but given that the author lightheartedly writes unaligned heads, they might simply not have realized that exactly the same can be done with tails.
All of these could have been changed without a complete rewrite, but: 1) I’m lazy, 2) see above about the dubiousness of the fastmove.inc approach, 3) I tried to do some PIC and I have no reference examples of PIC in Intel syntax. I didn’t test my PIC, and maybe it doesn’t even compile, let alone work, but in the worst case you can simply re-disable it with {$ifndef FPC_PIC} {$include fastmove.inc} {$endif} as it was before the MR, while in the best case this enables fastmove.inc for PIC targets, yaaay! (I’ll get to grips with QEMU one day, I remember.)
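The overlapping-tail idea from the list above is easier to see in a few lines of C than in assembly. This is only a sketch of the technique, not the code from this MR: the function name `copy_with_overlapping_tail` is made up, the 16-byte `memcpy` calls stand in for unaligned SSE loads/stores, and the buffers are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Copies n >= 16 bytes in 16-byte blocks. Instead of finishing with a
   chain of 1..4-byte copies, the last 16 bytes are written as one more
   unaligned block that may overlap the previous one.
   Assumes src and dst do not overlap. */
static void copy_with_overlapping_tail(unsigned char *dst,
                                       const unsigned char *src,
                                       size_t n)
{
    unsigned char tail[16], blk[16];

    assert(n >= 16);

    /* Load the final 16 bytes up front. */
    memcpy(tail, src + n - 16, 16);

    /* Whole blocks, stopping before the block that would reach the end. */
    for (size_t i = 0; i + 16 < n; i += 16) {
        memcpy(blk, src + i, 16);
        memcpy(dst + i, blk, 16);
    }

    /* One unaligned store covers whatever remains (1..16 bytes),
       rewriting up to 15 bytes that were already copied. */
    memcpy(dst + n - 16, tail, 16);
}
```

Rewriting a few already-copied bytes is harmless, and it removes the need for a per-length jump table on the tail.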
Benchmark: FastMoveBenchmark.pas. Can be run with parameters plain, mmx, sse, sse-erms, or sse-no-erms to force the corresponding branch and/or notion of REP STOS performance.
My results:

This MR code size: 1040 bytes
Trunk code size: 2112 bytes

This notebook with ERMS (all times in ns/call):

| Call | This MR (SSE+ERMS) | Trunk (SSE) | This MR (MMX) | Trunk (MMX) | This MR (plain) | Trunk (plain) |
| --- | --- | --- | --- | --- | --- | --- |
| Move(1~10) | 2.4 | 2.3 | 2.5 | 2.3 | 2.4 | 2.3 |
| Move(10~20) | 2.1 | 2.5 | 2.2 | 2.5 | 2.0 | 2.4 |
| Move(20~30) | 1.9 | 2.6 | 2.5 | 2.8 | 2.4 | 2.7 |
| Move(30~40) | 2.7 | 3.1 | 3.6 | 3.8 | 3.3 | 3.6 |
| Move(40~50) | 3.4 | 5.7 | 5.0 | 9.1 | 4.4 | 8.1 |
| Move(50~100) | 3.7 | 4.8 | 6.0 | 7.0 | 5.9 | 7.0 |
| Move(100~300) | 7.6 | 10 | 11 | 13 | 13 | 16 |
| Move(300~1000) | 20 | 23 | 32 | 33 | 44 | 47 |
| Move(1000~10000) | 175 | 198 | 266 | 264 | 366 | 367 |
| Move(10000~100000) | 2183 | 2795 | 2959 | 2894 | 3908 | 3869 |
| Move(300000~500000) | 27500 | 35700 | 39950 | 39850 | 42900 | 42900 |
| Move(500000~1000000) | 55100 | 66100 | 77600 | 77600 | 82200 | 82400 |

That notebook without ERMS (all times in ns/call):

| Call | This MR (SSE) | Trunk (SSE) | This MR (MMX) | Trunk (MMX) | This MR (plain) | Trunk (plain) |
| --- | --- | --- | --- | --- | --- | --- |
| Move(1~10) | 11 | 16 | 11 | 15 | 12 | 15 |
| Move(10~20) | 9.1 | 15 | 11 | 15 | 10 | 15 |
| Move(20~30) | 4.9 | 16 | 6.9 | 16 | 6.9 | 16 |
| Move(30~40) | 13 | 18 | 18 | 21 | 18 | 20 |
| Move(40~50) | 13 | 18 | 18 | 25 | 17 | 24 |
| Move(50~100) | 19 | 26 | 28 | 33 | 28 | 35 |
| Move(100~300) | 27 | 37 | 38 | 44 | 52 | 59 |
| Move(300~1000) | 60 | 68 | 86 | 95 | 112 | 121 |
| Move(1000~10000) | 687 | 687 | 831 | 844 | 962 | 954 |
| Move(10000~100000) | 20267 | 20800 | 22900 | 22867 | 20800 | 20800 |
| Move(300000~500000) | 120100 | 121700 | 165300 | 163800 | 159100 | 159100 |
| Move(500000~1000000) | 226200 | 227700 | 294800 | 294800 | 293300 | 293300 |

Furthermore, as mentioned, the example from !551 speeds up by 3× (just from not choosing NT) to 4× (with ERMS), and the size of the
begin
  writeln('Hello world.');
end.
application drops from 103,038 to 101,849 bytes.
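
For reference, the NT decision behind the !551 speedup boils down to a check along these lines. This is a rough C sketch rather than the actual assembly; `should_use_nt` and the `NT_THRESHOLD` value are hypothetical, and the only thing taken from the description above is that the distance threshold should be on the order of the NT threshold itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value, for illustration only; the real threshold is
   whatever the RTL is tuned to. */
#define NT_THRESHOLD ((size_t)256 * 1024)

/* Returns nonzero when non-temporal stores look worthwhile:
   the block is large AND source and destination are far apart.
   When they are close, the destination lines are likely still in
   cache from reading the source, so bypassing the cache only hurts. */
static int should_use_nt(const void *dst, const void *src, size_t count)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t distance = d > s ? d - s : s - d;

    return count >= NT_THRESHOLD && distance >= NT_THRESHOLD;
}
```

A large Move over a short distance (shifting a big array by a few elements, say) therefore stays on the regular cached path.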