Replace fastmove.inc with something from this decade. (!555)

Rika requested to merge runewalsh/source:fastmove into main Dec 04, 2023

Hereby I proclaim the blasphemous assertion that the entire fastmove.inc thing is not very good and that copying the x86_64.inc:Move + !551 approach is better (either faster, or the same speed with less code), at least with SSE on not too ancient CPUs, but potentially everywhere with everything.

The entire crazy unrolling does not seem to do much, and should do even less on older CPUs that are able to load/store even fewer values per clock. 2× unrolling is enough, and I can’t be sure I saw any difference from 1× when doing !404 (merged)... just checked: apparently not.
Trunk fastmove.inc doesn’t consider it necessary to make sure that the Move distance is large enough before going NT (non-temporal), as x86_64.inc:Move does; as a consequence, the example from !551 runs 3× slower. (And as I recently noticed, this threshold must be not 4 KB but on the order of the NT threshold itself.)
Trunk fastmove.inc boringly handles tails with ≤4-byte copies (through 2×400-byte jump tables) instead of one more unaligned write of 8 or 16 bytes using the same technique as the main copy (16-byte SSE, 8-byte MMX, or 8-byte FPU moves). Maybe it was a conscious decision, I don’t know, but given that the author light-heartedly writes unaligned heads, he might simply not have realized that exactly the same can be done with tails.
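To make the tail point concrete, here is a minimal sketch in plain Pascal rather than assembly. It is not code from this MR: the helper name CopyWithOverlappingTail is hypothetical, it assumes Count >= 8 and non-overlapping buffers (real Move must also handle overlap), and it relies on the target tolerating unaligned 8-byte accesses, as x86 does. The idea is just that the last 8 bytes are copied by one extra move that may overlap bytes already written, so no ≤4-byte tail handling is needed.

```pascal
program OverlappingTailSketch;
{$mode objfpc}

{ Hypothetical helper for illustration only (not from the MR).
  Assumes Count >= 8, non-overlapping buffers, and a CPU that
  tolerates unaligned 8-byte loads and stores (true on x86). }
procedure CopyWithOverlappingTail(Src, Dst: PByte; Count: SizeUInt);
var
  Offset: SizeUInt;
begin
  Offset := 0;
  { Main loop: copy 8-byte blocks while more than 8 bytes remain. }
  while Offset + 8 < Count do
  begin
    PQWord(Dst + Offset)^ := PQWord(Src + Offset)^;
    Inc(Offset, 8);
  end;
  { Tail: one more 8-byte move ending exactly at Count. It may rewrite
    up to 7 bytes the loop already copied, which is harmless here. }
  PQWord(Dst + Count - 8)^ := PQWord(Src + Count - 8)^;
end;

var
  SrcBuf, DstBuf: array[0..36] of Byte;
  i: Integer;
begin
  for i := 0 to High(SrcBuf) do
    SrcBuf[i] := i + 1;
  FillChar(DstBuf, SizeOf(DstBuf), 0);

  { 29 is deliberately not a multiple of 8. }
  CopyWithOverlappingTail(@SrcBuf[0], @DstBuf[0], 29);

  { The first 29 bytes must match and byte 29 must stay untouched. }
  if CompareByte(SrcBuf, DstBuf, 29) <> 0 then
    Halt(1);
  if DstBuf[29] <> 0 then
    Halt(2);
  writeln('29 bytes copied using 8-byte moves only');
end.
```

The same shape applies with 16-byte moves; only the block size and the minimum Count change.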

All of these could have been changed without a complete rewrite, but: 1) I’m lazy, 2) see above about the dubiousness of the fastmove.inc approach, 3) I tried to do some PIC and I have no references for PIC in Intel syntax. I didn’t test my PIC and maybe it doesn’t even compile, let alone work, but in the worst case you can simply re-disable it with {$ifndef FPC_PIC} {$include fastmove.inc} {$endif} as it was before the MR, while in the best case this enables fastmove.inc for PIC targets, yaaay! (I’ll get to grips with QEMU one day, I remember.)

Benchmark: FastMoveBenchmark.pas. It can be run with the parameters plain, mmx, sse, sse-erms, or sse-no-erms to force the corresponding branch and/or notion of REP STOS performance.

My results:

This MR code size: 1040 bytes
Trunk code size:   2112 bytes

                                              This notebook with ERMS

                     This MR (SSE+ERMS)   Trunk (SSE)   This MR (MMX)     Trunk (MMX)  This MR (plain)   Trunk (plain)
Move(1~10):              2.4 ns/call      2.3 ns/call     2.5 ns/call     2.3 ns/call     2.4 ns/call     2.3 ns/call
Move(10~20):             2.1 ns/call      2.5 ns/call     2.2 ns/call     2.5 ns/call     2.0 ns/call     2.4 ns/call
Move(20~30):             1.9 ns/call      2.6 ns/call     2.5 ns/call     2.8 ns/call     2.4 ns/call     2.7 ns/call
Move(30~40):             2.7 ns/call      3.1 ns/call     3.6 ns/call     3.8 ns/call     3.3 ns/call     3.6 ns/call
Move(40~50):             3.4 ns/call      5.7 ns/call     5.0 ns/call     9.1 ns/call     4.4 ns/call     8.1 ns/call
Move(50~100):            3.7 ns/call      4.8 ns/call     6.0 ns/call     7.0 ns/call     5.9 ns/call     7.0 ns/call
Move(100~300):           7.6 ns/call       10 ns/call      11 ns/call      13 ns/call      13 ns/call      16 ns/call
Move(300~1000):           20 ns/call       23 ns/call      32 ns/call      33 ns/call      44 ns/call      47 ns/call
Move(1000~10000):        175 ns/call      198 ns/call     266 ns/call     264 ns/call     366 ns/call     367 ns/call
Move(10000~100000):     2183 ns/call     2795 ns/call    2959 ns/call    2894 ns/call    3908 ns/call    3869 ns/call
Move(300000~500000):   27500 ns/call    35700 ns/call   39950 ns/call   39850 ns/call   42900 ns/call   42900 ns/call
Move(500000~1000000):  55100 ns/call    66100 ns/call   77600 ns/call   77600 ns/call   82200 ns/call   82400 ns/call

                                              That notebook without ERMS

                       This MR (SSE)      Trunk (SSE)   This MR (MMX)     Trunk (MMX)  This MR (plain)   Trunk (plain)
Move(1~10):               11 ns/call       16 ns/call      11 ns/call      15 ns/call      12 ns/call      15 ns/call
Move(10~20):             9.1 ns/call       15 ns/call      11 ns/call      15 ns/call      10 ns/call      15 ns/call
Move(20~30):             4.9 ns/call       16 ns/call     6.9 ns/call      16 ns/call     6.9 ns/call      16 ns/call
Move(30~40):              13 ns/call       18 ns/call      18 ns/call      21 ns/call      18 ns/call      20 ns/call
Move(40~50):              13 ns/call       18 ns/call      18 ns/call      25 ns/call      17 ns/call      24 ns/call
Move(50~100):             19 ns/call       26 ns/call      28 ns/call      33 ns/call      28 ns/call      35 ns/call
Move(100~300):            27 ns/call       37 ns/call      38 ns/call      44 ns/call      52 ns/call      59 ns/call
Move(300~1000):           60 ns/call       68 ns/call      86 ns/call      95 ns/call     112 ns/call     121 ns/call
Move(1000~10000):        687 ns/call      687 ns/call     831 ns/call     844 ns/call     962 ns/call     954 ns/call
Move(10000~100000):    20267 ns/call    20800 ns/call   22900 ns/call   22867 ns/call   20800 ns/call   20800 ns/call
Move(300000~500000):  120100 ns/call   121700 ns/call  165300 ns/call  163800 ns/call  159100 ns/call  159100 ns/call
Move(500000~1000000): 226200 ns/call   227700 ns/call  294800 ns/call  294800 ns/call  293300 ns/call  293300 ns/call

Furthermore: as mentioned, the example from !551 speeds up by 3× (just from not choosing NT) to 4× (with ERMS), and the

begin
	writeln('Hello world.');
end.

application size reduces from 103,038 to 101,849 bytes.
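For reference, the "not choosing NT" condition behind that speed-up can be sketched as a small predicate. This is an illustration under stated assumptions, not the code of x86_64.inc:Move or of this MR: the function name ShouldUseNonTemporal is hypothetical, the NtThreshold value is a placeholder (a real one would be CPU- and implementation-specific), and using the same constant for both the size and the distance check merely follows the remark that the distance threshold should be on the order of the NT threshold itself.

```pascal
program NtDecisionSketch;
{$mode objfpc}

const
  { Placeholder value for illustration only, not taken from the MR. }
  NtThreshold = 256 * 1024;

{ Hypothetical predicate: go non-temporal only when the block is large
  AND source and destination are far enough apart. }
function ShouldUseNonTemporal(Src, Dst: Pointer; Count: SizeUInt): Boolean;
var
  Distance: PtrUInt;
begin
  { Absolute distance between the two addresses. }
  if PtrUInt(Dst) >= PtrUInt(Src) then
    Distance := PtrUInt(Dst) - PtrUInt(Src)
  else
    Distance := PtrUInt(Src) - PtrUInt(Dst);
  Result := (Count >= NtThreshold) and (Distance >= NtThreshold);
end;

var
  Buf: array[0..1024 * 1024 - 1] of Byte;
begin
  { Large move over a tiny distance: stay with ordinary stores. }
  if ShouldUseNonTemporal(@Buf[0], @Buf[16], 512 * 1024) then
    Halt(1);
  { Large move over a large distance: NT stores are allowed. }
  if not ShouldUseNonTemporal(@Buf[0], @Buf[512 * 1024], 512 * 1024) then
    Halt(2);
  { Small move: never NT, whatever the distance. }
  if ShouldUseNonTemporal(@Buf[0], @Buf[512 * 1024], 100) then
    Halt(3);
  writeln('NT decision sketch behaves as described');
end.
```

The first check is the case this MR cares about: a big block shifted by a small distance, where skipping NT is what produces the 3× difference reported above.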

Edited Dec 05, 2023 by Rika
