Move with Move instead of string instructions.
I noticed this independently, but now I've got my hands on an elaborate benchmark.
Read this topic and try this: HeapsortBenchmark.pas, preferably on a CPU with FSRM (I don’t have one). It heapsorts arrays of various types using either Move or operator := for assignments.
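To make the two variants concrete, here is a minimal sketch of the same swap written both ways. It is not taken from HeapsortBenchmark.pas: TElem, SwapAssign and SwapMove are made-up names, and the 30-byte record is only chosen to match one of the sizes measured below. Note that Move is a raw byte copy, so it is only a valid substitute for := on types without managed fields (no ansistrings, interfaces, dynamic arrays).

```pascal
program MoveVsAssign;
{$mode objfpc}

type
  // Hypothetical 30-byte element; the benchmark's real element types may differ.
  TElem = packed record
    Key: longint;                    // 4 bytes
    Payload: array[0 .. 25] of byte; // 26 bytes
  end;

// Swap using operator :=, so the compiler chooses the copy sequence
// (plain MOVs or a REP MOVS string instruction, depending on size and target).
procedure SwapAssign(var a, b: TElem);
var
  t: TElem;
begin
  t := a;
  a := b;
  b := t;
end;

// Swap using explicit calls to the RTL Move routine.
procedure SwapMove(var a, b: TElem);
var
  t: TElem;
begin
  Move(a, t, SizeOf(TElem));
  Move(b, a, SizeOf(TElem));
  Move(t, b, SizeOf(TElem));
end;

var
  x, y: TElem;
begin
  FillChar(x, SizeOf(x), 0);
  FillChar(y, SizeOf(y), 0);
  x.Key := 1;
  y.Key := 2;
  SwapAssign(x, y); // now x.Key = 2, y.Key = 1
  SwapMove(x, y);   // swapped back: x.Key = 1, y.Key = 2
  writeln(SizeOf(TElem), ' ', x.Key, ' ', y.Key);
end.
```

Compiling with -al (keep the generated assembler file) should show which instruction sequence the compiler picked for the := version.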
My results:
Fast Short REP MOV (FSRM): no.
x86-64
10 bytes, Move: 239 us/call, := (mov 8+2): 201 us/call
20 bytes, Move: 104 us/call, := (mov 2x8+4): 91 us/call
30 bytes, Move: 68 us/call, := (rep 3x8+2): 204 us/call ← !
32 bytes, Move: 58 us/call, := (mov 2x16): 43 us/call
40 bytes, Move: 54 us/call, := (mov 2x16+8): 41 us/call
64 bytes, Move: 34 us/call, := (rep 8x8): 69 us/call ← !
100 bytes, Move: 23 us/call, := (rep 12x8+4): 57 us/call ← !
i386
10 bytes, Move: 284 us/call, := (mov 2x4+2): 247 us/call
20 bytes, Move: 123 us/call, := (rep 5x4): 320 us/call ← !
30 bytes, Move: 81 us/call, := (rep 7x4+2): 226 us/call ← !
32 bytes, Move: 70 us/call, := (rep 8x4): 182 us/call ← !
40 bytes, Move: 73 us/call, := (rep 10x4): 148 us/call ← !
64 bytes, Move: 45 us/call, := (rep 16x4): 82 us/call ← !
100 bytes, Move: 36 us/call, := (rep 25x4): 66 us/call ← !
As you can see, REPs (marked with ← !) are terrible. (Mind you, these times are full heapsorts, not just a bunch of assignments.) Depending on how they perform with FSRM, I propose to either:
- Completely replace this with a Move call (I hope it’s possible? :D) (preferably inlined; my gut says this stage is too late for inlining...), or
- Keep the string branch, but use it only on CPUs with FSRM (Ice Lake+, Zen3+).
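
For anyone unsure whether their machine has FSRM before running the benchmark: the feature is reported in CPUID leaf 7, subleaf 0, EDX bit 4. Below is a minimal sketch of that check, x86-64 only, using FPC inline assembly. The names CpuidMaxBasicLeaf, CpuidLeaf7Edx and HasFSRM are made up for illustration. This is a runtime probe of the benchmark machine, not a suggestion for how the compiler itself should decide.

```pascal
program CheckFsrm;
{$mode objfpc}
{$asmmode intel}

{$ifdef CPUX86_64}
// Highest supported basic CPUID leaf (CPUID leaf 0, EAX).
function CpuidMaxBasicLeaf: longword; assembler; nostackframe;
asm
  mov r8, rbx      // CPUID overwrites RBX, which is callee-saved
  xor eax, eax
  cpuid
  mov rbx, r8
end;

// EDX of CPUID leaf 7, subleaf 0 (structured extended feature flags).
function CpuidLeaf7Edx: longword; assembler; nostackframe;
asm
  mov r8, rbx
  mov eax, 7
  xor ecx, ecx
  cpuid
  mov eax, edx     // return EDX as the function result
  mov rbx, r8
end;

function HasFSRM: boolean;
begin
  // Bit 4 of EDX is the Fast Short REP MOV flag.
  result := (CpuidMaxBasicLeaf >= 7) and ((CpuidLeaf7Edx and (1 shl 4)) <> 0);
end;
{$else}
function HasFSRM: boolean;
begin
  result := false; // not probed on this target in this sketch
end;
{$endif}

begin
  writeln('FSRM: ', HasFSRM);
end.
```

On Linux the same information is also listed as the fsrm flag in /proc/cpuinfo.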