Move with Move instead of string instructions.

I noticed this independently, but now I have an elaborate benchmark for it.

Read this topic and try this: HeapsortBenchmark.pas, preferably on a CPU with FSRM (I don’t have one). It heapsorts arrays of various types using either Move or operator := for assignments.
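For readers without the benchmark at hand, here is a minimal sketch of the two assignment styles being compared. It is not taken from HeapsortBenchmark.pas: the names `TElem`, `SwapAssign` and `SwapMove` are made up for illustration, and the 30-byte element is just one of the sizes from the results. Whether `:=` actually compiles to a `rep movs` sequence depends on the compiler version and target.

```pascal
program MoveVsAssign;
{$mode objfpc}

type
  // Hypothetical 30-byte element; the real benchmark uses several sizes.
  TElem = array[0 .. 29] of Byte;

// Swap using plain assignments; the compiler picks the copy strategy
// (a few MOVs or a REP string instruction, depending on size and target).
procedure SwapAssign(var A, B: TElem);
var
  T: TElem;
begin
  T := A;
  A := B;
  B := T;
end;

// Swap using explicit calls to the RTL Move routine.
procedure SwapMove(var A, B: TElem);
var
  T: TElem;
begin
  Move(A, T, SizeOf(TElem));
  Move(B, A, SizeOf(TElem));
  Move(T, B, SizeOf(TElem));
end;

var
  X, Y: TElem;
  I: Integer;
begin
  for I := 0 to High(X) do
  begin
    X[I] := I;
    Y[I] := 255 - I;
  end;

  // Two swaps in a row must restore the original contents,
  // so both variants can be checked against each other.
  SwapAssign(X, Y);
  SwapMove(X, Y);

  for I := 0 to High(X) do
    if (X[I] <> I) or (Y[I] <> 255 - I) then
    begin
      WriteLn('mismatch at index ', I);
      Halt(1);
    end;
  WriteLn('both swap variants agree');
end.
```

Compiling this with `-al` and looking at the generated assembly for `SwapAssign` should show which strategy the compiler chose for the given element size.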

My results:

Fast Short REP MOV (FSRM): no.

x86-64
10 bytes, Move:    239 us/call,    := (mov 8+2):     201 us/call
20 bytes, Move:    104 us/call,    := (mov 2x8+4):    91 us/call
30 bytes, Move:     68 us/call,    := (rep 3x8+2):   204 us/call   ← !
32 bytes, Move:     58 us/call,    := (mov 2x16):     43 us/call
40 bytes, Move:     54 us/call,    := (mov 2x16+8):   41 us/call
64 bytes, Move:     34 us/call,    := (rep 8x8):      69 us/call   ← !
100 bytes, Move:    23 us/call,    := (rep 12x8+4):   57 us/call   ← !

i386
10 bytes, Move:    284 us/call,    := (mov 2x4+2):   247 us/call
20 bytes, Move:    123 us/call,    := (rep 5x4):     320 us/call   ← !
30 bytes, Move:     81 us/call,    := (rep 7x4+2):   226 us/call   ← !
32 bytes, Move:     70 us/call,    := (rep 8x4):     182 us/call   ← !
40 bytes, Move:     73 us/call,    := (rep 10x4):    148 us/call   ← !
64 bytes, Move:     45 us/call,    := (rep 16x4):     82 us/call   ← !
100 bytes, Move:    36 us/call,    := (rep 25x4):     66 us/call   ← !

As you can see, the REP variants (marked with ← !) are terrible: roughly 1.8 to 3 times slower than Move at the same size. (Mind you, these times are for full heapsorts, not just a bunch of assignments.) Depending on how they perform with FSRM, I propose to either:

Completely replace this with a Move call (I hope it’s possible? :D) (preferably inlined; my gut says this stage is too late for inlining...), or
Keep the string branch, but use it only on CPUs with FSRM (Ice Lake+, Zen3+).
