Fill* / i386: improve small cases.
Even rep stos has a large startup cost, at least without the Golden Cove “Fast Short REP STOSB” feature. Without “Enhanced REP MOVSB and STOSB (ERMSB)” it is always slow, as on Nehalem below, and could benefit from manual vectorization instead, but I guess that isn’t worth the indirection, the SSE version itself, and the, ahem, CPUID check for ERMSB. So this patch uses a ladder or a 4×4-byte loop for small cases, improving them noticeably, and this logic is physically shared among all Fill*s.
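To make the ladder / 4×4-byte loop split concrete, here is a rough C sketch. It is not the patch's i386 assembly. The thresholds (32 and 128 bytes) are read off the FillChar breakpoints in the tables below. The overlapping-store layout of the ladder, the overlapping tail store, and the names `store32`, `store4x4` and `fill_small_sketch` are my assumptions for illustration. `memset` stands in for the rep stos path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned 4-byte store; compilers turn this memcpy into a single mov on x86. */
static void store32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Four 4-byte stores, 16 bytes in total. */
static void store4x4(unsigned char *p, uint32_t v)
{
    store32(p, v);
    store32(p + 4, v);
    store32(p + 8, v);
    store32(p + 12, v);
}

/* Hypothetical helper: fill n bytes at dst with byte b. */
static void fill_small_sketch(void *dst, size_t n, unsigned char b)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;
    uint32_t v = b * 0x01010101u;   /* broadcast the byte to all four lanes */

    if (n <= 32) {
        /* Ladder: each rung covers a size range with stores that overlap
           in the middle, so no per-byte loop is needed. */
        if (n >= 16) {
            store4x4(p, v);
            store4x4(end - 16, v);
        } else if (n >= 8) {
            store32(p, v);
            store32(p + 4, v);
            store32(end - 8, v);
            store32(end - 4, v);
        } else if (n >= 4) {
            store32(p, v);
            store32(end - 4, v);
        } else if (n >= 2) {
            p[0] = b;
            p[1] = b;
            end[-2] = b;
            end[-1] = b;
        } else if (n == 1) {
            p[0] = b;
        }
        return;
    }

    if (n < 128) {
        /* 4x4-byte loop: 16 bytes per iteration, then one overlapping
           16-byte store that finishes the tail. */
        while (end - p > 16) {
            store4x4(p, v);
            p += 16;
        }
        store4x4(end - 16, v);
        return;
    }

    memset(dst, b, n);   /* stand-in for the rep stos path */
}
```

The point of the overlapping stores is that a size like 21 costs the same handful of instructions as 32, which matches the flat 2.5 ns/call rows in the “New” column.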
Also, although I do align the rep stosl destination to as much as 16 bytes, following the vague claims in the Intel manual, I don’t see any difference, even with odd addresses. The corresponding branch could be greatly simplified (25→11 LoC) by not dealing with alignment at all, but that would feel wrong, and CPUs that care might exist.
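For readers wondering what “dealing with alignment” costs, a minimal C sketch of one way to do it follows. This is an assumption about the general technique, not a transcription of the patch: write an unaligned 16-byte head and tail, then run the bulk fill from the next 16-byte boundary. `bytes_to_align16` and `fill_aligned_sketch` are hypothetical names, and `memset` again stands in for rep stosl.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed to advance p to the next 16-byte boundary (0 if already aligned). */
static size_t bytes_to_align16(const void *p)
{
    return (size_t)(-(uintptr_t)p & 15u);
}

/* Hypothetical helper: fill n bytes at dst with byte b, aligning the bulk part. */
static void fill_aligned_sketch(void *dst, size_t n, unsigned char b)
{
    unsigned char *p = dst;
    unsigned char *end = p + n;
    unsigned char *body;
    size_t dwords;

    if (n < 32) {
        memset(dst, b, n);   /* small cases are handled elsewhere */
        return;
    }

    /* Unaligned head and tail; both may overlap the aligned body. */
    memset(p, b, 16);
    memset(end - 16, b, 16);

    /* The head store covers the 0..15 bytes skipped here. */
    body = p + bytes_to_align16(p);

    /* Whole dwords only, as rep stosl would take its count in ecx;
       the up to 3 leftover bytes are already covered by the tail store. */
    dwords = (size_t)(end - body) / 4;
    memset(body, b, dwords * 4);   /* stand-in for rep stosl */
}
```

Dropping the alignment step would reduce this to a head/tail fix-up around a plain bulk fill, which is roughly the simplification weighed above.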
Benchmark: FillXxxxBenchmark_i386.pas.
My results.
Ones from “New: loop ↔ rep stos ... >_>
Skylake
New Existing
FillChar(2): 3.3 ns/call 2.5 ns/call
FillChar(8): 2.7 ns/call 4.1 ns/call
FillChar(9): 2.7 ns/call 4.3 ns/call
FillChar(21): 2.5 ns/call 8.0 ns/call
FillChar(22): 2.5 ns/call 8.2 ns/call
FillChar(23): 2.5 ns/call 10 ns/call ← Existing: bytewise ↔ rep stosb breakpoint
FillChar(31): 2.5 ns/call 10 ns/call
FillChar(32): 2.5 ns/call 10 ns/call
FillChar(33): 3.4 ns/call 10 ns/call ← New: ladder ↔ loop breakpoint
FillChar(127): 8.0 ns/call 10 ns/call
FillChar(128): 9.9 ns/call 10 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillChar(129): 10 ns/call 10 ns/call ← IGNORE
FillChar(500): 13 ns/call 14 ns/call ← IGNORE
FillChar(1000): 19 ns/call 16 ns/call ← IGNORE
FillChar(10000): 124 ns/call 80 ns/call ← IGNORE
FillChar(100000): 1651 ns/call 1537 ns/call ← IGNORE
FillWord(2): 3.2 ns/call 22 ns/call
FillWord(8): 3.1 ns/call 22 ns/call
FillWord(9): 2.7 ns/call 14 ns/call
FillWord(15): 2.7 ns/call 14 ns/call
FillWord(16): 2.7 ns/call 22 ns/call
FillWord(17): 3.4 ns/call 15 ns/call ← New: ladder ↔ loop breakpoint
FillWord(63): 8.0 ns/call 14 ns/call
FillWord(64): 10 ns/call 22 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillWord(65): 10 ns/call 14 ns/call ← IGNORE
FillWord(500): 19 ns/call 28 ns/call ← IGNORE
FillWord(1000): 29 ns/call 35 ns/call ← IGNORE
FillWord(10000): 246 ns/call 170 ns/call ← IGNORE
FillWord(100000): 3449 ns/call 2919 ns/call ← IGNORE
FillDWord(2): 2.5 ns/call 8.1 ns/call
FillDWord(7): 2.5 ns/call 7.8 ns/call
FillDWord(8): 2.5 ns/call 7.9 ns/call
FillDWord(9): 3.4 ns/call 7.9 ns/call ← New: ladder ↔ loop breakpoint
FillDWord(20): 5.3 ns/call 8.0 ns/call
FillDWord(31): 8.2 ns/call 7.9 ns/call
FillDWord(32): 10 ns/call 8.0 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillDWord(33): 10 ns/call 8.7 ns/call ← IGNORE
FillDWord(500): 29 ns/call 22 ns/call ← IGNORE
FillDWord(1000): 55 ns/call 39 ns/call ← IGNORE
FillDWord(10000): 636 ns/call 508 ns/call ← IGNORE
FillDWord(100000): 7127 ns/call 6275 ns/call ← IGNORE
Nehalem
New Existing
FillChar(2): 6.1 ns/call 6.0 ns/call
FillChar(8): 7.5 ns/call 13 ns/call
FillChar(9): 7.8 ns/call 13 ns/call
FillChar(21): 6.9 ns/call 28 ns/call
FillChar(22): 6.9 ns/call 29 ns/call
FillChar(23): 6.9 ns/call 21 ns/call ← Existing: bytewise ↔ rep stosb breakpoint
FillChar(31): 6.9 ns/call 21 ns/call
FillChar(32): 6.9 ns/call 21 ns/call
FillChar(33): 10 ns/call 21 ns/call ← New: ladder ↔ loop breakpoint
FillChar(127): 18 ns/call 31 ns/call
FillChar(128): 27 ns/call 32 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillChar(129): 29 ns/call 32 ns/call ← IGNORE
FillChar(500): 48 ns/call 45 ns/call ← IGNORE
FillChar(1000): 61 ns/call 65 ns/call ← IGNORE
FillChar(10000): 319 ns/call 326 ns/call ← IGNORE
FillChar(100000): 5852 ns/call 5598 ns/call ← IGNORE
FillWord(2): 6.0 ns/call 28 ns/call
FillWord(8): 7.3 ns/call 27 ns/call
FillWord(9): 6.9 ns/call 28 ns/call
FillWord(15): 6.9 ns/call 27 ns/call
FillWord(16): 6.9 ns/call 27 ns/call
FillWord(17): 9.4 ns/call 27 ns/call ← New: ladder ↔ loop breakpoint
FillWord(63): 17 ns/call 37 ns/call
FillWord(64): 28 ns/call 38 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillWord(65): 29 ns/call 38 ns/call ← IGNORE
FillWord(500): 62 ns/call 71 ns/call ← IGNORE
FillWord(1000): 86 ns/call 98 ns/call ← IGNORE
FillWord(10000): 618 ns/call 628 ns/call ← IGNORE
FillWord(100000): 11444 ns/call 11444 ns/call ← IGNORE
FillDWord(2): 6.1 ns/call 15 ns/call
FillDWord(7): 6.4 ns/call 15 ns/call
FillDWord(8): 6.4 ns/call 15 ns/call
FillDWord(9): 8.9 ns/call 15 ns/call ← New: ladder ↔ loop breakpoint
FillDWord(20): 11 ns/call 21 ns/call
FillDWord(31): 17 ns/call 26 ns/call
FillDWord(32): 27 ns/call 26 ns/call ← New: loop ↔ rep stos breakpoint, IGNORE
FillDWord(33): 29 ns/call 27 ns/call ← IGNORE
FillDWord(500): 86 ns/call 87 ns/call ← IGNORE
FillDWord(1000): 147 ns/call 145 ns/call ← IGNORE
FillDWord(10000): 1966 ns/call 2049 ns/call ← IGNORE
FillDWord(100000): 23409 ns/call 23364 ns/call ← IGNORE