Use XMMs in x64/SSE2 FillChar.
I considered doing what this person was talking about, but the other Fill* routines are both more complex and much less useful. (I often want to fill memory with a pattern, such as broadcasting the first array element, but the element usually has an arbitrary size.) So let’s start with something broadly applicable.
My proposal is to use unaligned overlapping (“postmodern”) SSE writes in x64 FillChar.
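As a rough illustration of what “overlapping” means here, the sketch below fills any size from 16 to 32 bytes with exactly two 16-byte stores, the second anchored at the end of the buffer so that it overlaps the first. It is written in C rather than the x64 assembly of the actual patch. The names `store16u` and `fill_16_to_32` are hypothetical, and a fixed-size `memcpy` stands in for a single MOVDQU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for one unaligned 16-byte store (MOVDQU in the real routine).
   Hypothetical helper, not taken from the patch. */
static void store16u(unsigned char *dst, const unsigned char pattern[16])
{
    memcpy(dst, pattern, 16);
}

/* Hypothetical helper: fill n bytes, 16 <= n <= 32, with two stores
   and no branching on the exact size. */
static void fill_16_to_32(unsigned char *dst, size_t n, unsigned char value)
{
    unsigned char pattern[16];

    assert(n >= 16 && n <= 32);
    memset(pattern, value, sizeof pattern);   /* broadcast the byte */

    store16u(dst, pattern);                   /* bytes [0, 16) */
    store16u(dst + n - 16, pattern);          /* bytes [n - 16, n), overlaps the first */
}
```

Writing a few bytes twice is harmless for a fill, and it replaces the size-dependent tail handling with one extra store.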
There is a drawback: unaligned stores that cross a page boundary are greatly slowed down (they say 32-byte AVX stores spanning two pages consume an extra 150 cycles; I don’t know about 16-byte SSE stores, but I see the same slowdown as with storing 8-byte GPRs). That’s why I’ve made an effort to perform only two MOVDQUs, one at the very beginning and one at the very end. By comparison, glibc versions don’t bother and perform up to 4+4 MOVDQUs for head+tail. The existing FillChar performs only aligned writes and so never has this problem, but that means sacrificing the common case.
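A minimal C sketch of the “two unaligned stores, everything else aligned” layout follows. It is an illustration under stated assumptions, not the assembly from the patch: `fill_head_body_tail` and `store_crosses_page` are hypothetical names, the 4096-byte page size is assumed, and each fixed-size `memcpy` stands in for one 16-byte store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC = 16, PAGE = 4096 };   /* PAGE = 4096 is an assumption */

/* True if a store of `width` bytes at p would span two pages. */
static int store_crosses_page(const void *p, size_t width)
{
    return ((uintptr_t)p & (PAGE - 1)) + width > PAGE;
}

/* Hypothetical helper: fill n >= 16 bytes with one unaligned head store,
   a run of 16-byte-aligned stores, and one unaligned tail store. */
static void fill_head_body_tail(unsigned char *dst, size_t n, unsigned char value)
{
    unsigned char pattern[VEC];
    unsigned char *end = dst + n;
    unsigned char *p;

    assert(n >= VEC);
    memset(pattern, value, VEC);

    memcpy(dst, pattern, VEC);                 /* unaligned head: [dst, dst + 16) */

    /* First 16-byte-aligned address strictly above dst; always <= dst + 16,
       so the head store already covers everything below it. */
    p = dst + (VEC - ((uintptr_t)dst & (VEC - 1)));

    while ((size_t)(end - p) >= VEC) {         /* aligned body */
        memcpy(p, pattern, VEC);
        p += VEC;
    }

    memcpy(end - VEC, pattern, VEC);           /* unaligned tail: [end - 16, end) */
}
```

Because a 16-byte-aligned store can never straddle a 4096-byte boundary, only the head and the tail store can pay the cross-page penalty in this layout, which is the motivation for keeping the unaligned stores down to two.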
Benchmark: FillCharBenchmark.pas.
My results 💧:
                     Existing        New             New (cross-page)
FillChar(2):         3.6 ns/call     2.6 ns/call     2.6 ns/call
FillChar(6):         4.6 ns/call     2.9 ns/call     13 ns/call
FillChar(15):        4.9 ns/call     2.8 ns/call     7.2 ns/call
FillChar(16):        3.6 ns/call     2.8 ns/call     13 ns/call
FillChar(17):        3.4 ns/call     2.8 ns/call     13 ns/call
FillChar(50):        5.5 ns/call     2.8 ns/call     7.8 ns/call
FillChar(100):       7.4 ns/call     3.1 ns/call     8.3 ns/call
FillChar(500):       17 ns/call      9.8 ns/call     15 ns/call
FillChar(1000):      32 ns/call      17 ns/call      24 ns/call
FillChar(10000):     316 ns/call     157 ns/call     164 ns/call
FillChar(100000):    3500 ns/call    2070 ns/call    2008 ns/call