Use XMMs in x64/SSE2 FillChar. (!400) · Merge requests · FPC / FPC / FPC Source

Rika requested to merge runewalsh/source:fillchar-x64 into main Apr 13, 2023

I thought about whether I could do what this person was talking about, but the other Fill*s are both more complex and a lot less useful. (I often want to fill memory with a pattern, such as broadcasting the first array element, but the pattern usually has an arbitrary size.) So let’s start with something broadly applicable.

My proposal is to use unaligned overlapping (“postmodern”) SSE writes in x64 FillChar.
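To illustrate what "unaligned overlapping" stores mean here, below is a minimal C sketch, not the code from this merge request (which is x86-64 assembly inside FPC's RTL). It assumes a hypothetical helper `fill_16_to_32` that handles only lengths from 16 to 32 bytes: one 16-byte store covers the start, another covers the end, and the two are allowed to overlap in the middle. The `vec16`, `splat` and `store_u` names are invented for the example, and the non-SSE2 branch is only a stand-in so the sketch builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec16;
/* Broadcast one byte to all 16 lanes of an XMM register. */
static vec16 splat(unsigned char b) { return _mm_set1_epi8((char)b); }
/* Unaligned 16-byte store (MOVDQU). */
static void store_u(unsigned char *p, vec16 v) { _mm_storeu_si128((__m128i *)p, v); }
#else
/* Portable stand-in so the sketch compiles without SSE2. */
typedef struct { unsigned char b[16]; } vec16;
static vec16 splat(unsigned char b) { vec16 v; memset(v.b, b, 16); return v; }
static void store_u(unsigned char *p, vec16 v) { memcpy(p, v.b, 16); }
#endif

/* Hypothetical helper: fill n bytes, 16 <= n <= 32, with exactly two stores. */
static void fill_16_to_32(unsigned char *dst, size_t n, unsigned char value)
{
    assert(n >= 16 && n <= 32);
    vec16 v = splat(value);
    store_u(dst, v);           /* bytes [0, 16) */
    store_u(dst + n - 16, v);  /* bytes [n - 16, n), may overlap the first store */
}
```

The point of the overlap is that no length-dependent branching or byte-by-byte tail loop is needed within this size range: rewriting a few bytes twice is cheaper than deciding exactly which bytes remain.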

There is a drawback: unaligned stores that cross a page boundary are slowed down considerably. (Reportedly, 32-byte AVX stores spanning two pages cost an extra 150 cycles; I don’t know the figure for 16-byte SSE stores, but I see the same slowdown as when storing 8-byte GPRs.) That’s why I’ve made an effort to perform only two MOVDQUs, one at the very beginning and one at the very end. By contrast, the glibc versions don’t bother and perform up to 4+4 MOVDQUs for head+tail. The existing FillChar performs only aligned writes and so never had this problem, but that sacrifices the common case.
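The "only two MOVDQUs" strategy can be sketched roughly as follows. This is an illustrative C version under stated assumptions, not the assembly from this merge request: `fill_bytes` and its helpers are hypothetical names, lengths below 16 are delegated to `memset` because the source does not describe how small sizes are handled, and the non-SSE2 branch is only a stand-in for other targets. The head and tail use one unaligned store each; everything in between uses aligned stores, which by construction cannot cross a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec16;
static vec16 splat(unsigned char b) { return _mm_set1_epi8((char)b); }
/* Unaligned store (MOVDQU) and aligned store (MOVDQA). */
static void store_u(unsigned char *p, vec16 v) { _mm_storeu_si128((__m128i *)p, v); }
static void store_a(unsigned char *p, vec16 v) { _mm_store_si128((__m128i *)p, v); }
#else
/* Portable stand-in so the sketch compiles without SSE2. */
typedef struct { unsigned char b[16]; } vec16;
static vec16 splat(unsigned char b) { vec16 v; memset(v.b, b, 16); return v; }
static void store_u(unsigned char *p, vec16 v) { memcpy(p, v.b, 16); }
static void store_a(unsigned char *p, vec16 v) { memcpy(p, v.b, 16); }
#endif

/* Hypothetical fill: one unaligned head store, aligned body, one unaligned tail store. */
static void fill_bytes(unsigned char *dst, size_t n, unsigned char value)
{
    if (n < 16) {              /* small sizes are out of scope for this sketch */
        memset(dst, value, n);
        return;
    }
    vec16 v = splat(value);
    unsigned char *end = dst + n;

    store_u(dst, v);           /* unaligned head: bytes [dst, dst + 16) */

    /* First 16-byte-aligned address strictly after dst (1..16 bytes ahead). */
    unsigned char *p = dst + (16 - ((uintptr_t)dst & 15));
    assert(((uintptr_t)p & 15) == 0);

    /* Aligned body; stop once 16 or fewer bytes remain, the tail covers them. */
    while ((size_t)(end - p) > 16) {
        store_a(p, v);
        p += 16;
    }

    store_u(end - 16, v);      /* unaligned tail: bytes [end - 16, end) */
}
```

With this shape, at most two stores per call can straddle a page boundary regardless of length, which matches the trade-off described above: the head+tail cost stays bounded while the common, non-crossing case avoids any alignment prologue.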

Benchmark: FillCharBenchmark.pas.

My results 💧.

                              Existing            New          New (cross-page)

FillChar(2):                 3.6 ns/call       2.6 ns/call        2.6 ns/call
FillChar(6):                 4.6 ns/call       2.9 ns/call         13 ns/call
FillChar(15):                4.9 ns/call       2.8 ns/call        7.2 ns/call
FillChar(16):                3.6 ns/call       2.8 ns/call         13 ns/call
FillChar(17):                3.4 ns/call       2.8 ns/call         13 ns/call
FillChar(50):                5.5 ns/call       2.8 ns/call        7.8 ns/call
FillChar(100):               7.4 ns/call       3.1 ns/call        8.3 ns/call
FillChar(500):                17 ns/call       9.8 ns/call         15 ns/call
FillChar(1000):               32 ns/call        17 ns/call         24 ns/call
FillChar(10000):             316 ns/call       157 ns/call        164 ns/call
FillChar(100000):           3500 ns/call      2070 ns/call       2008 ns/call

