Remove / disable REP SCAS-based IndexWord and IndexDWord implementations on i386.
In general, REP operations are slow. REP MOVSB and REP STOSB did get explicit attention, which resulted in the ERMSB (“Enhanced REP MOVSB and STOSB”) feature starting from Ivy Bridge (and even that only brings them on par with good handwritten implementations), but that’s pretty much all; some other operations were improved only slightly, and only in the recent Golden Cove. They are viable as inline operations in cold places, but not in dedicated functions like Index*.
To get an idea of how slow they are, try this benchmark on i386: IndexDWord.pas. It multiplies some 4×4 matrices, and then searches for their last (#15) elements with IndexDWord. In my results, scanning 16 uint32s with System.IndexDWord takes as long as the 64 FP multiplications (plus 48 additions) of Matrix4_x_Matrix4, and I never managed to get it to scan 1000+ uint32s faster than the ordinary loop:
Matrix4_x_Matrix4: 22 ns/call ←15 with -CfAVX
System.IndexDWord(#15): 23 ns/call (*1)
Generic IndexDWord(#15): 7.0 ns/call
System.IndexDWord(#1007): 510 ns/call
Generic IndexDWord(#1007): 509 ns/call (*2)
(*1) REP SCAS is especially inefficient for short arrays because of the large startup cost.
(*2) On my computer, the generic loop is 2× faster if it is aligned to 16 bytes (switching between -O3 and -O4 can indirectly affect this in both directions by moving code around); otherwise the two are equal. I’m reporting the equal result here for fairness.
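For reference, the “ordinary loop” that the REP SCAS version is measured against can be as simple as the sketch below. This is an illustration, not the RTL’s actual generic code: PlainIndexDWord is a hypothetical name, and the program only checks that it returns the same index as System.IndexDWord; it does not measure anything.

```pascal
program IndexDWordSketch;
{$mode objfpc}

{ Hypothetical stand-in for a plain-loop IndexDWord (name is an assumption).
  Returns the index of the first element equal to Value, or -1 if none. }
function PlainIndexDWord(const Buf; Len: SizeInt; Value: DWord): SizeInt;
var
  P: PDWord;
  I: SizeInt;
begin
  P := @Buf;
  for I := 0 to Len - 1 do
    if P[I] = Value then
      Exit(I);
  Result := -1;
end;

var
  Data: array[0..15] of DWord;
  I: SizeInt;
begin
  { Fill with 0, 10, 20, ..., 150 so the searched value sits in element #15,
    like the last matrix element in the benchmark. }
  for I := 0 to High(Data) do
    Data[I] := DWord(I) * 10;

  WriteLn(PlainIndexDWord(Data, Length(Data), 150));   { prints 15 }
  WriteLn(PlainIndexDWord(Data, Length(Data), 7));     { prints -1 }
  WriteLn(System.IndexDWord(Data, Length(Data), 150)); { prints 15 }
end.
```

A loop like this has essentially no startup cost, which is consistent with the short-array numbers in (*1).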
See also this comment and the discussion below it.
Hence, unless you have other numbers (or better implementations), I propose to remove this and this.