AVX2 CompareByte for i386, sharing branches with the SSE2 version.
This might even be a tiny bit less of a joke than !391: a CPU dispatcher already existed, so the only extra cost is about 300 bytes of code for CompareByte_AVX2 plus 80 for AVX2Support in each application. (I also shortened the SSE2 version by about 50 bytes.)
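For readers unfamiliar with the dispatcher pattern mentioned above, here is a minimal sketch of how a one-time CPU dispatch through a procedural variable can work. It is not the RTL code: `CompareByte_Narrow`, `CompareByte_Wide`, `HasWideVectors`, `CompareByte_Impl` and `CompareByte_Dispatch` are hypothetical names, both implementations are plain Pascal stand-ins for the SSE2 and AVX2 routines, and the feature check is hard-coded instead of querying the CPU.

```pascal
program DispatchSketch;
{$mode objfpc}

type
  TCompareByteFunc = function(const buf1, buf2; len: SizeInt): SizeInt;

{ Hypothetical stand-in for the SSE2 routine: a plain byte loop. }
function CompareByte_Narrow(const buf1, buf2; len: SizeInt): SizeInt;
var
  p1, p2: PByte;
  i: SizeInt;
begin
  p1 := @buf1;
  p2 := @buf2;
  for i := 0 to len - 1 do
    if p1[i] <> p2[i] then
      Exit(SizeInt(p1[i]) - SizeInt(p2[i]));
  Result := 0;
end;

{ Hypothetical stand-in for the AVX2 routine; it only forwards here. }
function CompareByte_Wide(const buf1, buf2; len: SizeInt): SizeInt;
begin
  Result := CompareByte_Narrow(buf1, buf2, len);
end;

{ Assumption: a real check would inspect CPUID; this sketch hard-codes False. }
function HasWideVectors: Boolean;
begin
  Result := False;
end;

function CompareByte_Dispatch(const buf1, buf2; len: SizeInt): SizeInt; forward;

var
  { Starts out pointing at the dispatcher, so the first call selects an implementation. }
  CompareByte_Impl: TCompareByteFunc = @CompareByte_Dispatch;

function CompareByte_Dispatch(const buf1, buf2; len: SizeInt): SizeInt;
begin
  if HasWideVectors then
    CompareByte_Impl := @CompareByte_Wide
  else
    CompareByte_Impl := @CompareByte_Narrow;
  { Later calls go straight to the selected routine. }
  Result := CompareByte_Impl(buf1, buf2, len);
end;

var
  a: array[0..7] of Byte = (1, 2, 3, 4, 5, 6, 7, 8);
  b: array[0..7] of Byte = (1, 2, 3, 4, 5, 6, 9, 8);
begin
  WriteLn(CompareByte_Impl(a, b, Length(a))); { buffers differ at index 6 }
  WriteLn(CompareByte_Impl(a, a, Length(a))); { identical buffers }
end.
```

With a scheme like this, adding another implementation costs its own code size plus the feature check, and the call path stays a single indirect call after the first use.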
Benchmark: CompareByteI386AVX2Benchmark.pas.
My results:
                               AVX2           SSE2
CompareByte(#0 / 1):           2.0 ns/call    1.8 ns/call
CompareByte(#6 / 7):           2.7 ns/call    2.3 ns/call
CompareByte(#19 / 20):         2.7 ns/call    2.6 ns/call
CompareByte(#39 / 40):         2.9 ns/call    3.3 ns/call
CompareByte(#1 / 100):         2.4 ns/call    2.1 ns/call
CompareByte(#50 / 100):        2.7 ns/call    3.7 ns/call
CompareByte(#99 / 100):        3.5 ns/call    4.9 ns/call
CompareByte(#100 / 200):       3.7 ns/call    4.9 ns/call
CompareByte(#199 / 200):       5.2 ns/call    7.5 ns/call
CompareByte(#999 / 1000):      15 ns/call     27 ns/call
CompareByte(#5000 / 10000):    109 ns/call    138 ns/call
CompareByte(#9999 / 10000):    208 ns/call    264 ns/call
Edited by Rika