*sigh* AVX2 Index* and Compare* for x64.

Rika requested to merge runewalsh/source:ic-x64-avx2 into main

Your ~~soldiers~~ programmers may find the experience of ~~being fired towards a combat zone at nearly a thousand miles an hour~~ heating up their CPUs with AVX2 a little discomfiting, but I’m afraid they will have to get used to it. I see little scope for further innovation where ~~dropships~~ x86-64 Index* / Compare* are concerned; the ~~Valkyrie~~ AVX2 will be the final design we produce.

Half a joke MR because:

  • Unlike !390 (merged) and !369 (merged), I don’t see any speedup in my CompareByte scenarios. .__. Of course I dreamed of another 20% for the whole application, but in retrospect, zero effect is understandable: the theoretical speedup over SSE2 is only 2×, and only on large inputs, while !369 (merged) could be an order of magnitude on small ones, which are the most common. You might have examples friendlier to AVX2 (like this comparison of always-equal 128-byte records), but even in the compiler, small cases must dominate, and I’m not sure you can’t run into even worse ones just as well.

  • AVX2 is only 10 years old, versus 20 for SSE2. Granted, the SSE2 implementations in x86_64.inc date back to 2011, which was likewise 10 years after its introduction, but even then, SSE2 support was always mandated by x86-64 itself, while you can still easily encounter a CPU without AVX2.

  • Making up more reasons why AVX2 might be slower by itself, even without an indirection: vzeroupper can be non-trivial on certain CPUs, notably older or server ones like Xeon Phi (this could be tuned with stricter conditions for choosing the AVX2 versions than just “CPU supports AVX2”, but I suspect that simply ignoring this concern is completely viable), and wider registers reduce the odds that an over-read stays within readable memory from (1 − 15 / 4096)² ≈ 99.3% to a mere (1 − 31 / 4096)² ≈ 98.5%, while doubling the potential non-over-readable tail length. GCC’s loop vectorizer sometimes even handles the remainder after the ymm loop with one more xmm-wide step before switching to the one-by-one loop... adding an awful lot of code.

On the positive side, AVX2 versions can often be obtained from existing SSE2 ones by changing xmm registers to ymm, adjusting the related constants, and adding vzeroupper, and that’s what I did, in particular with the existing masterwork IndexByte and IndexWord.

Benchmark: IndexCompareBenchmarkX64.pas.

My results.
                             SSE2              AVX2
IndexByte(#0 .. #1):         3.2 ns/call       2.7 ns/call
IndexByte(#0 .. #14):        3.0 ns/call       2.5 ns/call
IndexByte(#0 .. #15):        3.0 ns/call       2.5 ns/call
IndexByte(#0 .. #16):        3.0 ns/call       2.5 ns/call
IndexByte(#0 .. #49):        3.5 ns/call       2.5 ns/call
IndexByte(#0 .. #99):        4.4 ns/call       3.1 ns/call
IndexByte(#373 .. #626):     24 ns/call        14 ns/call

IndexWord(#0 .. #1):         3.8 ns/call       3.0 ns/call
IndexWord(#0 .. #14):        3.8 ns/call       2.9 ns/call
IndexWord(#0 .. #15):        3.8 ns/call       2.9 ns/call
IndexWord(#0 .. #16):        3.8 ns/call       3.0 ns/call
IndexWord(#0 .. #49):        4.6 ns/call       3.9 ns/call
IndexWord(#0 .. #99):        6.2 ns/call       4.8 ns/call
IndexWord(#0 .. #999):       40 ns/call        27 ns/call

IndexDWord(#0 .. #1):        3.1 ns/call       2.6 ns/call
IndexDWord(#0 .. #14):       2.7 ns/call       2.7 ns/call
IndexDWord(#0 .. #15):       2.7 ns/call       2.7 ns/call
IndexDWord(#0 .. #16):       2.8 ns/call       2.6 ns/call
IndexDWord(#0 .. #49):       4.9 ns/call       3.4 ns/call
IndexDWord(#0 .. #99):       8.3 ns/call       4.5 ns/call
IndexDWord(#0 .. #999):      69 ns/call        35 ns/call

                             (plain)
IndexQWord(#0 .. #1):        2.5 ns/call       3.0 ns/call
IndexQWord(#0 .. #14):       4.5 ns/call       2.7 ns/call
IndexQWord(#0 .. #15):       5.2 ns/call       2.8 ns/call
IndexQWord(#0 .. #16):       4.9 ns/call       2.9 ns/call
IndexQWord(#0 .. #49):       12 ns/call        4.1 ns/call
IndexQWord(#0 .. #99):       26 ns/call        6.3 ns/call
IndexQWord(#0 .. #999):      139 ns/call       57 ns/call

                             (SSE2)
CompareByte(#0 / 1):         1.8 ns/call       2.4 ns/call
CompareByte(#6 / 7):         2.0 ns/call       2.4 ns/call
CompareByte(#14 / 15):       2.0 ns/call       2.4 ns/call
CompareByte(#30 / 31):       2.4 ns/call       2.4 ns/call
CompareByte(#1 / 100):       2.1 ns/call       2.3 ns/call
CompareByte(#99 / 100):      5.5 ns/call       4.1 ns/call
CompareByte(#199 / 200):     9.3 ns/call       5.9 ns/call
CompareByte(#999 / 1000):    37 ns/call        20 ns/call
CompareByte(#9999 / 10000):  359 ns/call       222 ns/call

CompareWord(#0 / 1):         2.1 ns/call       2.1 ns/call
CompareWord(#6 / 7):         2.5 ns/call       3.7 ns/call
CompareWord(#14 / 15):       3.1 ns/call       3.8 ns/call
CompareWord(#30 / 31):       4.2 ns/call       4.0 ns/call
CompareWord(#1 / 100):       2.3 ns/call       2.7 ns/call
CompareWord(#99 / 100):      4.0 ns/call       3.3 ns/call
CompareWord(#199 / 200):     16 ns/call        10 ns/call
CompareWord(#999 / 1000):    79 ns/call        35 ns/call
CompareWord(#9999 / 10000):  328 ns/call       173 ns/call

CompareDWord(#0 / 1):        2.3 ns/call       2.2 ns/call
CompareDWord(#6 / 7):        2.9 ns/call       3.1 ns/call
CompareDWord(#14 / 15):      3.2 ns/call       2.8 ns/call
CompareDWord(#30 / 31):      4.9 ns/call       3.6 ns/call
CompareDWord(#1 / 100):      2.2 ns/call       2.3 ns/call
CompareDWord(#99 / 100):     2.8 ns/call       2.6 ns/call
CompareDWord(#199 / 200):    18 ns/call        9.8 ns/call
CompareDWord(#999 / 1000):   150 ns/call       78 ns/call
CompareDWord(#9999 / 10000): 595 ns/call       329 ns/call

C++ sources.

The rest was made from existing SSE2 code with the transform described above.

#include <immintrin.h>
#include <cstdint>
using namespace std;

const uintptr_t YMM_MASK = 31;
const uintptr_t PAGE_SIZE = 4096;

__attribute__((__ms_abi__))
ptrdiff_t index_dword_aligned(uint32_t* buf, ptrdiff_t len, uint32_t v)
{
	if (len == 0) return -1;
	// Round the pointer down to a 32-byte boundary: the full ymm load then
	// cannot cross a page boundary, so it is safe even though it starts
	// before buf.
	uint32_t *aligned_buf = (uint32_t*)((uintptr_t)buf & ~YMM_MASK);
	__m256i ref = _mm256_set1_epi32(v);
	__m256i sample = *(__m256i*)aligned_buf;
	// Bytes from buf to the end of the aligned 32-byte block, i.e. 32 minus
	// buf's misalignment.
	ptrdiff_t buf_pos = (uintptr_t)aligned_buf + 32 - (uintptr_t)buf;

	// Shift the 32-bit match mask left so the bits for bytes before buf drop
	// below bit 32, clear those low 32 bits, then shift back.
	int cmp_mask = (uint64_t)_mm256_movemask_epi8(_mm256_cmpeq_epi32(ref, sample)) << buf_pos >> 32 << 32 >> buf_pos;
	// buf_pos / 4 − 8 is the (zero or negative) dword index of the aligned
	// block's start relative to buf.
	buf_pos = (uintptr_t)buf_pos / 4;
	buf_pos -= 8;
	goto test_mask;

	while ((uintptr_t)buf_pos < (uintptr_t)len)
	{
		sample = *(__m256i*)(buf + buf_pos);
		cmp_mask = _mm256_movemask_epi8(_mm256_cmpeq_epi32(ref, sample));
	test_mask:
		if (cmp_mask != 0)
		{
			buf_pos += (ptrdiff_t)((uint32_t)__builtin_ctz(cmp_mask) / 4);
			if ((uintptr_t)buf_pos >= (uintptr_t)len) return -1;
			return buf_pos;
		}
		buf_pos += 8;
	}

	return -1;
}

__attribute__((__ms_abi__))
ptrdiff_t index_dword_unaligned(uint32_t* buf, ptrdiff_t len, uint32_t v)
{
	uint32_t *bufp = buf, *bufe = bufp + len;
	// Negative len (unbounded search) or len so large that len*4 would
	// overflow a signed byte count: enter the scalar loop past its
	// bufp != bufe test, so it runs until a match is found.
	if ((uintptr_t)len >> 61 != 0)
	{
		bufe = bufp;
		goto dwordwise_body;
	}
	if (len >= 4)
	{
		uint32_t *bufepart = bufp + (len & -ptrdiff_t{8});
		__m256i ref = _mm256_set1_epi32(v);
		while (bufepart != bufp)
		{
			__m256i sample = _mm256_loadu_si256((__m256i*)bufp);
			int cmp_mask = _mm256_movemask_epi8(_mm256_cmpeq_epi32(sample, ref));
			if (cmp_mask != 0)
				return ((uintptr_t)bufp - (uintptr_t)buf + __builtin_ctz(cmp_mask)) / 4;
			bufp += 8;
		}
		// Tail: one unaligned full-width load is allowed if it cannot cross
		// into the next page; matches past bufe are filtered out below.
		if (bufp != bufe && ((uintptr_t)bufp ^ ((uintptr_t)bufp + YMM_MASK)) < PAGE_SIZE)
		{
			__m256i sample = _mm256_loadu_si256((__m256i*)bufp);
			int cmp_mask = _mm256_movemask_epi8(_mm256_cmpeq_epi32(sample, ref));
			if (cmp_mask == 0) return -1;
			bufp = (uint32_t*)((uintptr_t)bufp + __builtin_ctz(cmp_mask));
			if (bufp >= bufe) return -1;
			return ((uintptr_t)bufp - (uintptr_t)buf) / 4;
		}
	}
	while (bufp != bufe)
	{ dwordwise_body:
		if (*bufp == v) return ((uintptr_t)bufp - (uintptr_t)buf) / 4;
		bufp++;
	}
	return -1;
}

__attribute__((__ms_abi__))
ptrdiff_t index_qword(uint64_t* buf, ptrdiff_t len, uint64_t v)
{
	uint64_t *bufp = buf, *bufe = bufp + len;
	// Same unbounded-search entry as in index_dword_unaligned (>> 60
	// because qwords are 8 bytes).
	if ((uintptr_t)len >> 60 != 0)
	{
		bufe = bufp;
		goto qwordwise_body;
	}
	while (bufp != bufe)
	{ qwordwise_body:
		if (*bufp == v) return ((uintptr_t)bufp - (uintptr_t)buf) / 8;
		bufp++;
	}
	return -1;
}
Edited by Rika