Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets.
Summary
Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets in SSE/AVX/AVX512.
Environment
- Operating System : Linux
- Architecture : x64/
- Eigen Version : 3.4.9
- Compiler Version : GCC 13.3.0
- Compile Flags : -march=native
- Vector Extension : SSE/SSE2/SSE3/SSE4.1/FMA/AVX/AVX2
Minimal Example
#include <iostream>
#include <complex>
#include <Eigen/Core>
#include <array>
using namespace Eigen::internal;
int main() {
std::array<std::complex<float>, 4> tmp, tmp2, tmp3;
for (int i = 0; i < 4; ++i) {
tmp[i] = {2.0f * i, 2.0f * i + 1};
tmp2[i] = {9.0f + 2.0f * i, 10.0f + 2.0f * i};
tmp3[i] = {19.0f + 2.0f * i, 20.0f + 2.0f * i};
}
Packet4cf x = ploadu<Packet4cf>(tmp.data());
Packet4cf y = ploadu<Packet4cf>(tmp2.data());
Packet4cf z = ploadu<Packet4cf>(tmp3.data());
Packet4cf pnmadd_result = pnmadd(x, y, z);
Packet4cf correct_result = psub(z, pmul(x, y));
std::array<std::complex<float>, 4> pnmadd_array, correct_array;
pstoreu(pnmadd_array.data(), pnmadd_result);
pstoreu(correct_array.data(), correct_result);
for (int i = 0; i < 4; ++i) {
std::cout << pnmadd_array[i] << "=" << correct_array[i] << std::endl;
}
return 0;
}
Steps to reproduce
- Set up an enviroment where FMA instructions are available.
- Run the minimal example one time and compare the results.
What is the current bug behavior?
In Core/SSE/Complex.h, Core/AVX/Complex.h, Core/AVX512/Complex.h, pnmadd, pnmsub functions returns incorrect results when using it With Complex Packets. In the minimal example, I just showcased Packet4cf and pnmadd function but the bug exists for pnmadd and pnmsub functions for complex types.
What is the expected correct behavior?
pnmadd_array and correct_array should be equal(ignoring the accuracy difference when using fma instruction).
Relevant logs
(29,29)=(29,11) (35,79)=(35,-35) (41,145)=(41,-97) (47,227)=(47,-175)
Anything else that might help
This is the current implementation of pnmadd in Eigen.
template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
__m256 a_odd = _mm256_movehdup_ps(a.v);
__m256 a_even = _mm256_moveldup_ps(a.v);
__m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
__m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmaddsub_ps(a_even, b.v, c.v));
return Packet4cf(result);
}
One of the correct implementation is below with a small change.
template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
__m256 a_odd = _mm256_movehdup_ps(a.v);
__m256 a_even = _mm256_moveldup_ps(a.v);
__m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
__m256 b_swap_conj = pconj(b_swap);
__m256 result = _mm256_fmsub_ps(a_odd, b_swap_conj, _mm256_fmsub_ps(a_even, b.v, c.v));
return Packet4cf(result);
}
And for pnmsubb. Current implementation is below:
template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
__m256 a_odd = _mm256_movehdup_ps(a.v);
__m256 a_even = _mm256_moveldup_ps(a.v);
__m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
__m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmsubadd_ps(a_even, b.v, c.v));
return Packet4cf(result);
}
One of the correct implementation is below with a small change.
template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
__m256 a_odd = _mm256_movehdup_ps(a.v);
__m256 a_even = _mm256_moveldup_ps(a.v);
__m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
__m256 b_swap_conj = pconj(b_swap);
__m256 result = _mm256_fmsub_ps(a_odd, b_swap, _mm256_fmadd_ps(a_even, b.v, c.v));
return Packet4cf(result);
}
As I said before the bug exists for Complex float,double types in architectures SSE/AVX/AVX512.