Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets.

Summary

Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets in SSE/AVX/AVX512.

Environment

  • Operating System : Linux
  • Architecture : x64/
  • Eigen Version : 3.4.9
  • Compiler Version : GCC 13.3.0
  • Compile Flags : -march=native
  • Vector Extension : SSE/SSE2/SSE3/SSE4.1/FMA/AVX/AVX2

Minimal Example

#include <iostream>
#include <complex>
#include <Eigen/Core>
#include <array>

using namespace Eigen::internal;

int main() {
    std::array<std::complex<float>, 4> tmp, tmp2, tmp3;
    for (int i = 0; i < 4; ++i) {
        tmp[i] = {2.0f * i, 2.0f * i + 1};
        tmp2[i] = {9.0f + 2.0f * i, 10.0f + 2.0f * i};
        tmp3[i] = {19.0f + 2.0f * i, 20.0f + 2.0f * i};
    }
    Packet4cf x = ploadu<Packet4cf>(tmp.data());
    Packet4cf y = ploadu<Packet4cf>(tmp2.data());
    Packet4cf z = ploadu<Packet4cf>(tmp3.data());
    Packet4cf pnmadd_result = pnmadd(x, y, z);
    Packet4cf correct_result = psub(z, pmul(x, y));
    std::array<std::complex<float>, 4> pnmadd_array, correct_array;
    pstoreu(pnmadd_array.data(), pnmadd_result);
    pstoreu(correct_array.data(), correct_result);
    for (int i = 0; i < 4; ++i) {
        std::cout << pnmadd_array[i] << "=" << correct_array[i] << std::endl;
    }
    return 0;
}

Steps to reproduce

  1. Set up an enviroment where FMA instructions are available.
  2. Run the minimal example one time and compare the results.

What is the current bug behavior?

In Core/SSE/Complex.h, Core/AVX/Complex.h, Core/AVX512/Complex.h, pnmadd, pnmsub functions returns incorrect results when using it With Complex Packets. In the minimal example, I just showcased Packet4cf and pnmadd function but the bug exists for pnmadd and pnmsub functions for complex types.

What is the expected correct behavior?

pnmadd_array and correct_array should be equal(ignoring the accuracy difference when using fma instruction).

Relevant logs

(29,29)=(29,11) (35,79)=(35,-35) (41,145)=(41,-97) (47,227)=(47,-175)

Anything else that might help

This is the current implementation of pnmadd in Eigen.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmaddsub_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

One of the correct implementation is below with a small change.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 b_swap_conj = pconj(b_swap);
  __m256 result = _mm256_fmsub_ps(a_odd, b_swap_conj, _mm256_fmsub_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

And for pnmsubb. Current implementation is below:

template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmsubadd_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

One of the correct implementation is below with a small change.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 b_swap_conj = pconj(b_swap);
  __m256 result = _mm256_fmsub_ps(a_odd, b_swap, _mm256_fmadd_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

As I said before the bug exists for Complex float,double types in architectures SSE/AVX/AVX512.

Edited by emrys53