Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets.

Summary

Incorrect results when using pnmsub and pnmadd functions with Complex float/double packets in SSE/AVX/AVX512.

Environment

Operating System : Linux
Architecture : x64/
Eigen Version : 3.4.9
Compiler Version : GCC 13.3.0
Compile Flags : -march=native
Vector Extension : SSE/SSE2/SSE3/SSE4.1/FMA/AVX/AVX2

Minimal Example

#include <iostream>
#include <complex>
#include <Eigen/Core>
#include <array>

using namespace Eigen::internal;

int main() {
    std::array<std::complex<float>, 4> tmp, tmp2, tmp3;
    for (int i = 0; i < 4; ++i) {
        tmp[i] = {2.0f * i, 2.0f * i + 1};
        tmp2[i] = {9.0f + 2.0f * i, 10.0f + 2.0f * i};
        tmp3[i] = {19.0f + 2.0f * i, 20.0f + 2.0f * i};
    }
    Packet4cf x = ploadu<Packet4cf>(tmp.data());
    Packet4cf y = ploadu<Packet4cf>(tmp2.data());
    Packet4cf z = ploadu<Packet4cf>(tmp3.data());
    Packet4cf pnmadd_result = pnmadd(x, y, z);
    Packet4cf correct_result = psub(z, pmul(x, y));
    std::array<std::complex<float>, 4> pnmadd_array, correct_array;
    pstoreu(pnmadd_array.data(), pnmadd_result);
    pstoreu(correct_array.data(), correct_result);
    for (int i = 0; i < 4; ++i) {
        std::cout << pnmadd_array[i] << "=" << correct_array[i] << std::endl;
    }
    return 0;
}

Steps to reproduce

Set up an enviroment where FMA instructions are available.
Run the minimal example one time and compare the results.

What is the current bug behavior?

In Core/SSE/Complex.h, Core/AVX/Complex.h, Core/AVX512/Complex.h, pnmadd, pnmsub functions returns incorrect results when using it With Complex Packets. In the minimal example, I just showcased Packet4cf and pnmadd function but the bug exists for pnmadd and pnmsub functions for complex types.

What is the expected correct behavior?

pnmadd_array and correct_array should be equal(ignoring the accuracy difference when using fma instruction).

Relevant logs

(29,29)=(29,11) (35,79)=(35,-35) (41,145)=(41,-97) (47,227)=(47,-175)

Anything else that might help

This is the current implementation of pnmadd in Eigen.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmaddsub_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

One of the correct implementation is below with a small change.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmadd(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 b_swap_conj = pconj(b_swap);
  __m256 result = _mm256_fmsub_ps(a_odd, b_swap_conj, _mm256_fmsub_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

And for pnmsubb. Current implementation is below:

template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 result = _mm256_fmaddsub_ps(a_odd, b_swap, _mm256_fmsubadd_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

One of the correct implementation is below with a small change.

template <>
EIGEN_STRONG_INLINE Packet4cf pnmsub(const Packet4cf& a, const Packet4cf& b, const Packet4cf& c) {
  __m256 a_odd = _mm256_movehdup_ps(a.v);
  __m256 a_even = _mm256_moveldup_ps(a.v);
  __m256 b_swap = _mm256_permute_ps(b.v, _MM_SHUFFLE(2, 3, 0, 1));
  __m256 b_swap_conj = pconj(b_swap);
  __m256 result = _mm256_fmsub_ps(a_odd, b_swap, _mm256_fmadd_ps(a_even, b.v, c.v));
  return Packet4cf(result);
}

As I said before the bug exists for Complex float,double types in architectures SSE/AVX/AVX512.

Edited Mar 10, 2025 by emrys53