
Add CDNA II optimized float3 implementation

Bálint Soproni requested to merge bs_ffast_math_no_slp_fast_float3 into main

Background

We found that SLP sometimes produces suboptimal instructions on gfx90a. A workaround is to disable SLP with -fno-slp-vectorize.

However, float2 and float4 compile to packed math even without SLP, because they are built on LLVM's vector extensions.
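For reference, the mechanism looks roughly like the sketch below (a simplified illustration using clang's ext_vector_type, not HIP's actual headers; the names are mine). Arithmetic on such a type is already expressed as a vector operation, so the backend can select packed instructions without any help from SLP.

#include <hip/hip_runtime.h>

typedef float native_float2 __attribute__((ext_vector_type(2)));

__device__ native_float2 add2(native_float2 a, native_float2 b)
{
    // Element-wise add on the whole vector value; no SLP needed because
    // the operation already reaches the backend in vector form.
    return a + b;
}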

For example, the snippet below generates assembly that uses packed math instructions even when SLP is disabled. (ROCm version 5.2.3; compile line: hipcc --offload-arch=gfx90a --save-temps=obj -O3 -fno-slp-vectorize show_float2_4.cpp)

__global__ void test_float2_instruction(float2* l_1,  float2 r_3){
    l_1[threadIdx.x] = l_1[threadIdx.x] + r_3;
}
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        s_load_dwordx2 s[2:3], s[4:5], 0x8
        v_lshlrev_b32_e32 v2, 3, v0
        s_waitcnt lgkmcnt(0)
        global_load_dwordx2 v[0:1], v2, s[0:1]
        s_waitcnt vmcnt(0)
        v_pk_add_f32 v[0:1], s[2:3], v[0:1]
        global_store_dwordx2 v2, v[0:1], s[0:1]
        s_endpgm

However, three-component vectors such as float3 are special because of their size and alignment requirements: a compiler vector of three floats is padded to the size and alignment of a four-element vector, while float3 must remain a 12-byte type. They therefore cannot be implemented with compiler vectors and have their own struct-based implementation.
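A minimal sketch of the mismatch (assuming clang's ext_vector_type semantics; the typedef name is illustrative):

typedef float native_float3 __attribute__((ext_vector_type(3)));

// A three-element compiler vector is padded to four elements: 16 bytes
// with 16-byte alignment, whereas HIP's float3 is a 12-byte struct with
// 4-byte alignment.
static_assert(sizeof(native_float3) == 16, "padded to four elements");
static_assert(alignof(native_float3) == 16, "over-aligned for float3");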

Therefore, without SLP, operations on three-component vectors are compiled as individual scalar operations. For example, three separate additions are generated for the code snippet below. (The compile line and flags are the same as above.)

__global__ void test_float3_instruction(float3* l_1, float3 r_3){
    l_1[threadIdx.x] = l_1[threadIdx.x] + r_3;
}
        s_load_dwordx2 s[6:7], s[4:5], 0x0
        s_load_dwordx4 s[0:3], s[4:5], 0x10
        s_waitcnt lgkmcnt(0)
        v_mad_u64_u32 v[4:5], s[4:5], v0, 12, s[6:7]
        global_load_dwordx3 v[0:2], v[4:5], off
        s_waitcnt vmcnt(0)
        v_add_f32_e32 v0, s0, v0
        v_add_f32_e32 v1, s1, v1
        v_add_f32_e32 v2, s2, v2
        global_store_dwordx3 v[4:5], v[0:2], off
        s_endpgm

A possible improvement is to implement float3 as a type composed of a float2 and a separate float. This yields packed instructions for at least two of the three components.

struct fast_float3
{
    typedef float __attribute__((ext_vector_type(2))) Native_float2_;

    // Overlay a packed (float2, float) pair with the usual x/y/z view;
    // the packed attribute keeps the layout at 12 bytes, matching float3.
    union
    {
        struct __attribute__((packed)) { Native_float2_ dxy; float dz; };
        struct { float x, y, z; };
    };

    __host__ __device__
    fast_float3() = default;

    __host__ __device__
    fast_float3(float x_, float y_, float z_) : dxy{ x_, y_ }, dz{ z_ } {}

    __host__ __device__
    fast_float3(Native_float2_ xy_, float z_) : dxy{ xy_ }, dz{ z_ } {}

    __host__ __device__
    fast_float3& operator=(const fast_float3& x)
    {
        dxy = x.dxy;
        dz = x.dz;
        return *this;
    }
};

__forceinline__ __host__ __device__
fast_float3 operator+(fast_float3 x, fast_float3 y)
{
    // One packed add for the (x, y) pair, one scalar add for z.
    return fast_float3{ x.dxy + y.dxy, x.dz + y.dz };
}

__global__ void test_float3_instruction(fast_float3* l_1,  fast_float3 r_3){
    l_1[threadIdx.x] = l_1[threadIdx.x] + r_3;
}
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        s_load_dword s6, s[4:5], 0x10
        s_load_dwordx2 s[2:3], s[4:5], 0x14
        s_waitcnt lgkmcnt(0)
        v_mad_u64_u32 v[4:5], s[0:1], v0, 12, s[0:1]
        global_load_dwordx3 v[0:2], v[4:5], off
        s_mov_b32 s7, s2
        s_waitcnt vmcnt(0)
        v_pk_add_f32 v[0:1], s[6:7], v[0:1]
        v_add_f32_e32 v2, s3, v2
        global_store_dwordx2 v[4:5], v[0:1], off
        global_store_dword v[4:5], v2, off offset:8
        s_endpgm

The original investigation and the workaround are due to Anton Gorenko.

Performance Evaluation

I ported these changes to the SYCL backend and measured performance with water boxes of different sizes. The results show no significant effect from turning off SLP; however, the optimization above still produces a significant speed-up for medium-sized problems:

(Figure: relative performance across problem sizes)

Conclusion

Based on this, I believe the proposed workaround is worth implementing. However, since the optimization is not GROMACS-specific, it might be best contributed to hipSYCL directly.
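For illustration, the hipSYCL-side analogue could look roughly like the sketch below (a minimal sketch assuming SYCL 2020 headers; the names are illustrative, and the 12-byte packing detail, which the HIP version handles with a packed union, is omitted for brevity):

#include <sycl/sycl.hpp>

// Mirror the HIP fast_float3 idea: keep the packable (x, y) pair in a
// two-wide sycl::vec so element-wise arithmetic can lower to packed
// math, and handle the remaining z component as a scalar.
struct fast_float3
{
    sycl::float2 dxy;
    float        dz;
};

inline fast_float3 operator+(fast_float3 a, fast_float3 b)
{
    // One packed add for (x, y), one scalar add for z.
    return { a.dxy + b.dxy, a.dz + b.dz };
}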
