As noted on #4854 (closed), moving back from SoA to AoS for the fCi register buffer resolves most (although not all) of the regressions between 5.3 and 5.5/5.6. There is a clear change in register allocation in ROCm >5.5 which seems to get confused by the SoA buffers but plays quite well with AoS.
It would still be desirable to have a much better in-depth understanding of what is happening here so we can avoid it (and/or the compiler can be improved); @jdmaia, could you help with that? In particular, since the innermost i-loop is fully unrolled, naively I'd think the compiler should be able to allocate the i-force accumulation buffers however best suits later operations.
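For reference, a minimal sketch of the two layouts being compared (the names and the cluster size below are illustrative, not the actual kernel source): the SoA variant keeps three separate per-component accumulator arrays, while the AoS variant keeps one xyz struct per i-atom. Since the i-loop is fully unrolled, every element should end up register-resident either way:

```cpp
// Illustrative sketch only, not the actual GROMACS kernel code;
// fCiBufX/fCiBuf and c_numIAtoms are made-up names for this example.
#include <cstdio>

constexpr int c_numIAtoms = 8; // i-atoms per super-cluster; assumed value

struct Float3
{
    float x, y, z;
};

int main()
{
    // Stand-in for the per-pair forces computed inside the kernel.
    Float3 f[c_numIAtoms];
    for (int i = 0; i < c_numIAtoms; i++)
    {
        f[i] = { 1.0f * i, 2.0f * i, 3.0f * i };
    }

    // SoA: one accumulator array per force component.
    float fCiBufX[c_numIAtoms] = {};
    float fCiBufY[c_numIAtoms] = {};
    float fCiBufZ[c_numIAtoms] = {};

    // AoS: one packed xyz accumulator per i-atom.
    Float3 fCiBuf[c_numIAtoms] = {};

    // The innermost i-loop is fully unrolled in the kernel, so each
    // element below should live in its own registers in both layouts.
    for (int i = 0; i < c_numIAtoms; i++)
    {
        fCiBufX[i] += f[i].x; // SoA accumulation
        fCiBufY[i] += f[i].y;
        fCiBufZ[i] += f[i].z;

        fCiBuf[i].x += f[i].x; // AoS accumulation
        fCiBuf[i].y += f[i].y;
        fCiBuf[i].z += f[i].z;
    }

    printf("%g %g\n",
           fCiBufX[1] + fCiBufY[1] + fCiBufZ[1],
           fCiBuf[1].x + fCiBuf[1].y + fCiBuf[1].z);
    return 0;
}
```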
With a better understanding of what is happening here and of the differences between ROCm 5.3/5.4 and >5.5, we should be able to decide what the path forward is in terms of code transformation for main, and whether we can find an acceptable minimal solution for 2023.
I validated that there is no regression on the HIP fork for adh-cubic w/ PME on MI210s between ROCm 5.4.3 and 5.6.1, which makes sense in light of your comment that the regression is triggered by expressing fCi as SoA: we never do that on the HIP fork.
Both ROCm versions do around 250 ns/day, which is not ideal, but the MI210 is a nerfed MI250x, and there is no discrepancy between the two versions. It would be good to test on Dardel as well; I'll see if there's a way to get myself into an allocation there.
I'll follow back w/ data from the SYCL kernels.
PS @pszilard: is there an easy, automated way to run all flavors of the nonbonded kernel to check for regressions in particular flavors?
> As noted on #4854 (closed), moving back from SoA to AoS for the fCi register buffer resolves most (although not all) of the regressions between 5.3 and 5.5/5.6. There is a clear change in register allocation in ROCm >5.5 which seems to get confused by the SoA buffers but plays quite well with AoS.
A concrete case where the SoA→AoS change does not fully resolve the regression is Analytical Ewald with LJ Force Switch.
I validated the regression with ROCm 5.5.1 on an MI250x node. Performance on adh-cubic drops from 202 to 186 ns/day, yikes.
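For anyone reproducing this, the flavor should be reachable with an .mdp fragment along these lines (the cutoff values are placeholders, not the adh-cubic settings), plus forcing the analytical Ewald path with GMX_NBNXN_EWALD_ANALYTICAL=1 if it is not already the default on this hardware:

```
; Sketch of an .mdp fragment to hit the Analytical Ewald + LJ Force Switch flavor
coulombtype  = PME
vdwtype      = Cut-off
vdw-modifier = Force-switch
rvdw-switch  = 0.9
rvdw         = 1.0
rcoulomb     = 1.0
```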
Let me figure out what's wrong w/ the split version first so I can discuss this w/ the compiler team.
OK, here's what I'm seeing across ROCm versions for the nonbonded kernel:
| Metric | ROCm 5.4.3 | ROCm 5.5.1 |
| --- | --- | --- |
| NumSGPRs | 54 | 58 |
| NumVGPRs | 76 | 78 |
| CodeLenInByte | 18452 | 22624 |
| Occupancy | 6 | 6 |
There is indeed an increase in register pressure (4 more scalar and 2 more vector registers), but it's not large enough to decrease the number of waves per SIMD (6 on both ROCm versions), so IMO it shouldn't be impacting performance.
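(Back-of-the-envelope, if I have the gfx90a numbers right: with 512 VGPRs per SIMD and an allocation granularity of 8, both 76 and 78 VGPRs round up to 80, and floor(512 / 80) = 6 waves per SIMD, matching the Occupancy row above.)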
What I'm seeing is a very large increase in the size of the generated shader (in number of instructions), which leads me to believe that either the compiler changed how it decides to inline or unroll the code, or the register allocation changed and it now has to emit extra instructions to pack plain floats into consecutive registers.
I still don't have a definite answer, but I'm eyeballing the ISA to figure out what went wrong.
Attaching the ISA here for now.
It sounds like the main issue is that, as we expected, the compiler is aggressively optimizing the accumulation of fCi to get packed math. On ROCm 5.5.1 there are a lot of extra v_pk_mov instructions generated after the fully unrolled i-loop which are not present on 5.4.3; presumably the packed instructions want their operands in consecutive register pairs, so values that weren't allocated that way have to be shuffled into place first. I'm opening a ticket with the compiler team right now.
> OK, here's what I'm seeing across ROCm versions for the nonbonded kernel:
Which flavor of the kernel were you referring to?
As I noted above, moving back to AoS improves performance in many of the kernel flavors, but in some cases (see above) the improvement recovers less than the observed regression, even after manual packing.
That suggests the SoA→AoS change may not directly address the root cause of the slowdown (or that another source of regression dominates in kernels like the LJ Force Switch one).