As noted on #4854 (closed), moving back from SoA to AoS for the fCi register buffer resolves most (although not all) of the regressions between 5.3 and 5.5/5.6. There is a clear change in register allocation in ROCm >5.5 which seems to get confused by the SoA buffers but plays quite well with AoS.
It would still be desirable to have a much better in-depth understanding of what is happening here so we can avoid it (and/or the compiler can be improved); @jdmaia, could you help with that? In particular, since the innermost i-loop is fully unrolled, naively I'd think the compiler should be able to allocate the i-force accumulation buffers however best suits later operations.
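For reference, a minimal sketch of the two layouts being compared (the names and the cluster size below are illustrative, not the actual kernel source): the SoA variant keeps three separate per-component accumulator arrays, while the AoS variant keeps one xyz struct per i-atom. Since the i-loop is fully unrolled, every element should end up register-resident either way:

```cpp
// Illustrative sketch only, not the actual GROMACS kernel code;
// fCiBufX/fCiBuf and c_numIAtoms are made-up names for this example.
#include <cstdio>

constexpr int c_numIAtoms = 8; // i-atoms per super-cluster; assumed value

struct Float3
{
    float x, y, z;
};

int main()
{
    // Stand-in for the per-pair forces computed inside the kernel.
    Float3 f[c_numIAtoms];
    for (int i = 0; i < c_numIAtoms; i++)
    {
        f[i] = { 1.0f * i, 2.0f * i, 3.0f * i };
    }

    // SoA: one accumulator array per force component.
    float fCiBufX[c_numIAtoms] = {};
    float fCiBufY[c_numIAtoms] = {};
    float fCiBufZ[c_numIAtoms] = {};

    // AoS: one packed xyz accumulator per i-atom.
    Float3 fCiBuf[c_numIAtoms] = {};

    // The innermost i-loop is fully unrolled in the kernel, so each
    // element below should live in its own registers in both layouts.
    for (int i = 0; i < c_numIAtoms; i++)
    {
        fCiBufX[i] += f[i].x; // SoA accumulation
        fCiBufY[i] += f[i].y;
        fCiBufZ[i] += f[i].z;

        fCiBuf[i].x += f[i].x; // AoS accumulation
        fCiBuf[i].y += f[i].y;
        fCiBuf[i].z += f[i].z;
    }

    printf("%g %g\n",
           fCiBufX[1] + fCiBufY[1] + fCiBufZ[1],
           fCiBuf[1].x + fCiBuf[1].y + fCiBuf[1].z);
    return 0;
}
```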
With a better understanding of what is happening here and of the differences between ROCm 5.3/5.4 and >5.5, we should be able to decide what the path forward is in terms of code transformation for main, and whether we can find an acceptable minimal solution for 2023.
I validated that there is no regression on the HIP fork for adh-cubic w/ PME on MI210s between ROCm 5.4.3 and 5.6.1, which makes sense in light of your comment that the regression is triggered by expressing fCi as SoA: we never do that on the HIP fork.
Both ROCm versions do around 250 ns/day, which is not ideal, but the MI210 is a nerfed MI250x, and there is no discrepancy between the two versions. It would be good to test on Dardel as well; I'll see if there's a way to get myself into an allocation there.
I'll follow back w/ data from the SYCL kernels.
PS @pszilard: is there an easy, automated way to run all flavors of the nonbonded kernel to check for regressions in particular flavors?
> As noted on #4854 (closed), moving back from SoA to AoS for the fCi register buffer resolves most (although not all) of the regressions between 5.3 and 5.5/5.6. There is a clear change in register allocation in ROCm >5.5 which seems to get confused by the SoA buffers but plays quite well with AoS.
A concrete case where the SoA→AoS change does not fully resolve the regression is Analytical Ewald with LJ Force Switch.
I validated the regression with ROCm 5.5.1 on an MI250x node. Performance on adh-cubic drops from 202 to 186 ns/day, yikes.
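For anyone reproducing this, the flavor should be reachable with an .mdp fragment along these lines (the cutoff values are placeholders, not the adh-cubic settings), plus forcing the analytical Ewald path with GMX_NBNXN_EWALD_ANALYTICAL=1 if it is not already the default on this hardware:

```
; Sketch of an .mdp fragment to hit the Analytical Ewald + LJ Force Switch flavor
coulombtype  = PME
vdwtype      = Cut-off
vdw-modifier = Force-switch
rvdw-switch  = 0.9
rvdw         = 1.0
rcoulomb     = 1.0
```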
Let me figure out what's wrong w/ the split version first so I can discuss this w/ the compiler team.
OK, here's what I'm seeing across ROCm versions for the nonbonded kernel:
| Metric | ROCm 5.4.3 | ROCm 5.5.1 |
| --- | --- | --- |
| NumSGPRs | 54 | 58 |
| NumVGPRs | 76 | 78 |
| CodeLenInByte | 18452 | 22624 |
| Occupancy | 6 | 6 |
There is indeed an increase in register pressure (4 more scalar and 2 more vector registers), but it's not large enough to decrease the number of waves per SIMD (6 on both ROCm versions), so IMO it shouldn't be impacting performance.
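(Back-of-the-envelope, if I have the gfx90a numbers right: with 512 VGPRs per SIMD and an allocation granularity of 8, both 76 and 78 VGPRs round up to 80, and floor(512 / 80) = 6 waves per SIMD, matching the Occupancy row above.)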
What I'm seeing is a very large increase in the size of the generated shader (in number of instructions), which leads me to believe that either the compiler changed how it decides to inline or unroll the code, or the register allocation changed and it now has to emit extra instructions to pack plain floats into consecutive registers.
I still don't have a definite answer, but I'm eyeballing the ISA to figure out what went wrong.
Attaching the ISA here for now.
It sounds like the main issue is that, as we expected, the compiler is aggressively optimizing the accumulation of fCi to get packed math. On ROCm 5.5.1 there are a lot of extra v_pk_mov instructions generated after the fully unrolled i-loop which are not present on 5.4.3; presumably the packed instructions want their operands in consecutive register pairs, so values that weren't allocated that way have to be shuffled into place first. I'm opening a ticket with the compiler team right now.
> OK, here's what I'm seeing across ROCm versions for the nonbonded kernel:
Which flavor of the kernel were you referring to?
As I noted above, moving back to AoS improves performance in many of the kernel flavors, but in some cases (see above) the improvement recovers less than the observed regression, even after manual packing.
That suggests the SoA→AoS change may not directly address the root cause of the slowdown (or that another source of regression dominates in kernels like the LJ Force Switch one).