Compiler (?) issue with `CalculateCellCenteredField`
The partition manager fails for CUDA runs when using more than one OpenMP thread in combination with a different loop pattern, see https://gitlab.com/pgrete/kathena/-/jobs/214674971. Using a single OpenMP thread with CUDA works fine. Pure OpenMP (non-CUDA) runs also show no problem.
After bisecting the kernels, the problem seems to stem from the `CalculateCellCenteredField` function.
In particular, the `else` branch
```c++
if (uniform_ave_x1 == true) {
  lw = 0.5;
  rw = 0.5;
} else {
  const Real& x1f_i = pco_x1f(i);
  const Real& x1f_ip = pco_x1f(i+1);
  const Real& x1v_i = pco_x1v(i);
  const Real& dx1_i = pco_dx1f(i);
  lw = (x1f_ip - x1v_i)/dx1_i;
  rw = (x1v_i - x1f_i)/dx1_i;
}
```
which is never executed at this point (for PLM in Cartesian coordinates `uniform_ave_x1` is always `true`), causes the problem.
Just commenting out these 8 lines results in correct code (i.e., the partmgr test succeeds using >1 OpenMP threads + CUDA).
This points towards a compiler problem.
@glinesfo brought up that this may be related to register spilling.
As an interim workaround, refactoring the function (going from a single kernel with branches to separate kernels, see 6c1e5984) seems to solve the problem.
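
For illustration, the idea behind the refactoring can be sketched roughly as follows. This is not the code from 6c1e5984: plain C++ loops stand in for the kernels, the setup is reduced to 1D, and `Coords1D`, `CellCenterBranchInside`, and `CellCenterSplit` are made-up names. The final averaging line `lw*bf[i] + rw*bf[i+1]` is also an assumption about how the weights are used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Real = double;

// Hypothetical stand-in for the coordinate arrays (pco_x1f, pco_x1v, pco_dx1f).
struct Coords1D {
  std::vector<Real> x1f;   // face positions, size n + 1
  std::vector<Real> x1v;   // cell-center positions, size n
  std::vector<Real> dx1f;  // cell widths, size n
};

// Variant A: a single loop ("kernel") that branches inside the loop body.
std::vector<Real> CellCenterBranchInside(const std::vector<Real>& bf,
                                         const Coords1D& co,
                                         bool uniform_ave_x1) {
  const std::size_t n = co.x1v.size();
  assert(bf.size() == n + 1 && co.x1f.size() == n + 1 && co.dx1f.size() == n);
  std::vector<Real> bcc(n);
  for (std::size_t i = 0; i < n; ++i) {
    Real lw, rw;
    if (uniform_ave_x1) {
      lw = 0.5;
      rw = 0.5;
    } else {
      lw = (co.x1f[i + 1] - co.x1v[i]) / co.dx1f[i];
      rw = (co.x1v[i] - co.x1f[i]) / co.dx1f[i];
    }
    bcc[i] = lw * bf[i] + rw * bf[i + 1];  // assumed use of the weights
  }
  return bcc;
}

// Variant B: the branch is taken once outside the loops, so each loop
// ("kernel") body is branch-free and only touches the data it needs.
std::vector<Real> CellCenterSplit(const std::vector<Real>& bf,
                                  const Coords1D& co,
                                  bool uniform_ave_x1) {
  const std::size_t n = co.x1v.size();
  assert(bf.size() == n + 1 && co.x1f.size() == n + 1 && co.dx1f.size() == n);
  std::vector<Real> bcc(n);
  if (uniform_ave_x1) {
    for (std::size_t i = 0; i < n; ++i) {
      bcc[i] = 0.5 * bf[i] + 0.5 * bf[i + 1];
    }
  } else {
    for (std::size_t i = 0; i < n; ++i) {
      const Real lw = (co.x1f[i + 1] - co.x1v[i]) / co.dx1f[i];
      const Real rw = (co.x1v[i] - co.x1f[i]) / co.dx1f[i];
      bcc[i] = lw * bf[i] + rw * bf[i + 1];
    }
  }
  return bcc;
}
```

Both variants should give identical results for either value of `uniform_ave_x1`. The split version only keeps the coordinate loads out of the uniform kernel, which might reduce register pressure if spilling is indeed the cause.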