Commit 66ec44e6 authored by Szilárd Páll, committed by Mark Abraham

Fix mdrun hanging upon exit with sep PME ranks

Commit 1d2d95e3 introduced a check and early return to skip printing perf
stats when no valid wallcycle data was collected (due to a missed reset).
However, because the validity of wallcycle data is not checked or recorded
on separate PME ranks, mdrun deadlocks before exit in a collective
communication that the PME ranks never enter.

This change fixes the hang by refactoring the printing code to use a
boolean rather than an early return. This means the normal code path
is unaffected in all cases (only the simulation master can ever write
reports), and the case where it is invalid to write a report
(premature termination) works correctly because all ranks communicate
the data for the report that is never written (and efficiency is not
of concern in this case).

Fixes #2131

Change-Id: If8b0813444d0b00a1a9a4a21d30fc8655c52752a
parent a2d8c56c
@@ -2574,12 +2574,22 @@ void finish_run(FILE *fplog, t_commrec *cr,
             elapsed_time_over_all_ranks,
             elapsed_time_over_all_threads,
             elapsed_time_over_all_threads_over_all_ranks;
+    /* Control whether it is valid to print a report. Only the
+       simulation master may print, but it should not do so if the run
+       terminated e.g. before a scheduled reset step. This is
+       complicated by the fact that PME ranks are unaware of the
+       reason why they were sent a pmerecvqxFINISH. To avoid
+       communication deadlocks, we always do the communication for the
+       report, even if we've decided not to write the report, because
+       how long it takes to finish the run is not important when we've
+       decided not to report on the simulation performance. */
+    bool printReport = SIMMASTER(cr);
     if (!walltime_accounting_get_valid_finish(walltime_accounting))
     {
         md_print_warn(cr, fplog,
                       "Simulation ended prematurely, no performance report will be written.");
-        return;
+        printReport = false;
     }
     if (cr->nnodes > 1)
@@ -2617,7 +2627,7 @@ void finish_run(FILE *fplog, t_commrec *cr,
     }
 #endif
-    if (SIMMASTER(cr))
+    if (printReport)
     {
         print_flop(fplog, nrnb_tot, &nbfs, &mflop);
     }
@@ -2640,7 +2650,7 @@ void finish_run(FILE *fplog, t_commrec *cr,
     wallcycle_scale_by_num_threads(wcycle, cr->duty == DUTY_PME, nthreads_pp, nthreads_pme);
     auto cycle_sum(wallcycle_sum(cr, wcycle));
-    if (SIMMASTER(cr))
+    if (printReport)
     {
         struct gmx_wallclock_gpu_t* gputimes = use_GPU(nbv) ? nbnxn_gpu_get_timings(nbv->gpu_nbv) : NULL;
......
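
The hang described in the commit message can be illustrated with a minimal, MPI-free sketch. It is not GROMACS code: `Rank`, `Outcome`, `ranksWithSeparatePme`, `finishWithEarlyReturn`, `finishWithFlag` and `collectiveIsConsistent` are hypothetical names introduced only for illustration, and a loop over ranks stands in for the real collective. The assumption, taken from the message, is that a separate PME rank never has a valid finish recorded, so with the early return it skips a collective that the other ranks enter.

```cpp
#include <vector>

// Hypothetical model of one rank; these are not GROMACS types.
struct Rank
{
    bool isSimMaster;         // plays the role of SIMMASTER(cr)
    bool validFinishRecorded; // stands in for walltime_accounting_get_valid_finish()
};

struct Outcome
{
    int ranksInCollective = 0; // ranks that reached the collective call
    int reportsPrinted    = 0; // performance reports written
};

// One master PP rank, one other PP rank, one separate PME rank.
// Assumption from the commit message: the PME rank never records validity.
std::vector<Rank> ranksWithSeparatePme(bool runIsValid)
{
    return { { true, runIsValid }, { false, runIsValid }, { false, false } };
}

// Pre-fix pattern: a rank without valid data returns before the collective.
Outcome finishWithEarlyReturn(const std::vector<Rank>& ranks)
{
    Outcome out;
    for (const Rank& r : ranks)
    {
        if (!r.validFinishRecorded)
        {
            continue; // models the early `return;`
        }
        ++out.ranksInCollective;
        if (r.isSimMaster)
        {
            ++out.reportsPrinted;
        }
    }
    return out;
}

// Post-fix pattern: every rank communicates; a flag only gates the printing.
Outcome finishWithFlag(const std::vector<Rank>& ranks)
{
    Outcome out;
    for (const Rank& r : ranks)
    {
        bool printReport = r.isSimMaster;
        if (!r.validFinishRecorded)
        {
            printReport = false;
        }
        ++out.ranksInCollective; // always take part in the collective
        if (printReport)
        {
            ++out.reportsPrinted;
        }
    }
    return out;
}

// A collective can only finish if all ranks enter it (or none of them do).
bool collectiveIsConsistent(const Outcome& out, const std::vector<Rank>& ranks)
{
    return out.ranksInCollective == 0
           || out.ranksInCollective == static_cast<int>(ranks.size());
}
```

In this model, a valid run with a separate PME rank is inconsistent under `finishWithEarlyReturn` (two of three ranks enter the collective), which corresponds to the reported hang. Under `finishWithFlag` all three ranks enter, and a report is written only when the master has valid data.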