Incorrect results with Nose-Hoover temperature coupling - Redmine #2418
Archive from user: Marvin Bernhardt
I get a segmentation fault when trying to run a simulation on our new
workstation.
Observations (a minimal .mdp fragment with the relevant settings follows this list):
- It only appears when tcoupl = nose-hoover.
- It only appears when -ntmpi > 1 or -ntmpi is unset (i.e. when both processors are used).
- If I do not write the energy at every step, it fails instead with:
  Fatal error:
  3720 particles communicated to PME rank 4 are more than 2/3 times the cut-off
  out of the domain decomposition cell of their charge group in dimension x.
  This usually means that your system is not well equilibrated.
- On at least one other machine with two processors this works fine.
- It does not matter whether I use the GPU or not (-nb cpu).
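For reference, a minimal .mdp fragment for the conditions above; only tcoupl = nose-hoover (and, per the fix below, the absence of Parrinello-Rahman pressure coupling) comes from this issue, while every other value is an illustrative assumption:

    ; Illustrative fragment only: tcoupl and pcoupl reflect this issue;
    ; the remaining settings are assumed for a self-contained example.
    integrator = md
    tcoupl     = nose-hoover
    tc-grps    = System
    tau-t      = 1.0
    ref-t      = 300
    pcoupl     = no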
Since this is machine-dependent, here is the hardware detected, taken from md.log:
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 20] [ 1 21] [ 2 22] [ 3 23] [ 4 24] [ 5 25] [ 6 26] [ 7 27] [ 8 28] [ 9 29]
Socket 1: [ 10 30] [ 11 31] [ 12 32] [ 13 33] [ 14 34] [ 15 35] [ 16 36] [ 17 37] [ 18 38] [ 19 39]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
My colleague told me to compile GROMACS in debug mode, which I did. Here are the output and backtrace, even though I don't understand them:
GROMACS: gmx mdrun, version 2018
Executable: /cluster/local/software/gromacs-2018-debug/bin/gmx
Data prefix: /cluster/local/software/gromacs-2018-debug
Working dir: /home/mbernhardt/run/bug-mdrun-pme-rank
Command line:
gmx mdrun
Back Off! I just backed up md.log to ./#md.log.1#
[... GDB thread start/exit messages omitted ...]
Reading file topol.tpr, VERSION 2018 (single precision)
[... GDB thread start messages omitted ...]
Changing nstlist from 10 to 100, rlist from 1.2 to 1.304
No option -multi (printed once per thread-MPI rank; repeats omitted)
Using 8 MPI threads
Using 5 OpenMP threads per tMPI thread
On host gpu0 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1
[... GDB thread start messages omitted; among them 0x7fff927fa700 (LWP 30048), the thread that later crashes ...]
Back Off! I just backed up traj_comp.xtc to ./#traj_comp.xtc.1#
Back Off! I just backed up ener.edr to ./#ener.edr.1#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'PNiPAMWaterSalt in water'
10 steps, 0.0 ps.
Thread 31 "gmx" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff927fa700 (LWP 30048)]
0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182,
c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
113 Y = vftab[ntab];
(gdb) backtrace
#0 0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182,
c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
#1 0x00007ffff476013b in do_pairs_general (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40,
fr=0x7fffcc0a7590, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:507
#2 0x00007ffff476055c in do_pairs (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40,
fr=0x7fffcc0a7590, bCalcEnergyAndVirial=768, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:698
#3 0x00007ffff47554b1 in (anonymous namespace)::calc_one_bond (thread=2, ftype=33, idef=0x7fffcc22e230, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, fr=0x7fffcc0a7590, pbc=0x7fffe19fbf20, g=0x0, grpp=0x7fffcc0e8808, nrnb=0x7fffcc0a71b0,
lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40, fcd=0x7fffcc04f460, bCalcEnerVir=768, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:389
#4 0x00007ffff4756a60 in calcBondedForces () at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:471
#5 0x00007ffff3a108ee in gomp_thread_start (xdata=<optimized out>) at /build/gcc/src/gcc/libgomp/team.c:120
#6 0x00007ffff35cc08c in start_thread () from /usr/lib/libpthread.so.0
#7 0x00007ffff3303e7f in clone () from /usr/lib/libc.so.6
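A note on the crash site (a reading of the backtrace, not part of the original report): evaluate_single converts the squared pair distance r2 into a table index, so the r2=-nan seen above propagates into that index, and Y = vftab[ntab] then reads far outside the table. A minimal C++ sketch of this failure mode, with tableLookup and all scaffolding invented for illustration:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Sketch only -- not the actual GROMACS evaluate_single. The parameter
    // names (r2, tabscale, vftab, tableStride) mirror the backtrace arguments.
    float tableLookup(float r2, float tabscale,
                      const std::vector<float>& vftab, int tableStride)
    {
        float r    = std::sqrt(r2);          // sqrt of NaN is NaN
        float rtab = r * tabscale;           // NaN propagates
        int   ntab = static_cast<int>(rtab); // NaN-to-int is undefined behavior;
                                             // in practice it yields a garbage index
        return vftab[tableStride * ntab];    // the "Y = vftab[ntab]" step: with a
                                             // garbage index this can segfault
    }

    int main()
    {
        float r2   = std::nanf("");          // mimics r2=-nan(0x7fff18)
        float rtab = std::sqrt(r2) * 500.0f; // tabscale=500 as in the backtrace
        std::printf("rtab is NaN: %s\n", std::isnan(rtab) ? "yes" : "no");
        // Calling tableLookup here would perform the out-of-bounds read,
        // so it is deliberately not invoked.
        return 0;
    }

The NaN distance itself is only the proximate cause; the changeset below traces the origin to the integrator.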
(from redmine: issue id 2418, created on 2018-02-21 by gmxdefault, closed on 2018-02-23)
- Changesets:
- Revision ee8b06ea by Berk Hess on 2018-02-23T13:55:06Z:
Fix md integrator with Nose-Hoover coupling
When applying NH T-coupling at an MD step and no PR P-coupling,
the md integrator could apply pressure scaling with an uninitialized
or outdated PR scaling matrix.
Fixes #2418
Change-Id: I835db72776e7782ac044807961bb899e4f8c6c7b
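A schematic C++ sketch of the bug pattern this commit message describes; all names (State, couplingStep, computeParrinelloRahmanMatrix, scaleCoordinates) are invented for illustration, and this is not the actual GROMACS integrator code:

    #include <array>

    // Shape of the bug only -- not the GROMACS md integrator.
    using Matrix3x3 = std::array<std::array<double, 3>, 3>;

    struct State
    {
        Matrix3x3 prScalingMatrix; // only ever written on P-coupling steps
    };

    Matrix3x3 computeParrinelloRahmanMatrix() { return {}; }
    void scaleCoordinates(const Matrix3x3& /*m*/) { /* scale box/coordinates */ }

    void couplingStep(State& state, bool doTcouple, bool doPcouple)
    {
        if (doPcouple)
        {
            // Only P-coupling steps refresh the scaling matrix.
            state.prScalingMatrix = computeParrinelloRahmanMatrix();
        }

        // Pre-fix behavior: pressure scaling applied on any coupling step,
        // so a Nose-Hoover T-coupling step without Parrinello-Rahman
        // P-coupling used an uninitialized or stale matrix.
        if (doTcouple || doPcouple)
        {
            scaleCoordinates(state.prScalingMatrix);
        }

        // Post-fix behavior would scale only when doPcouple is true:
        // if (doPcouple) { scaleCoordinates(state.prScalingMatrix); }
    }

    int main()
    {
        State state; // prScalingMatrix left uninitialized, as in the bug
        couplingStep(state, /*doTcouple=*/true, /*doPcouple=*/false);
        return 0;
    }

With tcoupl = nose-hoover and no pressure coupling, only the T-coupling branch ever fires, so the matrix is applied without ever being initialized, consistent with the NaN coordinates in the backtrace above.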