Incorrect results with Nose-Hoover temperature coupling - Redmine #2418
Archive from user: Marvin Bernhardt
I get a segmentation fault when trying to run a simulation on our new
workstation.
Observations (a minimal .mdp fragment with the relevant settings follows this list):
- It only appears when tcoupl = nose-hoover.
- It only appears when -ntmpi > 1 or -ntmpi is unset (i.e. when both processors are used).
- If I do not write the energy at every step, it fails instead with:
  Fatal error:
  3720 particles communicated to PME rank 4 are more than 2/3 times the cut-off
  out of the domain decomposition cell of their charge group in dimension x.
  This usually means that your system is not well equilibrated.
- On at least one other machine with two processors this works fine.
- It does not matter whether I use the GPU or not (-nb cpu).
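For reference, a minimal .mdp fragment for the conditions above; only tcoupl = nose-hoover (and, per the fix below, the absence of Parrinello-Rahman pressure coupling) comes from this issue, while every other value is an illustrative assumption:

    ; Illustrative fragment only: tcoupl and pcoupl reflect this issue;
    ; the remaining settings are assumed for a self-contained example.
    integrator = md
    tcoupl     = nose-hoover
    tc-grps    = System
    tau-t      = 1.0
    ref-t      = 300
    pcoupl     = no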
Since this is machine-dependent, here is the hardware detected, taken from md.log:
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Hardware topology: Basic
Sockets, cores, and logical processors:
Socket 0: [ 0 20] [ 1 21] [ 2 22] [ 3 23] [ 4 24] [ 5 25] [ 6 26] [ 7 27] [ 8 28] [ 9 29]
Socket 1: [ 10 30] [ 11 31] [ 12 32] [ 13 33] [ 14 34] [ 15 35] [ 16 36] [ 17 37] [ 18 38] [ 19 39]
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
My colleague told me to compile GROMACS in debug mode, which I did. Here are the output and backtrace, even though I don't understand them:
GROMACS: gmx mdrun, version 2018
Executable: /cluster/local/software/gromacs-2018-debug/bin/gmx
Data prefix: /cluster/local/software/gromacs-2018-debug
Working dir: /home/mbernhardt/run/bug-mdrun-pme-rank
Command line:
gmx mdrun
Back Off! I just backed up md.log to ./#md.log.1#
[... GDB thread start/exit messages omitted ...]
Reading file topol.tpr, VERSION 2018 (single precision)
[... GDB thread start messages omitted ...]
Changing nstlist from 10 to 100, rlist from 1.2 to 1.304
No option -multi (printed once per thread-MPI rank; repeats omitted)
Using 8 MPI threads
Using 5 OpenMP threads per tMPI thread
On host gpu0 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1
[... GDB thread start messages omitted; among them 0x7fff927fa700 (LWP 30048), the thread that later crashes ...]
Back Off! I just backed up traj_comp.xtc to ./#traj_comp.xtc.1#
Back Off! I just backed up ener.edr to ./#ener.edr.1#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'PNiPAMWaterSalt in water'
10 steps, 0.0 ps.
Thread 31 "gmx" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff927fa700 (LWP 30048)]
0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182,
c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
113 Y = vftab[ntab];
(gdb) backtrace
#0 0x00007ffff475dfeb in evaluate_single (r2=-nan(0x7fff18), tabscale=500, vftab=0x7fffcc0b0300, tableStride=12, qq=-2.08403182,
c6=0.00192321674, c12=2.06313848e-06, velec=0x7fff927f9928, vvdw=0x7fff927f992c)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:113
#1 0x00007ffff476013b in do_pairs_general (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40,
fr=0x7fffcc0a7590, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:507
#2 0x00007ffff476055c in do_pairs (ftype=33, nbonds=51, iatoms=0x7fffcc25551c, iparams=0x7fffcc013c90, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, pbc=0x7fffe19fbf20, g=0x0, lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40,
fr=0x7fffcc0a7590, bCalcEnergyAndVirial=768, grppener=0x7fffcc0e8808, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/pairs.cpp:698
#3 0x00007ffff47554b1 in (anonymous namespace)::calc_one_bond (thread=2, ftype=33, idef=0x7fffcc22e230, x=0x7fffcc341500,
f=0x7ffefc23e080, fshift=0x7ffefc000b40, fr=0x7fffcc0a7590, pbc=0x7fffe19fbf20, g=0x0, grpp=0x7fffcc0e8808, nrnb=0x7fffcc0a71b0,
lambda=0x7fffcc22ebb8, dvdl=0x7fffcc0e8840, md=0x7fffcc0fca40, fcd=0x7fffcc04f460, bCalcEnerVir=768, global_atom_index=0x7fffcc2f89f0)
at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:389
#4 0x00007ffff4756a60 in calcBondedForces () at /home/mbernhardt/build/gromacs-2018/src/gromacs/listed-forces/listed-forces.cpp:471
#5 0x00007ffff3a108ee in gomp_thread_start (xdata=<optimized out>) at /build/gcc/src/gcc/libgomp/team.c:120
#6 0x00007ffff35cc08c in start_thread () from /usr/lib/libpthread.so.0
#7 0x00007ffff3303e7f in clone () from /usr/lib/libc.so.6
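A note on the crash site (a reading of the backtrace, not part of the original report): evaluate_single converts the squared pair distance r2 into a table index, so the r2=-nan seen above propagates into that index, and Y = vftab[ntab] then reads far outside the table. A minimal C++ sketch of this failure mode, with tableLookup and all scaffolding invented for illustration:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Sketch only -- not the actual GROMACS evaluate_single. The parameter
    // names (r2, tabscale, vftab, tableStride) mirror the backtrace arguments.
    float tableLookup(float r2, float tabscale,
                      const std::vector<float>& vftab, int tableStride)
    {
        float r    = std::sqrt(r2);          // sqrt of NaN is NaN
        float rtab = r * tabscale;           // NaN propagates
        int   ntab = static_cast<int>(rtab); // NaN-to-int is undefined behavior;
                                             // in practice it yields a garbage index
        return vftab[tableStride * ntab];    // the "Y = vftab[ntab]" step: with a
                                             // garbage index this can segfault
    }

    int main()
    {
        float r2   = std::nanf("");          // mimics r2=-nan(0x7fff18)
        float rtab = std::sqrt(r2) * 500.0f; // tabscale=500 as in the backtrace
        std::printf("rtab is NaN: %s\n", std::isnan(rtab) ? "yes" : "no");
        // Calling tableLookup here would perform the out-of-bounds read,
        // so it is deliberately not invoked.
        return 0;
    }

The NaN distance itself is only the proximate cause; the changeset below traces the origin to the integrator.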
(from redmine: issue id 2418, created on 2018-02-21 by gmxdefault, closed on 2018-02-23)
- Changesets:
- Revision ee8b06ea by Berk Hess on 2018-02-23T13:55:06Z:
Fix md integrator with Nose-Hoover coupling
When applying NH T-coupling at an MD step and no PR P-coupling,
the md integrator could apply pressure scaling with an uninitialized
or outdated PR scaling matrix.
Fixes #2418
Change-Id: I835db72776e7782ac044807961bb899e4f8c6c7b
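A schematic C++ sketch of the bug pattern this commit message describes; all names (State, couplingStep, computeParrinelloRahmanMatrix, scaleCoordinates) are invented for illustration, and this is not the actual GROMACS integrator code:

    #include <array>

    // Shape of the bug only -- not the GROMACS md integrator.
    using Matrix3x3 = std::array<std::array<double, 3>, 3>;

    struct State
    {
        Matrix3x3 prScalingMatrix; // only ever written on P-coupling steps
    };

    Matrix3x3 computeParrinelloRahmanMatrix() { return {}; }
    void scaleCoordinates(const Matrix3x3& /*m*/) { /* scale box/coordinates */ }

    void couplingStep(State& state, bool doTcouple, bool doPcouple)
    {
        if (doPcouple)
        {
            // Only P-coupling steps refresh the scaling matrix.
            state.prScalingMatrix = computeParrinelloRahmanMatrix();
        }

        // Pre-fix behavior: pressure scaling applied on any coupling step,
        // so a Nose-Hoover T-coupling step without Parrinello-Rahman
        // P-coupling used an uninitialized or stale matrix.
        if (doTcouple || doPcouple)
        {
            scaleCoordinates(state.prScalingMatrix);
        }

        // Post-fix behavior would scale only when doPcouple is true:
        // if (doPcouple) { scaleCoordinates(state.prScalingMatrix); }
    }

    int main()
    {
        State state; // prScalingMatrix left uninitialized, as in the bug
        couplingStep(state, /*doTcouple=*/true, /*doPcouple=*/false);
        return 0;
    }

With tcoupl = nose-hoover and no pressure coupling, only the T-coupling branch ever fires, so the matrix is applied without ever being initialized, consistent with the NaN coordinates in the backtrace above.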