libvdwxc + Intel compiler gives core dump
Hi @askhl
I am having trouble building a version of GPAW with libvdwxc support using the Intel compiler: it dumps core in one of the libvdwxc tests. The crash is sometimes in an MPI call and sometimes in MKL, and it happens with both the iomkl toolchain (OpenMPI) and the intel toolchain (Intel MPI).
On Niflheim, the module is GPAW/1.4.0-iomkl-2018b-libvdwxc-Python-3.6.6, located in /home/niflheim/schiotz/easybuild_experimental/broadwell/modules/all. (Note: I thought I had to nuke that folder when the list of modules stopped updating, so I did and then rebuilt. If the module does not appear, try removing the folder from MODULEPATH, logging out and in, adding it back, and logging out and in again. That made things work again; nuking the folder did not.)
The error itself looks like this:
13:09 [slid] ~$ module purge
13:09 [slid] ~$ module load GPAW/1.4.0-iomkl-2018b-libvdwxc-Python-3.6.6
13:10 [slid] ~$ gpaw -P2 test
python-3.6.6 /home/niflheim/schiotz/easybuild_experimental/broadwell/software/GPAW/1.4.0-iomkl-2018b-libvdwxc-Python-3.6.6/bin/gpaw-python
gpaw-1.4.0 /home/niflheim/schiotz/easybuild_experimental/broadwell/software/GPAW/1.4.0-iomkl-2018b-libvdwxc-Python-3.6.6/lib/python3.6/site-packages/gpaw/
ase-3.16.2 /home/niflheim/schiotz/easybuild_experimental/broadwell/software/ASE/3.16.2-iomkl-2018b-Python-3.6.6/lib/python3.6/site-packages/ase-3.16.2-py3.6.egg/ase/
numpy-1.15.0 /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/python3.6/site-packages/numpy-1.15.0-py3.6-linux-x86_64.egg/numpy/
scipy-1.1.0 /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/python3.6/site-packages/scipy-1.1.0-py3.6-linux-x86_64.egg/scipy/
_gpaw built-in
parallel /home/niflheim/schiotz/easybuild_experimental/broadwell/software/GPAW/1.4.0-iomkl-2018b-libvdwxc-Python-3.6.6/bin/gpaw-python
FFTW yes
scalapack yes
libvdwxc yes
PAW-datasets 1: /home/modules/software/GPAW-setups/0.9.20000
2: /home/niflheim/s121839/gpaw-setups
Running tests in /tmp/gpaw-test-gfyq98vk
Jobs: 1, Cores: 2, debug-mode: False
=============================================================================
linalg/gemm_complex.py 0.048 OK
ase_features/ase3k_version.py 0.007 OK
kpt.py 0.011 OK
mpicomm.py 0.008 OK
pathological/numpy_core_multiarray_dot.py 0.007 OK
eigen/cg2.py 0.009 OK
fd_ops/laplace.py 0.000 SKIPPED
linalg/lapack.py 0.008 OK
linalg/eigh.py 0.008 OK
parallel/submatrix_redist.py 0.010 OK
lfc/second_derivative.py 0.015 OK
parallel/parallel_eigh.py 0.007 OK
lfc/gp2.py 0.009 OK
linalg/blas.py 0.009 OK
Gauss.py 0.011 OK
symmetry/check.py 0.412 OK
fd_ops/nabla.py 0.137 OK
linalg/dot.py 0.008 OK
linalg/mmm.py 0.007 OK
xc/lxc_fxc.py 0.008 OK
xc/pbe_pw91.py 0.007 OK
fd_ops/gradient.py 0.010 OK
maths/erf.py 0.007 OK
lfc/lf.py 0.010 OK
maths/fsbt.py 0.030 OK
parallel/compare.py 0.009 OK
vdw/libvdwxc_functionals.py [slid:06024] *** Process received signal ***
[slid:06024] Signal: Segmentation fault (11)
[slid:06024] Signal code: (128)
[slid:06024] Failing at address: (nil)
[slid:06023] *** Process received signal ***
[slid:06023] Signal: Segmentation fault (11)
[slid:06023] Signal code: (128)
[slid:06023] Failing at address: (nil)
[slid:06024] [ 0] [slid:06023] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x7f2483acf6d0]
[slid:06023] /lib64/libpthread.so.0(+0xf6d0)[0x7f33737cd6d0]
[slid:06024] [ 1] [ 1] /home/modules/software/imkl/2018.3.222-iompi-2018b/mkl/lib/intel64/libmkl_intel_lp64.so(fftw_execute_dft_r2c+0x63)[0x7f3379495133]
[slid:06024] [ 2] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate_mpi+0x16)[0x7f3372a75545]
[slid:06024] [ 3] /home/modules/software/imkl/2018.3.222-iompi-2018b/mkl/lib/intel64/libmkl_intel_lp64.so(fftw_execute_dft_r2c+0x63)[0x7f2489797133]
[slid:06023] [ 2] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate_mpi+0x16)[0x7f2482d77545]
[slid:06023] [ 3] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate_anyspin+0x66)[0x7f2482d76622]
[slid:06023] [ 4] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate_anyspin+0x66)[0x7f3372a74622]
[slid:06024] [ 4] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate+0x2f)[0x7f3372a74585]
[slid:06024] [ 5] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/libvdwxc/0.3.2-iomkl-2018b/lib/libvdwxc.so.0(vdwxc_calculate+0x2f)[0x7f2482d76585]
[slid:06023] [ 5] gpaw-python(libvdwxc_calculate+0x80)[0x44f5b0]
[slid:06023] [ 6] gpaw-python(libvdwxc_calculate+0x80)[0x44f5b0]
[slid:06024] [ 6] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0xb6)[0x7f2481cb5e36]
[slid:06023] [ 7] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0xb6)[0x7f33719b3e36]
[slid:06024] [ 7] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/libpython3.6m.so.1.0(+0x1a97ac)[0x7f2481d637ac]
....... cut ........
[slid:06023] [29] /home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0xb6)[0x7f33719b3e36]
[slid:06024] *** End of error message ***
/home/niflheim/schiotz/easybuild_experimental/broadwell/software/Python/3.6.6-iomkl-2018b/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0xb6)[0x7f2481cb5e36]
[slid:06023] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node slid exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
13:10 [slid] ~$
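Because both ranks write their backtraces at the same time, the frames above are interleaved and hard to read. As a reading aid, here is a minimal sketch (my own helper, not part of GPAW, libvdwxc or Open MPI) that pulls the symbol names out of such a log in order of first appearance. The function name `call_chain` and the shortened sample lines are made up for illustration; the sample keeps only the library basenames from the log above.

```python
import re

# Matches "(symbol+0xOFFSET)" as printed in the backtrace frames.
# Frames without a symbol name, such as "(+0xf6d0)", are skipped.
FRAME = re.compile(r"\(([A-Za-z_]\w*)\+0x[0-9a-fA-F]+\)")


def call_chain(log_text):
    """Return symbol names in order of first appearance, without duplicates."""
    seen = []
    for name in FRAME.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Abbreviated sample: paths shortened to library basenames (assumption for brevity).
sample = """\
[slid:06023] [ 0] libpthread.so.0(+0xf6d0)[0x7f33737cd6d0]
[slid:06024] [ 1] libmkl_intel_lp64.so(fftw_execute_dft_r2c+0x63)[0x7f3379495133]
[slid:06024] [ 2] libvdwxc.so.0(vdwxc_calculate_mpi+0x16)[0x7f3372a75545]
[slid:06023] [ 1] libmkl_intel_lp64.so(fftw_execute_dft_r2c+0x63)[0x7f2489797133]
"""

if __name__ == "__main__":
    for name in call_chain(sample):
        print(name)
```

In the log above, the innermost named frame on both ranks is fftw_execute_dft_r2c in libmkl_intel_lp64.so, reached from vdwxc_calculate_mpi in libvdwxc.so.0.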