Mpi master merge

bchareyre requested to merge mpi-master-merge into master

Current status:

  1. The merge py3(master)+mpi builds and passes all previous tests, with and without MPI enabled. I had to change some collider-related tests, since the mpi branch contains a number of collider fixes and improvements. Namely: detection no longer occurs repeatedly at iterations 1, 2 and 3; extended bounds and collider striding start from the beginning; and interactions are no longer left floating in limbo and never erased. These changes make it impossible to reproduce the previous (bad) behavior, so the tests detected the differences. I changed colliderCorrectness.py to check real interactions: the internals of the collider have no reason to be constant across versions, since they depend on multiple parameters and the algorithms can be (and will be) improved.

  2. There is a checkTest for distributed-memory execution; thanks to the interactive mode, it can run without mpiexec. The script is execfile'd by checkList.py, then it spawns workers and distributes memory among them.

  3. The check-test succeeds on 16.04 and Buster; it fails on 18.04 and Stretch, for the same reason in both cases. It is hard to tell whether this is specific to MPI inside docker (could be) or a general problem with mpi3/mpi4py on those distros. It will need tests on local systems (I'm testing 16.04 only).

###################################
running:  checkMPI.py
1;Master: I will spawn 3 workers
 checkMPI.py  failure, caught exception  Exception :  MPI_ERR_SPAWN: could not spawn processes 
###################################
  4. The build fails for suse15 because the log is too long (Eigen3 warnings). I see no obvious relation to MPI.

  5. I suggest merging the current state into master ASAP, since it brings some general improvements and waiting longer will make the merge even more difficult. We can disable the MPI checkTest selectively on some distros so that it does not block the pipeline and still gets tested elsewhere (done). Not sure what to do with suse15, but it seems unrelated to this branch.
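
For illustration only, selectively skipping the MPI checkTest could look like the sketch below. This is not the mechanism actually used in the pipeline; the names `MPI_CHECK_SKIPPED_ON`, `parse_os_release` and `should_skip_mpi_check` are hypothetical, and the distro list only mirrors the failures reported above (Ubuntu 18.04 and Debian Stretch, i.e. Debian 9).

```python
# Hypothetical sketch: decide whether checkMPI.py should be skipped on this distro.
# The (ID, VERSION_ID) pairs mirror the failures reported above; they are an
# assumption for illustration, not the configuration used by the real pipeline.
MPI_CHECK_SKIPPED_ON = {("ubuntu", "18.04"), ("debian", "9")}


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Ignore blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def should_skip_mpi_check(os_release_text):
    """Return True when the distro described by os_release_text is in the skip list."""
    info = parse_os_release(os_release_text)
    distro = (info.get("ID", ""), info.get("VERSION_ID", ""))
    return distro in MPI_CHECK_SKIPPED_ON
```

A check script would typically feed it the contents of /etc/os-release and skip checkMPI.py when it returns True, so the other check scripts still run on every distro.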

EDIT: the spawn failure could be explained by mpi4py having been built against an OpenMPI version different from the installed one.
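
One way to test that hypothesis on a given machine is to compare the MPI version mpi4py reports from its build with the library actually loaded at run time. The sketch below assumes that `MPI.get_vendor()` reflects the compile-time MPI and that `MPI.Get_library_version()` returns an Open MPI string starting like `Open MPI v2.1.1, ...`; both are my understanding of mpi4py and Open MPI rather than something verified in this pipeline. The helper names are hypothetical.

```python
import re


def parse_openmpi_runtime_version(library_version):
    """Extract (major, minor, patch) from an Open MPI library version string.

    Assumes the string looks like 'Open MPI v2.1.1, package: ...';
    returns None when it does not match (e.g. another MPI vendor).
    """
    match = re.search(r"Open MPI v(\d+)\.(\d+)\.(\d+)", library_version)
    return tuple(int(part) for part in match.groups()) if match else None


def build_matches_runtime(build_version, library_version):
    """True when the build-time version tuple equals the parsed run-time version."""
    runtime = parse_openmpi_runtime_version(library_version)
    return runtime is not None and tuple(build_version) == runtime


def report_mpi4py_mismatch():
    """Print build-time and run-time MPI versions; return True when they agree."""
    # Imported lazily so the pure helpers above work without mpi4py installed.
    from mpi4py import MPI

    vendor, build_version = MPI.get_vendor()      # assumed: compile-time information
    library_version = MPI.Get_library_version()   # description of the loaded library
    ok = vendor == "Open MPI" and build_matches_runtime(build_version, library_version)
    print("build:  ", vendor, build_version)
    print("runtime:", library_version.splitlines()[0])
    print("match" if ok else "MISMATCH (or not Open MPI)")
    return ok
```

Running `report_mpi4py_mismatch()` with python3 on an affected 18.04 or Stretch image, and again after reinstalling mpi4py, would show whether the two versions actually differ.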

EDIT2: I could reproduce the MPI checkScript failure locally on Ubuntu 18.04. A workaround is to uninstall python3-mpi4py and reinstall it with pip. Strangely, this procedure fails in the pipeline (https://gitlab.com/yade-dev/trunk/-/jobs/245814147; possible solution: https://github.com/docker/compose/issues/6617). Even stranger, the following procedure, which ends up back on the apt package, also fixed the problem locally:

  • apt remove python3-mpi4py
  • pip3 install mpi4py
  • pip3 uninstall mpi4py
  • apt install python3-mpi4py
Edited by bchareyre