
BUG: WL simulations sometimes hang when using multiprocessing

Summary

When running multiple WL simulations in parallel with the help of the multiprocessing module, which is convenient when using binning, some of the processes never start. This is indicated by the fact that no data container is created for some of the energy windows. Possibly this only occurs when restarting jobs, since the likely cause is that the logging and multiprocessing modules are not entirely compatible.

Steps to reproduce

(Re)start multiple WL simulations using, e.g., multiprocessing.Pool.map(), and compare the number of active processes with the number of unconverged simulations. In some cases there are more unconverged simulations than active processes.
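A minimal sketch of the reproduction setup, assuming one worker per energy window. `run_window` is a hypothetical placeholder for constructing and running a WangLandauEnsemble restricted to one window; it does not call icet. Using `Pool.map_async(...).get(timeout=...)` instead of `Pool.map()` makes a hung worker surface as a `multiprocessing.TimeoutError` rather than blocking forever.

```python
import multiprocessing as mp


def run_window(window):
    # Hypothetical placeholder: the real workflow would build and run a
    # WangLandauEnsemble limited to `window` and write its data container.
    lower, upper = window
    return (lower, upper, "done")


def run_all(windows, timeout=60.0):
    # One worker process per energy window, as when using binning.
    with mp.Pool(processes=len(windows)) as pool:
        async_result = pool.map_async(run_window, windows)
        # get() raises multiprocessing.TimeoutError if any worker hangs.
        return async_result.get(timeout=timeout)


if __name__ == "__main__":
    # Overlapping windows; the numbers are illustrative only.
    windows = [(-10.0, -5.0), (-6.0, -1.0), (-2.0, 3.0)]
    print(run_all(windows))
```

Comparing the returned list (or the data containers written) with the list of unconverged windows shows which processes never completed.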

What is the current bug behavior?

Because an icet.input_output.logging_tools.logger instance is created when a WangLandauEnsemble is initialized, and because the logging module uses locks, some processes may hang, as explained in the Python bug report referenced earlier: a process that is forked while a lock is held inherits that lock in its locked state and blocks indefinitely when it tries to acquire it. In this particular case, hangs are most likely during restarts, since the logger issues a warning when it discovers that the simulation has already converged.
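The underlying mechanism can be illustrated without icet or logging. In this sketch a plain `threading.Lock` stands in for the internal lock of a logging handler (an assumption for illustration, not icet code), and it requires the POSIX-only `fork` start method. The child uses a timeout so that the demo reports the problem instead of actually hanging.

```python
import multiprocessing as mp
import threading

# Stand-in for the internal lock held by a logging handler (illustration only).
_lock = threading.Lock()


def _child(result_queue):
    # The forked child got a copy of _lock in whatever state it had at fork time.
    # A timeout is used so the demo reports the problem instead of hanging.
    result_queue.put(_lock.acquire(timeout=1.0))


def lock_state_seen_by_forked_child():
    ctx = mp.get_context("fork")  # POSIX only; 'fork' copies the parent's memory
    queue = ctx.Queue()
    with _lock:  # the lock is held at the moment the child is forked
        proc = ctx.Process(target=_child, args=(queue,))
        proc.start()
    # The parent has released its own copy, but the child's copy stays locked.
    acquired = queue.get()
    proc.join()
    return acquired


if __name__ == "__main__":
    print(lock_state_seen_by_forked_child())
```

Without the timeout, the child would block on `acquire()` forever, which matches the symptom of processes that never produce a data container.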

What is the expected correct behavior?

One should be able to run, and rerun, multiple WL simulations without any risk of some of them hanging.

Possible fixes

Use warnings.warn, instead of creating a logger and issuing a logger.warning, to inform the user that the simulation has already converged.
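A minimal sketch of the proposed fix. `WangLandauRestartSketch` is a hypothetical stand-in for the restart check in WangLandauEnsemble, and the warning text is illustrative; the point is that the notice is emitted with `warnings.warn` rather than through a logger and its handlers.

```python
import warnings


class WangLandauRestartSketch:
    """Hypothetical stand-in for the restart check in WangLandauEnsemble."""

    def __init__(self, converged: bool):
        self.converged = converged
        if self.converged:
            # No logger or logging handler is created here; the user is
            # informed through the warnings module instead.
            warnings.warn(
                "The simulation has already converged.",
                UserWarning,
                stacklevel=2,
            )
```

Users who restart many converged simulations on purpose can suppress the message with `warnings.filterwarnings("ignore", category=UserWarning)` or scope it with `warnings.catch_warnings()`.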
