.. _parallel_runs:

=============
Parallel runs
=============

.. toctree::
    :maxdepth: 1

.. _parallel_running_jobs:

Running jobs in parallel
========================

Parallel calculations are done primarily with MPI.  The parallelization can
be done over the **k**-points, bands, and using real-space domain
decomposition.  The code will try to make a sensible domain decomposition
that matches both the number of processors and the size of the unit cell.
This choice can be overruled, see :ref:`manual_parallelization_types`.
Complementary OpenMP parallelization can improve the performance in some
cases, see :ref:`manual_openmp`.

Before starting a parallel calculation, it might be useful to check how the
parallelization corresponding to the given number of processes would be done
with the ``--dry-run=N`` command line option::

    $ gpaw python --dry-run=8 script.py

The output will also contain the "Calculator" RAM memory estimate per
process.

In order to run GPAW in parallel, you do one of these three::

    $ mpiexec -n <cores> gpaw python script.py
    $ gpaw -P <cores> python script.py
    $ mpiexec -n <cores> python3 script.py

The first two are the recommended ones:  The *gpaw* script will make sure
that imports are done in an efficient way.


Submitting a job to a queuing system
====================================

You can write a shell-script that contains this line::

    mpiexec gpaw python script.py

and then submit that with sbatch, qsub or some other command.

Alternatives:

* If you are on a SLURM system:  use the :ref:`sbatch` sub-command of the
  ``gpaw`` command-line tool::

      $ gpaw sbatch -- [sbatch options] script.py [script options]

* Use MyQueue_::

      $ mq submit "script.py [script options]" -R <resources>

* Write your own *submit* script.  See this example:
  :git:`doc/platforms/gbar/qsub.py`.

.. _MyQueue: https://myqueue.readthedocs.io/
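
In the commands above, ``script.py`` is an ordinary GPAW Python script.  A
minimal sketch (the water molecule, the FD mode and the file names are
arbitrary choices made for this illustration) could look like this::

    from ase.build import molecule
    from gpaw import GPAW

    atoms = molecule('H2O')
    atoms.center(vacuum=3.0)
    atoms.calc = GPAW(mode='fd',      # real-space finite-difference mode
                      xc='PBE',
                      txt='h2o.txt')  # one output file, written by the master
    energy = atoms.get_potential_energy()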

Alternative submit tool
=======================

Alternatively, the script ``gpaw-runscript`` can be used.  Try::

   $ gpaw-runscript -h

to see the implemented architectures and the available options.  As an
example, use::

   $ gpaw-runscript script.py 32

to write a job submission script that runs ``script.py`` on 32 cpus.

The tool tries to guess the architecture/host automatically.  By default it
uses the following environment variables to write the runscript:

===============  ===================================
variable         meaning
===============  ===================================
HOSTNAME         name used to assign host type
PYTHONPATH       path for Python
GPAW_SETUP_PATH  where to find the setups
GPAW_MAIL        where to send emails about the jobs
===============  ===================================


Writing to files
================

Be careful when writing to files in a parallel run.  Instead of
``f = open('data', 'w')``, use:

>>> from ase.parallel import paropen
>>> f = paropen('data', 'w')

Using ``paropen``, you get a real file object on the master node and dummy
objects on the other nodes.  It is equivalent to this:

>>> from ase.parallel import world
>>> if world.rank == 0:
...     f = open('data', 'w')
... else:
...     f = open('/dev/null', 'w')

If you *really* want all nodes to write something to files, you should make
sure that the files have different names:

>>> from ase.parallel import world
>>> f = open('data.{}'.format(world.rank), 'w')


Writing text output
===================

Text output written by the ``print`` statement is written by all nodes.
To avoid this, use:

>>> from ase.parallel import parprint
>>> print('This is written by all nodes')
>>> parprint('This is written by the master only')

which is equivalent to:

>>> from ase.parallel import world
>>> print('This is written by all nodes')
>>> if world.rank == 0:
...     print('This is written by the master only')
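
Combining the two helpers, a small sketch (the file name and the messages
are arbitrary) that writes a log line from the master process only could
look like this::

    from ase.parallel import paropen, parprint, world

    parprint('Running on', world.size, 'MPI ranks')  # printed by the master only

    f = paropen('log.txt', 'w')  # real file on the master, dummy elsewhere
    f.write('world size: {}\n'.format(world.size))
    f.close()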

.. _different_calculations_in parallel:

Running different calculations in parallel
==========================================

A GPAW calculator object will per default distribute its work on all
available processes.  If you want to use several different calculators at
the same time, however, you can specify a set of processes to be used by
each calculator.  The processes are supplied to the constructor, either by
specifying an :ref:`MPI Communicator object`, or simply a list of ranks.
Thus, you may write::

    from gpaw import GPAW
    import gpaw.mpi as mpi

    # Create a calculator using ranks 0, 3 and 4 from the mpi world communicator
    ranks = [0, 3, 4]
    comm = mpi.world.new_communicator(ranks)
    if mpi.world.rank in ranks:
        calc = GPAW(communicator=comm)
        ...

Be sure to specify different output files to each calculator, otherwise
their outputs will be mixed up.

Here is an example which calculates the atomization energy of a nitrogen
molecule using two processes:

.. literalinclude:: parallel_atomization.py
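
As a minimal sketch of the same idea (this is not the script included above;
the molecules, file names and the half-and-half split of the ranks are
illustrative only), the world communicator can be split into two halves that
each run their own calculation::

    from ase.build import molecule
    from gpaw import GPAW
    import gpaw.mpi as mpi

    # First half of the ranks calculates H2O, the second half NH3:
    half = mpi.world.size // 2
    if mpi.world.rank < half:
        ranks = list(range(0, half))
        name = 'H2O'
    else:
        ranks = list(range(half, mpi.world.size))
        name = 'NH3'

    comm = mpi.world.new_communicator(ranks)
    atoms = molecule(name)
    atoms.center(vacuum=3.5)
    atoms.calc = GPAW(mode='fd',
                      communicator=comm,
                      txt=name + '.txt')  # separate output file for each group
    energy = atoms.get_potential_energy()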

.. _manual_parallelization_types:
.. _manual_parallel:

Parallelization options
=======================

In version 0.7, a new keyword called ``parallel`` was introduced to provide
a unified way of specifying parallelization-related options.  Similar to the
way we :ref:`specify convergence criteria` with the ``convergence`` keyword,
a Python dictionary is used to contain all such options in a single keyword.

The default value corresponds to this Python dictionary::

    {'kpt':                 None,
     'domain':              None,
     'band':                1,
     'order':               'kdb',
     'stridebands':         False,
     'sl_auto':             False,
     'sl_default':          None,
     'sl_diagonalize':      None,
     'sl_inverse_cholesky': None,
     'sl_lcao':             None,
     'sl_lrtddft':          None,
     'use_elpa':            False,
     'elpasolver':          '2stage',
     'buffer_size':         None}

In words:

* 'kpt' is an integer and denotes the number of groups of k-points over
  which to parallelize.  k-point parallelization is the most efficient type
  of parallelization for most systems with many electrons and/or many
  k-points.  If unspecified, the calculator will choose a parallelization
  itself which maximizes the k-point parallelization unless that leads to
  load imbalance; in that case, it may prioritize domain decomposition.
  Note: parallelization over spin is not possible in GPAW 20.10.0 and newer
  versions.

* The 'domain' value specifies either an integer n or a tuple (nx, ny, nz)
  of 3 integers for :ref:`domain decomposition <manual_parsize_domain>`.
  If not specified (i.e. None), the calculator will try to determine the
  best domain parallelization size based on the number of k-points etc.

* The 'band' value specifies the number of parallelization groups to use
  for :ref:`band parallelization <manual_parsize_bands>` and defaults to
  one, i.e. no band parallelization.

* 'order' specifies how the different parallelization modes are nested
  within the calculator's world communicator.  It must be a permutation of
  the characters 'kdb', which is the default.  The characters denote
  k-point, domain and band parallelization respectively.  The last mode
  will be assigned contiguous ranks and thus, depending on the network
  layout, probably becomes more efficient.  Usually, for static calculations
  the most efficient order is 'kdb' whereas for TDDFT it is 'kbd'.

* The 'stridebands' value only applies when band parallelization is used,
  and can be used to toggle between grouped and strided band distribution.

* If 'sl_auto' is True, ScaLAPACK will be enabled with automatically chosen
  parameters and using all available CPUs.

* The other 'sl_...' values are for using ScaLAPACK with different
  parameters in different operations.  Each can be specified as a tuple
  (m, n, mb) of 3 integers to indicate an m*n grid of CPUs and a block size
  of mb.  If any of the three latter keywords are not specified (i.e. None),
  they default to the value of 'sl_default'.  Presently,
  'sl_inverse_cholesky' must equal 'sl_diagonalize'.

* If the Elpa library is installed, enable it by setting 'use_elpa' to
  True.  Elpa will be used to diagonalize the Hamiltonian.  The Elpa
  distribution relies on BLACS and ScaLAPACK, and hence can only be used
  alongside 'sl_auto', 'sl_default', or a similar keyword.  Enabling Elpa
  is highly recommended as it significantly speeds up the diagonalization
  step.  See also :ref:`lcao`.

* 'elpasolver' indicates which solver to use with Elpa.  By default it uses
  the two-stage solver, '2stage'.  The other allowed value is '1stage'.
  This setting will only have an effect if Elpa is enabled.

* The 'buffer_size' is specified as an integer and corresponds to the size
  of the buffer in KiB used in the 1D systolic parallel matrix multiply
  algorithm.  The default value corresponds to sending all wavefunctions
  simultaneously.  A reasonable value would be the size of the largest
  cache (L2 or L3) divided by the number of MPI tasks per CPU.  Values
  larger than the default value are non-sensical and internally reset to
  the default value.

.. note::

   With the exception of 'stridebands', these parameters all have an
   equivalent command line argument which can equally well be used to
   specify these parallelization options.  Note however that the values
   explicitly given in the ``parallel`` keyword to a calculator will
   override those given via the command line.  As such, the command line
   arguments merely redefine the default values which are used in case the
   ``parallel`` keyword doesn't specifically state otherwise.
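
As an illustration, here is a hypothetical set of parallelization options;
the silicon system, the FD mode and the chosen numbers are only meant to
show the mechanics.  The product of k-point groups, band groups and domains
(here 4 * 2 * 4 = 32) must match the number of MPI processes::

    from ase.build import bulk
    from gpaw import GPAW

    atoms = bulk('Si', 'diamond', a=5.43) * (2, 2, 2)
    atoms.calc = GPAW(mode='fd',
                      kpts=(2, 2, 2),
                      symmetry='off',  # 8 k-points, divisible into 4 groups
                      nbands=80,
                      # Band parallelization is normally combined with the
                      # RMM-DIIS eigensolver:
                      eigensolver='rmm-diis',
                      parallel={'kpt': 4,             # 4 k-point groups
                                'domain': (2, 2, 1),  # 4 domains
                                'band': 2},           # 2 band groups
                      txt='si.txt')
    energy = atoms.get_potential_energy()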

.. _manual_parsize_domain:

Domain decomposition
--------------------

Any choice for the domain decomposition can be forced by specifying
'domain' in the ``parallel`` keyword.  It can be given in the form
``parallel={'domain': (nx, ny, nz)}`` to force the decomposition into nx,
ny, and nz boxes in the x, y, and z directions respectively.  Alternatively,
one may just specify the total number of domains to decompose into, leaving
it to an internal cost-minimizer algorithm to determine the number of
domains in the x, y and z directions such that the parallel efficiency is
optimal.  This is achieved by giving the domain argument as
``parallel={'domain': n}`` where n is the total number of boxes.

.. tip::
   ``parallel={'domain': world.size}`` will force all parallelization to be
   carried out solely in terms of domain decomposition, and will in general
   be much more efficient than e.g. ``parallel={'domain': (1, 1, world.size)}``.
   You might have to add ``from gpaw.mpi import world`` to the script to
   define ``world``.


.. _manual_parsize_bands:

Band parallelization
--------------------

Parallelization over Kohn-Sham orbitals (i.e. bands) becomes favorable when
the number of bands N is so large that :math:`\mathcal{O}(N^2)` operations
begin to dominate in terms of computational time.  Linear algebra for
orthonormalization and diagonalization of the wavefunctions is the most
noticeable contributor in this regime, and therefore band parallelization
can be used to distribute the computational load over several CPUs.  This is
achieved by giving the band argument as ``parallel={'band': nbg}`` where
nbg is the number of band groups to parallelize over.

.. tip::
   Whereas band parallelization in itself will reduce the amount of
   operations each CPU has to carry out to calculate e.g. the overlap
   matrix, the actual linear algebra necessary to solve such linear systems
   is in fact still done using serial LAPACK by default.  It is therefore
   advisable to use band parallelization and ScaLAPACK in conjunction to
   reduce this potential bottleneck.

More information about these topics can be found here:

.. toctree::
   :maxdepth: 1

   band_parallelization/band_parallelization

.. _manual_ScaLAPACK:

ScaLAPACK
---------

ScaLAPACK improves the performance of calculations beyond a certain size.
This size depends on whether FD, LCAO, or PW mode is used.

In FD or PW mode, ScaLAPACK operations are applied to arrays of size nbands
by nbands, whereas in LCAO mode, the arrays are generally the number of
orbitals by the number of orbitals and therefore larger, making ScaLAPACK
particularly important for LCAO calculations.

With LCAO, it starts to become an advantage to use ScaLAPACK at around 800
orbitals, which corresponds to about 50 normal (non-hydrogen, non-semicore)
atoms with a standard DZP basis set.  In FD mode, calculations with
nbands > 500 will benefit from ScaLAPACK; otherwise, the default serial
LAPACK might as well be used.

The ScaLAPACK parameters are defined using the ``parallel`` keyword
dictionary, e.g. ``sl_default=(m, n, block)``.  A block size of 64 has been
found to be a universally good choice in all modes.

In LCAO mode, it is normally best to assign as many cores as possible, which
means that m and n should multiply to the total number of cores divided by
the k-point parallelization.  For example, with 128 cores and parallelizing
by 4 over k-points, there are 32 cores per k-point available for ScaLAPACK,
and a sensible choice is m=8, n=4.  You can use ``sl_auto=True`` to make
such a choice automatically.

In FD or PW mode, a good guess for these parameters on most systems is
related to the number of bands.  We recommend for FD/PW::

  mb = 64
  m = floor(sqrt(nbands / mb))
  n = m

There are a total of four 'sl_...' keywords.  Most people will be fine just
using 'sl_default' or even 'sl_auto'.  Here, the same ScaLAPACK parameters
are used in three different places: i) the general eigensolve in the LCAO
initialization, ii) the standard eigensolve in the FD calculation, and
iii) the Cholesky decomposition in the FD calculation.  It is currently
possible to use different ScaLAPACK parameters in the LCAO initialization
and the FD calculation by using two of the ScaLAPACK keywords in tandem,
e.g.::

  GPAW(..., parallel={'sl_lcao': (p, q, pb), 'sl_default': (m, n, mb)})

where p, q, pb, m, n, and mb all have different values.  The most general
case is the combination of three ScaLAPACK keywords.  Note that some
combinations of keywords may not be supported.
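
As a sketch, an LCAO calculation around the size where ScaLAPACK starts to
pay off could be set up as follows (the silicon supercell and the other
settings are illustrative only, and the run requires GPAW compiled with
ScaLAPACK, plus Elpa if 'use_elpa' is set)::

    from ase.build import bulk
    from gpaw import GPAW

    # 54 Si atoms with a DZP basis give roughly 700 orbitals, i.e. close to
    # the size where ScaLAPACK becomes worthwhile in LCAO mode.
    atoms = bulk('Si', 'diamond', a=5.43) * (3, 3, 3)
    atoms.calc = GPAW(mode='lcao',
                      basis='dzp',
                      xc='PBE',
                      kpts=(2, 2, 2),
                      parallel={'sl_auto': True,    # automatic ScaLAPACK layout
                                'use_elpa': True},  # Elpa for the diagonalization
                      txt='si_lcao.txt')
    energy = atoms.get_potential_energy()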

.. _manual_openmp:

Hybrid OpenMP/MPI parallelization
---------------------------------

On some hardware the performance of large FD and LCAO calculations can be
improved by using OpenMP parallelization in addition to MPI.  When GPAW is
built with OpenMP support, hybrid parallelization is enabled by setting the
OMP_NUM_THREADS environment variable::

  export OMP_NUM_THREADS=4
  mpiexec -n 512 gpaw python script.py

This would run the calculation with a total of 2048 CPU cores.  As the
optimum MPI task / OpenMP thread ratio depends a lot on the particular input
and the underlying hardware, it is recommended to experiment with different
settings before production calculations.
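
To check the hybrid setup from inside a script, a small sketch like the
following can be used; it simply reads OMP_NUM_THREADS back from the
environment and prints once, from the master process::

    import os
    from ase.parallel import parprint, world

    parprint('MPI ranks:', world.size,
             '| OpenMP threads per rank:',
             os.environ.get('OMP_NUM_THREADS', '1'))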