.. _parallel_runs:

=============
Parallel runs
=============

.. toctree::
   :maxdepth: 1

.. _parallel_running_jobs:

Running jobs in parallel
========================

Parallel calculations are done primarily with MPI.
The parallelization can be done over the **k**-points, bands,
and using real-space domain decomposition.
The code will try to make a sensible domain
decomposition that matches both the number of processors and the size of
the unit cell.  This choice can be overruled, see
:ref:`manual_parallelization_types`. Complementary OpenMP
parallelization can improve the performance in some cases, see
:ref:`manual_openmp`.

Before starting a parallel calculation, it might be useful to check how the
parallelization corresponding to the given number of processes would be done
with the ``--dry-run=N`` command line option::

    $ gpaw python --dry-run=8 script.py

The output will also contain the "Calculator" RAM memory estimate per process.

In order to run GPAW in parallel, you
do one of the following::

    $ mpiexec -n <cores> gpaw python script.py
    $ gpaw -P <cores> python script.py
    $ mpiexec -n <cores> python3 script.py

The first two are the recommended ones:  The *gpaw* script will make sure
that imports are done in an efficient way.


Submitting a job to a queuing system
====================================

You can write a shell-script that contains this line::

    mpiexec gpaw python script.py

and then submit that with ``sbatch``, ``qsub`` or some other command.

Alternatives:

* If you are on a SLURM system:  use the :ref:`sbatch <cli>` sub-command
  of the ``gpaw`` command-line tool::

      $ gpaw sbatch -- [sbatch options] script.py [script options]

* Use MyQueue_::

      $ mq submit "script.py [script options]" -R <resources>

* Write your own *submit* script.  See this example:
  :git:`doc/platforms/gbar/qsub.py`.

.. _MyQueue: https://myqueue.readthedocs.io/


Alternative submit tool
=======================

Alternatively, the ``gpaw-runscript`` script can be used; try::

  $ gpaw-runscript -h

to list the implemented architectures and the available options. As an
example, use::

  $ gpaw-runscript script.py 32

to write a job submission script that runs script.py on 32 CPUs.

The tool tries to guess the architecture/host automatically.

By default it uses the following environment variables to write the runscript:

=============== ===================================
variable        meaning
=============== ===================================
HOSTNAME        name used to assign host type
PYTHONPATH      path for Python
GPAW_SETUP_PATH where to find the setups
GPAW_MAIL       where to send emails about the jobs
=============== ===================================

Writing to files
================

Be careful when writing to files in a parallel run.  Instead of ``f =
open('data', 'w')``, use:

>>> from ase.parallel import paropen
>>> f = paropen('data', 'w')

Using ``paropen``, you get a real file object on the master node, and dummy
objects on the slaves.  It is equivalent to this:

>>> from ase.parallel import world
>>> if world.rank == 0:
...     f = open('data', 'w')
... else:
...     f = open('/dev/null', 'w')

If you *really* want all nodes to write something to files, you should make
sure that the files have different names:

>>> from ase.parallel import world
>>> f = open('data.{}'.format(world.rank), 'w')

Writing text output
===================

Text output written by the ``print`` function is written by all nodes.
To avoid this, use:

>>> from ase.parallel import parprint
>>> print('This is written by all nodes')
>>> parprint('This is written by the master only')

which is equivalent to

>>> from ase.parallel import world
>>> print('This is written by all nodes')
>>> if world.rank == 0:
...     print('This is written by the master only')

.. _different_calculations_in parallel:

Running different calculations in parallel
==========================================

A GPAW calculator object will by default distribute its work on all
available processes. If you want to use several different calculators
at the same time, however, you can specify a set of processes to be
used by each calculator. The processes are supplied to the constructor,
either by specifying an :ref:`MPI Communicator object <communicators>`,
or simply a list of ranks. Thus, you may write::

  from gpaw import GPAW
  import gpaw.mpi as mpi

  # Create a calculator using ranks 0, 3 and 4 from the mpi world communicator
  ranks = [0, 3, 4]
  comm = mpi.world.new_communicator(ranks)
  if mpi.world.rank in ranks:
      calc = GPAW(communicator=comm)
      ...

Be sure to specify different output files to each calculator,
otherwise their outputs will be mixed up.
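
A minimal sketch of this pattern (the molecules, the ``'fd'`` mode and the
file names below are illustrative choices only) splits the world
communicator into two halves, each running its own calculator with its own
output file::

  from ase.build import molecule
  from gpaw import GPAW
  import gpaw.mpi as mpi

  # Split the world communicator into two halves (assumes an even world size)
  half = mpi.world.size // 2
  if mpi.world.rank < half:
      ranks = list(range(half))
      atoms = molecule('H2', vacuum=3.0)
  else:
      ranks = list(range(half, mpi.world.size))
      atoms = molecule('N2', vacuum=3.0)

  comm = mpi.world.new_communicator(ranks)

  # Each group gets its own text file so the outputs are not mixed up
  atoms.calc = GPAW(mode='fd', communicator=comm,
                    txt='group_{}.txt'.format(ranks[0]))
  atoms.get_potential_energy()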

Here is an example which calculates the atomization energy of a
nitrogen molecule using two processes:

.. literalinclude:: parallel_atomization.py

.. _manual_parallelization_types:
.. _manual_parallel:

Parallelization options
=======================

In version 0.7, a new keyword called ``parallel`` was introduced to provide
a unified way of specifying parallelization-related options. Similar to
the way we :ref:`specify convergence criteria <manual_convergence>` with the
``convergence`` keyword, a Python dictionary is used to contain all such
options in a single keyword.

The default value corresponds to this Python dictionary::

  {'kpt':                 None,
   'domain':              None,
   'band':                1,
   'order':               'kdb',
   'stridebands':         False,
   'sl_auto':             False,
   'sl_default':          None,
   'sl_diagonalize':      None,
   'sl_inverse_cholesky': None,
   'sl_lcao':             None,
   'sl_lrtddft':          None,
   'use_elpa':            False,
   'elpasolver':          '2stage',
   'buffer_size':         None}

In words:

* ``'kpt'`` is an integer and denotes the number of groups of k-points over
  which to parallelize.  k-point parallelization is the most efficient type of
  parallelization for most systems with many electrons and/or many k-points. If
  unspecified, the calculator will choose a parallelization itself which
  maximizes the k-point parallelization unless that leads to load imbalance; in
  that case, it may prioritize domain decomposition.

  Note: parallelization over spin is not possible in
  :ref:`GPAW 20.10.0 and newer versions <releasenotes>`.

* The ``'domain'`` value specifies either an integer ``n`` or a tuple
  ``(nx,ny,nz)`` of 3 integers for
  :ref:`domain decomposition <manual_parsize_domain>`.
  If not specified (i.e. ``None``), the calculator will try to determine the
  best domain parallelization size based on the number of k-points etc.

* The ``'band'`` value specifies the number of parallelization groups to use
  for :ref:`band parallelization <manual_parsize_bands>` and defaults to one,
  i.e. no band parallelization.

* ``'order'`` specifies how different parallelization modes are nested
  within the calculator's world communicator.  Must be a permutation
  of the characters ``'kdb'`` which is the default.  The characters
  denote k-point, domain or band parallelization respectively.  The
  last mode will be assigned contiguous ranks and thus, depending on
  network layout, probably becomes more efficient.  Usually for static
  calculations the most efficient order is ``'kdb'`` whereas for TDDFT
  it is ``'kbd'``.

* The ``'stridebands'`` value only applies when band parallelization is used,
  and can be used to toggle between grouped and strided band distribution.

* If ``'sl_auto'`` is ``True``, ScaLAPACK will be enabled with automatically
  chosen parameters and using all available CPUs.

* The other ``'sl_...'`` values are for using ScaLAPACK with different
  parameters in different operations.
  Each can be specified as a tuple ``(m,n,mb)`` of 3 integers to
  indicate an ``m*n`` grid of CPUs and a block size of ``mb``.
  If any of the three latter keywords are not
  specified (i.e. ``None``), they default to the value of
  ``'sl_default'``. Presently, ``'sl_inverse_cholesky'`` must equal
  ``'sl_diagonalize'``.

* If the Elpa library is installed, enable it by setting ``use_elpa``
  to ``True``.  Elpa will be used to diagonalize the Hamiltonian.  The
  Elpa distribution relies on BLACS and ScaLAPACK, and hence can only
  be used alongside ``sl_auto``, ``sl_default``, or a similar keyword.
  Enabling Elpa is highly recommended as it significantly
  speeds up the diagonalization step.  See also :ref:`lcao`.

* ``elpasolver`` indicates which solver to use with Elpa.  By default
  it uses the two-stage solver, ``'2stage'``.  The other allowed value
  is ``'1stage'``.  This setting will only have effect if Elpa is enabled.

* The ``'buffer_size'`` is specified as an integer and corresponds to
  the size of the buffer in KiB used in the 1D systolic parallel
  matrix multiply algorithm. The default value corresponds to sending all
  wavefunctions simultaneously. A reasonable value would be the size
  of the largest cache (L2 or L3) divided by the number of MPI tasks
  per CPU. Values larger than the default are nonsensical and are
  internally reset to the default value.

.. note::
   With the exception of ``'stridebands'``, these parameters all have an
   equivalent command line argument which can equally well be used to specify
   these parallelization options. Note however that the values explicitly given
   in the ``parallel`` keyword to a calculator will override those given via
   the command line. The command line arguments thus merely redefine
   the default values which are used in case the ``parallel`` keyword doesn't
   specifically state otherwise.
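
As an illustration (a sketch only; the plane-wave cutoff, the k-point mesh
and the 8 x 2 split over k-points and bands are arbitrary choices for a
hypothetical 16-core run)::

  from ase.build import bulk
  from gpaw import GPAW, PW

  atoms = bulk('Si')
  atoms.calc = GPAW(mode=PW(400),
                    kpts=(8, 8, 8),
                    # 8 k-point groups x 2 band groups = 16 MPI ranks
                    parallel={'kpt': 8,
                              'band': 2,
                              'sl_auto': True},
                    txt='si.txt')
  atoms.get_potential_energy()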

.. _manual_parsize_domain:

Domain decomposition
--------------------

Any choice for the domain decomposition can be forced by specifying
``domain`` in the ``parallel`` keyword. It can be given in the form
``parallel={'domain': (nx,ny,nz)}`` to force the decomposition into ``nx``,
``ny``, and ``nz`` boxes in x, y, and z direction respectively. Alternatively,
one may just specify the total number of domains to decompose into, leaving
it to an internal cost-minimizer algorithm to determine the number of domains
in the x, y and z directions such that parallel efficiency is optimal. This
is achieved by giving the ``domain`` argument as ``parallel={'domain': n}``
where ``n`` is the total number of boxes.

.. tip::
   ``parallel={'domain': world.size}`` will force all parallelization to be
   carried out solely in terms of domain decomposition, and will in general
   be much more efficient than e.g. ``parallel={'domain': (1,1,world.size)}``.
   You might have to add ``from gpaw.mpi import world`` to the script to
   define ``world``.
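
As a minimal sketch (the molecule, grid spacing and box counts are
arbitrary), a decomposition into 2 x 2 x 2 = 8 domains for a run on 8
cores could be requested like this::

  from ase.build import molecule
  from gpaw import GPAW

  atoms = molecule('H2O', vacuum=4.0)
  atoms.calc = GPAW(mode='fd', h=0.2,
                    # force 2 boxes in each of the x, y and z directions
                    parallel={'domain': (2, 2, 2)},
                    txt='h2o.txt')
  atoms.get_potential_energy()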

.. _manual_parsize_bands:

Band parallelization
--------------------

Parallelization over Kohn-Sham orbitals (i.e. bands) becomes favorable when
the number of bands `N` is so large that `\mathcal{O}(N^2)`
operations begin to dominate in terms of computational time. Linear algebra
for orthonormalization and diagonalization of the wavefunctions is the most
noticeable contributor in this regime, and therefore, band parallelization
can be used to distribute the computational load over several CPUs. This
is achieved by giving the ``band`` argument as ``parallel={'band': nbg}``
where ``nbg`` is the number of band groups to parallelize over.

.. tip::
   Whereas band parallelization in itself will reduce the amount of operations
   each CPU has to carry out to calculate e.g. the overlap matrix, the actual
   linear algebra necessary to solve such linear systems is in fact still
   done using serial LAPACK by default. It is therefore advisable to use both
   band parallelization and ScaLAPACK in conjunction to reduce this
   potential bottleneck.
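
A minimal sketch (the band count and the number of band groups are
arbitrary) combining band parallelization with automatically chosen
ScaLAPACK parameters, as advised above::

  from gpaw import GPAW

  calc = GPAW(mode='fd',
              nbands=1000,
              # 4 band groups; ScaLAPACK grid chosen automatically
              parallel={'band': 4, 'sl_auto': True},
              txt='many_bands.txt')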

More information about these topics can be found here:

.. toctree::
   :maxdepth: 1

   band_parallelization/band_parallelization

.. _manual_ScaLAPACK:

ScaLAPACK
---------

ScaLAPACK improves performance of calculations beyond a certain size.
This size depends on whether FD, LCAO, or PW mode is used.

In FD or PW mode, ScaLAPACK operations are applied to arrays of size
nbands by nbands, whereas in LCAO mode, the arrays are generally the
number of orbitals by the number of orbitals and therefore larger,
making ScaLAPACK particularly important for LCAO calculations.

With LCAO, it starts to become an advantage to use ScaLAPACK at around
800 orbitals, which corresponds to about 50 normal (non-hydrogen,
non-semicore) atoms with a standard DZP basis set.
In FD mode, calculations with nbands > 500 will
benefit from ScaLAPACK; otherwise, the default serial LAPACK might as
well be used.

The ScaLAPACK parameters
are defined using the ``parallel``
keyword dictionary, e.g. ``'sl_default': (m, n, mb)``.

A block size of 64 has been found to be a universally good choice
in all modes.

In LCAO mode, it is normally best to assign as many cores as possible,
which means that ``m`` and ``n`` should multiply to the total number of cores
divided by the k-point parallelization.

For example, with 128 cores and parallelizing by 4 over k-points,
there are 32 cores available per ScaLAPACK grid, and a sensible
choice is ``m=8``, ``n=4``.  You can use ``sl_auto=True`` to make
such a choice automatically.

In FD or PW mode, a good guess for these
parameters on most systems is related to the number of bands.
We recommend for FD/PW::

  from math import floor, sqrt

  mb = 64                        # ScaLAPACK block size
  m = floor(sqrt(nbands / mb))   # nbands: number of bands in the calculation
  n = m
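
For instance (a sketch only; the system, cutoff and band count are
arbitrary), these numbers can be computed in the script and passed on
through ``'sl_default'``::

  from math import floor, sqrt

  from ase.build import bulk
  from gpaw import GPAW, PW

  nbands = 1024
  mb = 64
  m = floor(sqrt(nbands / mb))   # 4 for 1024 bands
  n = m

  atoms = bulk('Cu') * (3, 3, 3)
  atoms.calc = GPAW(mode=PW(400),
                    nbands=nbands,
                    parallel={'sl_default': (m, n, mb)},
                    txt='cu.txt')
  atoms.get_potential_energy()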

There are several ``'sl_...'`` keywords. Most people will be
fine just using ``'sl_default'`` or even ``'sl_auto'``. Here we use the same
ScaLAPACK parameters in three different places: i) general eigensolve
in the LCAO initialization, ii) standard eigensolve in the FD calculation, and
iii) Cholesky decomposition in the FD calculation. It is currently
possible to use different ScaLAPACK parameters in the LCAO
initialization and the FD calculation by using two of the ScaLAPACK
keywords in tandem, e.g.::

  GPAW(..., parallel={'sl_lcao': (p, q, pb), 'sl_default': (m, n, mb)})

where ``p``, ``q``, ``pb``, ``m``, ``n``, and ``mb`` all
have different values. The most general case is the combination
of three ScaLAPACK keywords.
Note that some combinations of keywords may not be supported.


.. _manual_openmp:

Hybrid OpenMP/MPI parallelization
---------------------------------

On some hardware the performance of large FD and LCAO calculations
can be improved by using OpenMP parallelization in addition to
MPI. When GPAW is built with OpenMP support, hybrid parallelization
is enabled by setting the OMP_NUM_THREADS environment variable::

  export OMP_NUM_THREADS=4
  mpiexec -n 512 gpaw python script.py

This would run the calculation with a total of 2048 CPU cores. As the
optimum MPI task / OpenMP thread ratio depends a lot on the particular
input and underlying hardware, it is recommended to experiment with
different settings before production calculations.