c2db: More intelligent parallelization

Currently we use the default parallelization of GPAW. The heuristic was written back when 16 cores was a lot.

Sometimes it chooses a relatively bad parallelization (e.g. very little kpoint parallelization).

What we should do is write a function that returns a reasonable parallelization for all materials. The function can do a "dry run" using gpaw.calcinfo to see what symmetry reduction/kpoints GPAW would choose, then decide the parallelization based on how many cores are available.
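A minimal sketch of what the decision step could look like, in Python. It assumes the dry run has already given us the number of irreducible kpoints and spins; how to get those from gpaw.calcinfo is not shown. The function name, the min_efficiency threshold and the "largest kpoint group that wastes little" rule are all assumptions for illustration, not existing GPAW code. The returned keys follow the 'kpt'/'domain'/'band' convention of GPAW's parallel keyword.

```python
from math import ceil


def kpt_efficiency(nks, kpt_size):
    """Fraction of useful work when nks (kpoint, spin) pairs are
    spread over kpt_size groups; 1.0 means no group sits idle."""
    return nks / (kpt_size * ceil(nks / kpt_size))


def choose_parallelization(nibzkpts, nspins, ncores, min_efficiency=0.8):
    """Hypothetical heuristic: prefer kpoint parallelization.

    Picks the largest kpoint group count that divides ncores and
    keeps the load-balance efficiency above min_efficiency, then
    gives the remaining cores to domain decomposition.
    """
    nks = nibzkpts * nspins
    best = 1
    for kpt_size in range(1, min(nks, ncores) + 1):
        if ncores % kpt_size != 0:
            continue  # must divide the core count evenly
        if kpt_efficiency(nks, kpt_size) >= min_efficiency:
            best = kpt_size
    # Band parallelization is left off here (assumption: secondary).
    return {'kpt': best, 'domain': ncores // best, 'band': 1}
```

For example, with 7 irreducible kpoints, one spin and 24 cores this returns {'kpt': 4, 'domain': 6, 'band': 1}: 6 kpoint groups would leave the cores idle too often (efficiency 7/12), so the rest goes to domain decomposition. The threshold is the knob that encodes the kpoint vs domain priority.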

Alternatively, we can add a choose_parallelization() callback to GPAW so that it calls our implementation as part of its normal setup (that's probably worth having in general, even though the calcinfo functionality makes callbacks less necessary).

Considerations:

Priority of kpoint parallelization (including the waste when kpoints don't divide evenly over cores) vs domain parallelization? This is the main question since currently we sometimes get very little kpoint parallelization.
Enable some band parallelization if there are many cores compared to problem size? This is secondary.
Enable scalapack? (E.g. for Davidson) Probably not important for C2DB.
Enable augment_grids? (I keep forgetting: what do we have in terms of augment_grids in new GPAW?)