
Improve EXX USPP threading and vectorization

Ye Luo requested to merge ye-luo/q-e:exx-thread-vectorization into develop

The USPP EXX code path with tqr=.false. is now much faster. Even for a small system such as uspp-hyb-k.in from the test suite, the walltime drops by half. Both the threading and the vectorization of addusxx_g and newdxx_g are improved.

All inner-loop threading has been replaced. The loop over g-vectors is chunked into blocks of size 256, and the loop over chunks is lifted outward. This modification introduces cache blocking, and vectorization works well within a chunk. In addusxx_g, the g-vectors are all independent, so the loop over chunks is threaded. In newdxx_g, the dot product does not allow threading over chunks without synchronization, but the loop over atoms of the same species can be threaded. In principle, we could thread over all atoms directly instead of species by species, but the thread load balance may then depend on the input file. For example, with 2 H2O molecules, the atom orderings HHHHOO and HOHHOH may have very different performance characteristics depending on the OpenMP runtime scheduling. The parallel region now covers almost the whole routine, and the fork/join cost is negligible.
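
To make the two threading patterns concrete, here is a minimal sketch in C with OpenMP pragmas. It is not the Fortran code in this merge request: the function names, argument lists, and the simplified arithmetic are all assumptions chosen for illustration. Only the chunk size of 256 and the overall loop arrangement are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 256 /* chunk size quoted in the description */

/* Pattern used for independent g-vectors (addusxx_g-like, hypothetical
 * simplification): chunks are distributed over threads, and the inner
 * unit-stride loop over one chunk is left to the vectorizer. */
void add_chunked(int ngm, const double *restrict q, double coeff,
                 double *restrict rho)
{
    assert(ngm >= 0);
    int nchunks = (ngm + CHUNK - 1) / CHUNK;

    #pragma omp parallel for schedule(static)
    for (int ic = 0; ic < nchunks; ++ic) {
        int start = ic * CHUNK;
        int end = (start + CHUNK < ngm) ? start + CHUNK : ngm;

        #pragma omp simd
        for (int ig = start; ig < end; ++ig)
            rho[ig] += coeff * q[ig];
    }
}

/* Pattern used for dot products (newdxx_g-like, hypothetical
 * simplification): one parallel region covers the whole routine, every
 * thread walks the chunk loop, and the atoms ia_first..ia_last-1 of one
 * species are shared among threads. Each d[ia] is updated by a single
 * thread per chunk, so no cross-thread reduction is needed.
 * v is row-major with shape [natoms][ngm]. */
void dot_per_atom(int ia_first, int ia_last, int ngm,
                  const double *restrict v, const double *restrict q,
                  double *restrict d)
{
    assert(ngm >= 0 && ia_first <= ia_last);

    #pragma omp parallel
    for (int start = 0; start < ngm; start += CHUNK) {
        int end = (start + CHUNK < ngm) ? start + CHUNK : ngm;

        /* The implicit barrier at the end of this worksharing loop keeps
         * updates to d[ia] from different chunks ordered. */
        #pragma omp for schedule(static)
        for (int ia = ia_first; ia < ia_last; ++ia) {
            double acc = 0.0;

            #pragma omp simd reduction(+:acc)
            for (int ig = start; ig < end; ++ig)
                acc += v[(size_t)ia * ngm + ig] * q[ig];

            d[ia] += acc;
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and both functions run serially with the same results, which makes the blocking easy to check separately from the threading.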

To thread over atoms of the same species, the grouping information for the atoms must be available; however, only CP sets it up, not PW. So I added sort_specie to ions_base and call it in read_cards_pw. My knowledge of the input file handling is very limited, and what I did may be suboptimal. Please let me know if other places need to call sort_species. I'm thinking of restart and of other codes/plug-ins in QE that may need to call vexx.
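
As an illustration of what "grouping information" means here, the following C sketch builds an atom ordering grouped by species together with per-species offsets. It is a hypothetical stand-in, not the routine added to ions_base: the name group_by_species, the 0-based species indices, and the output layout are all assumptions.

```c
#include <assert.h>

/* Hypothetical helper: given the species index ityp[ia] (0..nsp-1) of
 * each of the nat atoms, fill
 *   order[0..nat-1]  atom indices, grouped by species, original order
 *                    preserved inside each group;
 *   first[0..nsp]    offsets into order, so that species is occupies
 *                    order[first[is]] .. order[first[is+1]-1]. */
void group_by_species(int nat, int nsp, const int *ityp,
                      int *order, int *first)
{
    assert(nat >= 0 && nsp > 0);

    int pos = 0;
    for (int is = 0; is < nsp; ++is) {
        first[is] = pos;
        for (int ia = 0; ia < nat; ++ia)
            if (ityp[ia] == is)
                order[pos++] = ia;
    }
    first[nsp] = pos;
}
```

For the ordering HOHHOH with H mapped to 0 and O mapped to 1, ityp is {0,1,0,0,1,0}, and the helper yields order = {0,2,3,5,1,4} and first = {0,4,6}. A loop over order[first[is]] .. order[first[is+1]-1] then visits only atoms of species is, which is the kind of range a species-wise threaded loop needs.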
