Improve the threading of parallel 3D FFT
This improves the thread scaling of the parallel 3D FFT by threading the scatter routines, which are the current bottleneck. If you compile the code without threading, performance should be unchanged or marginally better. If you do use threading, the time spent in the FFT stays nearly flat across a wide range of thread counts for a fixed set of resources. When task groups are used, threading scales even better.
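As a rough illustration of what threading a scatter routine can look like, here is a minimal sketch of the pack step that copies data into a per-rank send buffer before the all-to-all exchange. It is not the code in this change: the function name `pack_sticks`, the data layout, and the assumption that `nz` divides evenly among ranks are all hypothetical. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pack step of an FFT scatter (illustration only).
 * sticks:  nsticks columns of length nz, stored one after another.
 * sendbuf: for each destination rank r, the slab of z-planes owned by r,
 *          taken from every stick and stored contiguously.
 * Assumption: nz is divisible by nranks. */
void pack_sticks(const double *sticks, double *sendbuf,
                 int nsticks, int nz, int nranks)
{
    assert(nranks > 0 && nz % nranks == 0);
    const int nz_per_rank = nz / nranks;

    /* Each (r, s) pair writes a disjoint part of sendbuf, so the two
     * outer loops can be shared among threads without synchronization. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int r = 0; r < nranks; ++r) {
        for (int s = 0; s < nsticks; ++s) {
            const double *src = sticks + (size_t)s * nz
                                       + (size_t)r * nz_per_rank;
            double *dst = sendbuf
                        + ((size_t)r * nsticks + s) * nz_per_rank;
            for (int k = 0; k < nz_per_rank; ++k)
                dst[k] = src[k];
        }
    }
}
```

Collapsing the rank and stick loops gives the threads `nranks * nsticks` independent chunks to share, which keeps the work balanced even when either count alone is small compared with the number of threads.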
A more detailed explanation is on my wiki page.