ForceResetter and ForceContainer::sync() are Yade's bottleneck
Disclaimer: the title of this issue is true only in particular cases.
1. In MPI runs, the typical timing of one worker on 256 cores (~1e6 spheres) is as follows:
```
Worker2: ##### Worker 2 ######
Name                           Count             Time    Rel. time
------------------------------------------------------------------
ForceResetter                    354   10858713.821us       49.43%
"collider"                         0            0.0us        0.00%
"interactionLoop"                354    2174313.105us        9.90%
"isendRecvForcesRunner"          354      83541.895us        0.38%
"newton"                         354    1423152.201us        6.48%
  forces sync                    354     119666.283us        8.41%
  motion integration             354    1302690.43us        91.54%
  sync max vel                   354        352.339us        0.02%
  terminate                      354         50.07us         0.00%
  TOTAL                         1416   1422759.122us        99.97%
"sendRecvStatesRunner"           354    6910055.019us       31.46%
"waitForcesRunner"               354      12394.212us        0.06%
"collisionChecker"               354     504052.049us        2.29%
TOTAL                                 21966222.302us       100.00%
```
ForceResetter takes ~50%; with 1000 cores it would take ~87.5%. The reason is that the force and torque vectors always have size equal to the maxId of the entire simulation, so we keep resetting values that are already zero: only a fraction ~1/numProc of the container is meaningful on each worker. ForceContainer::sync() has the same issue (it is not visible in the timings above only because I already fixed it locally). A sketch of the problem is below.
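A minimal sketch of the problem, assuming a ForceContainer-like layout of plain vectors sized to maxId + 1 (the names resetFull/resetTouched and the signatures are illustrative, not Yade's actual API):

```cpp
#include <array>
#include <vector>

using Vector3r = std::array<double, 3>;

// Resetting the whole container costs O(maxId) per step, no matter how few
// bodies this MPI worker actually owns.
void resetFull(std::vector<Vector3r>& force, std::vector<Vector3r>& torque)
{
	for (size_t id = 0; id < force.size(); ++id) { // force.size() == maxId + 1
		force[id]  = { 0., 0., 0. };
		torque[id] = { 0., 0., 0. };
	}
}

// Keeping a short list of the ids this worker actually touched reduces the
// cost to roughly O(maxId / numProc) — the "short list of relevant ids"
// approach mentioned in the conclusion below.
void resetTouched(std::vector<Vector3r>& force, std::vector<Vector3r>& torque, const std::vector<size_t>& touchedIds)
{
	for (size_t id : touchedIds) {
		force[id]  = { 0., 0., 0. };
		torque[id] = { 0., 0., 0. };
	}
}
```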
2. The same issue also affects simulations (without MPI) where bodies are continuously inserted and erased, since maxId keeps growing:
```
Yade [3]: for k in range(1000):
     ...:     O.bodies.append(Body())
     ...:     O.bodies.erase(O.bodies[-1].id)
     ...:
Yade [4]: len(O.bodies)
 ->  [4]: 1000
```
and Yade has engines doing exactly such things (e.g. SphereFactory). In addition, this ever-growing maxId is effectively a memory leak, since multiple containers are sized to maxId.
3. The problem is amplified by OpenMP parallelization since 3.1) there is one force container per thread (quite a lot of mostly empty containers), and 3.2) if the OpenMP schedule is static, most threads loop over empty content and then have to wait for the few threads looping over real data (see the sketch after this list). On top of this, the cost is amplified by the fact that the force container is actually twice as long as O.bodies for no reason (in current master at least; this commit was enough to halve the costs).
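A hedged sketch of points 3.1) and 3.2), assuming one plain vector per thread and an illustrative sync() signature (not Yade's actual ForceContainer interface):

```cpp
#include <array>
#include <vector>

using Vector3r = std::array<double, 3>;

// Reduce one force container per thread into a single synced vector. Every
// per-thread container has length maxId + 1, so this costs
// O(nThreads * maxId) even when almost every entry is zero.
void sync(const std::vector<std::vector<Vector3r>>& perThreadForce, std::vector<Vector3r>& synced)
{
	const long n = (long)synced.size(); // == maxId + 1
#pragma omp parallel for schedule(static)
	for (long id = 0; id < n; ++id) {
		Vector3r sum { 0., 0., 0. };
		for (const auto& threadForce : perThreadForce) // one container per thread
			for (int k = 0; k < 3; ++k) sum[k] += threadForce[id][k];
		synced[id] = sum;
	}
	// schedule(static) assigns each thread a fixed contiguous chunk of ids;
	// when the real data sits in a narrow id range, most threads only add
	// zeros and then wait at the implicit barrier for the few doing real work.
}
```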
Conclusion: currently I am solving the problem for the MPI and MPI+OpenMP cases with a short list of relevant ids. However, I wonder if we should not roll BodyContainer back: years ago, insert() would first fill free spots instead of growing maxId again and again. It seems to be the only way to avoid 2). A sketch of that free-spot reuse is below.
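A minimal sketch of that older behaviour, assuming a simplified BodyContainer (the freeIds member and these signatures are illustrative, not the historical implementation):

```cpp
#include <deque>
#include <memory>
#include <vector>

struct Body { long id = -1; };

class BodyContainer {
	std::vector<std::shared_ptr<Body>> bodies;  // index == body id
	std::deque<long>                   freeIds; // ids released by erase()
public:
	long insert(const std::shared_ptr<Body>& b)
	{
		if (!freeIds.empty()) { // recycle a freed spot before growing
			b->id = freeIds.front();
			freeIds.pop_front();
			bodies[b->id] = b;
		} else { // grow (and thus increase maxId) only when there is no hole
			b->id = (long)bodies.size();
			bodies.push_back(b);
		}
		return b->id;
	}
	void   erase(long id) { bodies[id].reset(); freeIds.push_back(id); }
	size_t size() const { return bodies.size(); }
};
```

Under the append/erase churn shown in 2), size() would then stay at 1 instead of growing to 1000, and every container sized to maxId would stay small as well.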