ForceResetter and ForceContainer::sync() are Yade's bottleneck
Disclaimer: the title of this issue is true only in particular cases.
1. In MPI runs, the typical timing of one worker on 256 cores (~1e6 spheres) is as follows:
```
Worker2: ##### Worker 2 ######
Name                           Count             Time    Rel. time
------------------------------------------------------------------
ForceResetter                    354   10858713.821us       49.43%
"collider"                         0            0.0us        0.00%
"interactionLoop"                354    2174313.105us        9.90%
"isendRecvForcesRunner"          354      83541.895us        0.38%
"newton"                         354    1423152.201us        6.48%
  forces sync                    354     119666.283us        8.41%
  motion integration             354    1302690.43us        91.54%
  sync max vel                   354        352.339us        0.02%
  terminate                      354         50.07us         0.00%
  TOTAL                         1416   1422759.122us        99.97%
"sendRecvStatesRunner"           354    6910055.019us       31.46%
"waitForcesRunner"               354      12394.212us        0.06%
"collisionChecker"               354     504052.049us        2.29%
TOTAL                                 21966222.302us       100.00%
```
ForceResetter takes ~50%; with 1000 cores it would take ~87.5%. The reason is that the force and torque vectors always have size equal to the maxId of the entire simulation, so we keep resetting values that are already zero: only a fraction ~1/numProc of the container is meaningful on each worker. ForceContainer::sync() has the same issue (it is not visible in the timings above only because I already fixed it locally). A sketch of the problem is below.
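A minimal sketch of the problem, assuming a ForceContainer-like layout of plain vectors sized to maxId + 1 (the names resetFull/resetTouched and the signatures are illustrative, not Yade's actual API):

```cpp
#include <array>
#include <vector>

using Vector3r = std::array<double, 3>;

// Resetting the whole container costs O(maxId) per step, no matter how few
// bodies this MPI worker actually owns.
void resetFull(std::vector<Vector3r>& force, std::vector<Vector3r>& torque)
{
	for (size_t id = 0; id < force.size(); ++id) { // force.size() == maxId + 1
		force[id]  = { 0., 0., 0. };
		torque[id] = { 0., 0., 0. };
	}
}

// Keeping a short list of the ids this worker actually touched reduces the
// cost to roughly O(maxId / numProc) — the "short list of relevant ids"
// approach mentioned in the conclusion below.
void resetTouched(std::vector<Vector3r>& force, std::vector<Vector3r>& torque, const std::vector<size_t>& touchedIds)
{
	for (size_t id : touchedIds) {
		force[id]  = { 0., 0., 0. };
		torque[id] = { 0., 0., 0. };
	}
}
```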
2. The same issue also affects simulations (without MPI) where bodies are continuously inserted and erased, since maxId keeps growing:
```
Yade [3]: for k in range(1000):
     ...:     O.bodies.append(Body())
     ...:     O.bodies.erase(O.bodies[-1].id)
     ...:
Yade [4]: len(O.bodies)
 ->  [4]: 1000
```
and Yade has engines doing exactly such things (e.g. SphereFactory). In addition, this ever-growing maxId is effectively a memory leak, since multiple containers are sized to maxId.
3. The problem is amplified by OpenMP parallelization since 3.1) there is one force container per thread (quite a lot of mostly empty containers), and 3.2) if the OpenMP schedule is static, most threads loop over empty content and then have to wait for the few threads looping over real data (see the sketch after this list). On top of this, the cost is amplified by the fact that the force container is actually twice as long as O.bodies for no reason (in current master at least; this commit was enough to halve the costs).
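A hedged sketch of points 3.1) and 3.2), assuming one plain vector per thread and an illustrative sync() signature (not Yade's actual ForceContainer interface):

```cpp
#include <array>
#include <vector>

using Vector3r = std::array<double, 3>;

// Reduce one force container per thread into a single synced vector. Every
// per-thread container has length maxId + 1, so this costs
// O(nThreads * maxId) even when almost every entry is zero.
void sync(const std::vector<std::vector<Vector3r>>& perThreadForce, std::vector<Vector3r>& synced)
{
	const long n = (long)synced.size(); // == maxId + 1
#pragma omp parallel for schedule(static)
	for (long id = 0; id < n; ++id) {
		Vector3r sum { 0., 0., 0. };
		for (const auto& threadForce : perThreadForce) // one container per thread
			for (int k = 0; k < 3; ++k) sum[k] += threadForce[id][k];
		synced[id] = sum;
	}
	// schedule(static) assigns each thread a fixed contiguous chunk of ids;
	// when the real data sits in a narrow id range, most threads only add
	// zeros and then wait at the implicit barrier for the few doing real work.
}
```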
Conclusion: currently I am solving the problem for the MPI and MPI+OpenMP cases with a short list of relevant ids. However, I wonder if we should not roll BodyContainer back: years ago, insert() would first fill free spots instead of growing maxId again and again. It seems to be the only way to avoid 2). A sketch of that free-spot reuse is below.
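A minimal sketch of that older behaviour, assuming a simplified BodyContainer (the freeIds member and these signatures are illustrative, not the historical implementation):

```cpp
#include <deque>
#include <memory>
#include <vector>

struct Body { long id = -1; };

class BodyContainer {
	std::vector<std::shared_ptr<Body>> bodies;  // index == body id
	std::deque<long>                   freeIds; // ids released by erase()
public:
	long insert(const std::shared_ptr<Body>& b)
	{
		if (!freeIds.empty()) { // recycle a freed spot before growing
			b->id = freeIds.front();
			freeIds.pop_front();
			bodies[b->id] = b;
		} else { // grow (and thus increase maxId) only when there is no hole
			b->id = (long)bodies.size();
			bodies.push_back(b);
		}
		return b->id;
	}
	void   erase(long id) { bodies[id].reset(); freeIds.push_back(id); }
	size_t size() const { return bodies.size(); }
};
```

Under the append/erase churn shown in 2), size() would then stay at 1 instead of growing to 1000, and every container sized to maxId would stay small as well.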