Also measure start and end of the whole step

Currently we measure the start and end of just method calls. But we could also measure the start and end of the whole step. Then, by summing the durations of the method calls and comparing that sum with the duration of the whole step, we can determine the runtime overhead of the step (and similarly of a pipeline run).
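A minimal sketch of the idea in Python, not the runtime's actual code. `StepTimer` and its methods are hypothetical names made up for illustration, and the sketch assumes method calls within a step run one after another and do not overlap or nest:

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: times a whole step and the method calls inside it."""

    def __init__(self):
        self.step_start = None
        self.step_end = None
        self.method_calls = []  # (name, start, end) tuples

    @contextmanager
    def step(self):
        # Record start and end of the whole step.
        self.step_start = time.perf_counter()
        try:
            yield self
        finally:
            self.step_end = time.perf_counter()

    @contextmanager
    def method_call(self, name):
        # Record start and end of a single method call.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.method_calls.append((name, start, time.perf_counter()))

    def overhead(self):
        # Overhead = whole-step duration minus the summed method call durations.
        step_duration = self.step_end - self.step_start
        method_duration = sum(end - start for _, start, end in self.method_calls)
        return step_duration - method_duration


timer = StepTimer()
with timer.step():
    time.sleep(0.01)  # stands in for runtime work outside any method call
    with timer.method_call("fit"):
        time.sleep(0.01)
    with timer.method_call("produce"):
        time.sleep(0.01)

print(f"overhead: {timer.overhead():.4f}s")
```

Anything the runtime does inside the step but outside a method call ends up in the overhead figure, which is the quantity of interest here.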

This would depend on the particular runtime, though, so information about how large the overhead is might not translate to some other runtime.

This could answer the question of how to measure the time/overhead of preparing/cloning primitives for hyper-parameter values and other similar cases, where two primitives may differ in how much time they require for that. See #259 (closed).

Edited by Mitar