Reproducibility: differences if primitives are pickled between run phases, or if get_params/set_params is used instead
We are trying to ensure full reproducibility of pipeline runs, with the idea that somebody else can rerun a pipeline and get the same result. But there are edge cases in capturing randomness that we should consider and decide how to handle. The basic idea is simple: we provide a random seed to the primitive, and the primitive should base any (pseudo)randomness on it.
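A minimal sketch of that basic idea, under stated assumptions: `SeededPrimitive` is a hypothetical class, not the real primitive interface, and the standard library's `random.Random` stands in for `numpy.random.RandomState` so that the example has no dependencies. Two instances created with the same seed and driven through the same calls yield the same values.

```python
import random


class SeededPrimitive:
    """Hypothetical primitive; not the real primitive interface."""

    def __init__(self, random_seed=0):
        self.random_seed = random_seed
        # Stand-in for numpy.random.RandomState(random_seed).
        self._rng = random.Random(random_seed)

    def produce(self, n):
        # Every call advances the generator by n draws.
        return [self._rng.random() for _ in range(n)]


first = SeededPrimitive(random_seed=42)
second = SeededPrimitive(random_seed=42)

# Same seed and same sequence of calls: identical values.
run_a = first.produce(3) + first.produce(3)
run_b = second.produce(3) + second.produce(3)
print(run_a == run_b)
```

This holds only as long as nothing resets or skips the generator between calls.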
But the issue arises when primitives (or the pipeline/runtime) get pickled and unpickled during pipeline execution, either inside the same phase (because the runtime tries to cache/reuse primitives) or between the fit and produce phases: how do we ensure that in all those cases the random state is advanced in the same way? There are currently different ways to save and restore primitive instances, and each of those ways can be implemented differently. Consider a primitive which, as we suggest in docstrings, uses the provided `random_seed` constructor argument in `numpy.random.RandomState(random_seed)` to obtain a random generator instance it can later use whenever it needs the next random value.
- We could use pickling to save and restore a primitive instance, but the primitive's implementation might not save the `RandomState` state during pickling. In that case, when restoring the instance, `RandomState` would be re-initialized to the initial state given by the original `random_seed` argument. So every time you save and restore using pickling, the progression of random values differs from what it would be had the saving and restoring not happened.
- Because of this, it is suggested in docstrings that you should also add the `RandomState` state to the pickled state. This is demonstrated in the random forest common primitive, where we store the random state in addition to the rest of the state. In this case pickling and unpickling works.
- But the problem with pickling is that it works only inside the same environment, see #159 (closed). The `volumes` and `temporary_directory` constructor arguments do not work in some other environment. So it is suggested that one should instead manually create a primitive instance with updated argument values, and then use `set_params` to restore the primitive's state obtained with `get_params`. But this again means the random state is re-initialized at that point.
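The first two cases can be illustrated with a small sketch. The class names here are hypothetical, and `random.Random` with `getstate`/`setstate` is used as a dependency-free stand-in for `numpy.random.RandomState`; this is not the code of the random forest primitive mentioned above.

```python
import pickle
import random


class NaivePrimitive:
    """Hypothetical primitive whose pickled state omits the generator."""

    def __init__(self, random_seed=0):
        self.random_seed = random_seed
        self._rng = random.Random(random_seed)

    def next_value(self):
        return self._rng.random()

    def __getstate__(self):
        # Only the seed is saved; the generator's position is lost.
        return {"random_seed": self.random_seed}

    def __setstate__(self, state):
        self.random_seed = state["random_seed"]
        # The generator restarts from the original seed.
        self._rng = random.Random(self.random_seed)


class StatefulPrimitive(NaivePrimitive):
    """Same primitive, with the generator state added to the pickled state."""

    def __getstate__(self):
        return {"random_seed": self.random_seed,
                "rng_state": self._rng.getstate()}

    def __setstate__(self, state):
        self.random_seed = state["random_seed"]
        self._rng = random.Random(self.random_seed)
        self._rng.setstate(state["rng_state"])


def run(cls, roundtrip):
    """Draw one value, optionally pickle/unpickle, then draw another."""
    primitive = cls(random_seed=42)
    values = [primitive.next_value()]
    if roundtrip:
        primitive = pickle.loads(pickle.dumps(primitive))
    values.append(primitive.next_value())
    return values


baseline = run(NaivePrimitive, roundtrip=False)
naive = run(NaivePrimitive, roundtrip=True)
stateful = run(StatefulPrimitive, roundtrip=True)

print(naive == baseline)     # False: the sequence restarted after unpickling
print(naive[0] == naive[1])  # True: the first value is drawn twice
print(stateful == baseline)  # True: the sequence continues where it left off
```

In the naive variant the save/restore is observable in the output, which is exactly what breaks reproducibility when the runtime decides on its own whether to pickle.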
We could store the random state as part of Params to address this issue, but the problem is that the random state is not really a "fitted" state of a primitive. Or is it?
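To make the trade-off concrete, here is a hedged sketch of what carrying the random state in Params could look like. Everything in it is an assumption made for illustration: the `Primitive` class, the `include_rng_state` flag, the use of a plain dict as Params, and `random.Random` as a stand-in for `numpy.random.RandomState`.

```python
import random


class Primitive:
    """Hypothetical primitive with a get_params/set_params pair."""

    def __init__(self, random_seed=0, include_rng_state=False):
        self.random_seed = random_seed
        self.include_rng_state = include_rng_state
        self._rng = random.Random(random_seed)
        self.offset = None  # the "fitted" state

    def fit(self):
        # Fitting consumes one random value.
        self.offset = self._rng.random()

    def produce(self):
        # Producing consumes another random value.
        return self.offset + self._rng.random()

    def get_params(self):
        params = {"offset": self.offset}
        if self.include_rng_state:
            params["rng_state"] = self._rng.getstate()
        return params

    def set_params(self, params):
        self.offset = params["offset"]
        if "rng_state" in params:
            self._rng.setstate(params["rng_state"])


def fit_then_produce(include_rng_state, recreate):
    primitive = Primitive(random_seed=42, include_rng_state=include_rng_state)
    primitive.fit()
    if recreate:
        params = primitive.get_params()
        # A fresh instance, as one would create in a different environment.
        primitive = Primitive(random_seed=42, include_rng_state=include_rng_state)
        primitive.set_params(params)
    return primitive.produce()


uninterrupted = fit_then_produce(include_rng_state=False, recreate=False)
reinitialized = fit_then_produce(include_rng_state=False, recreate=True)
restored = fit_then_produce(include_rng_state=True, recreate=True)

print(reinitialized == uninterrupted)  # False: generator restarted from the seed
print(restored == uninterrupted)       # True: generator state travelled in params
```

The sketch shows only that the approach would restore the sequence; it does not settle whether the generator state belongs conceptually in Params.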