fit_produce() method
Our team has been having trouble for a while adapting our primitive to the API here, and with the pipeline spec we now really need a resolution. Our current workaround is to do duplicate work, which effectively doubles the time it takes to train.
The issue is that we don't know the full state to save until both the fit and produce steps have completed. When we are given training data, feature computation is a two-step process. We generate an initial list of feature computations, calculate those, and then apply a special one-hot-encoding variant to the calculated values. We save the final list of one-hot-encoded features, and on the test set we compute directly from this saved list (rather than going through the two-step process again).
The details are less important than the underlying issue. We can work around it by computing the training features twice:
```
dfs.fit()              # generates & encodes features, saves state needed to recompute
dfs.produce(train_ds)  # computes features on the train set a second time using saved features
dfs.produce(test_ds)   # computes features on the test set using saved features
```
scikit-learn gets around this kind of duplicate computation with its `fit_transform()` method.
If we had a `fit_produce()`, we could return the result that we need to compute anyway, and avoid computing it a second time in `produce()`.
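To make the request concrete, here is a rough sketch of the shape we have in mind. It is not our actual primitive and not the current API: `OneHotPrimitive`, `_compute`, `_encode`, `columns_`, and `compute_calls` are all made-up names for illustration, and `_compute` stands in for the expensive feature calculation. The point is only that `fit_produce()` can return the training features it had to compute anyway, while `produce()` keeps working from the saved state.

```python
class OneHotPrimitive:
    """Hypothetical stand-in for our primitive; every name here is illustrative."""

    def __init__(self):
        self.columns_ = None    # saved state: ordered list of one-hot-encoded features
        self.compute_calls = 0  # counts how often the expensive step runs

    def _compute(self, rows):
        # Placeholder for the expensive feature calculation.
        self.compute_calls += 1
        return [str(r).lower() for r in rows]

    def _encode(self, values):
        # One-hot encode against the saved feature list; unseen values become all zeros.
        return [[1 if v == c else 0 for c in self.columns_] for v in values]

    def fit_produce(self, rows):
        # Compute once, save the state, and return the training features directly.
        values = self._compute(rows)
        self.columns_ = sorted(set(values))
        return self._encode(values)

    def fit(self, rows):
        # fit() alone still works; it just discards the computed features.
        self.fit_produce(rows)
        return self

    def produce(self, rows):
        if self.columns_ is None:
            raise RuntimeError("call fit() or fit_produce() first")
        return self._encode(self._compute(rows))


train_ds = ["Red", "blue", "red"]
test_ds = ["blue", "green"]

dfs = OneHotPrimitive()
train_features = dfs.fit_produce(train_ds)  # one computation over the train set
test_features = dfs.produce(test_ds)        # one computation over the test set
```

In this sketch the expensive step runs once per dataset, instead of twice on the training set as in the `fit()` then `produce(train_ds)` workaround.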