Add split primitive and a new primitive base
This primitive splits the input dataset for cross-validation. It comes with a corresponding primitive base, `TestPrimitive`.
- When the data in the main table is split, the primitive detects which rows in the other tables are needed and keeps only those rows. An option controls whether rows in the other tables are deleted.
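To illustrate the intended effect, here is a minimal sketch that assumes tables are plain lists of row dicts. The function name `keep_needed_rows` and the option name `delete_other_rows` are hypothetical and are not taken from the primitive's code.

```python
from typing import Dict, List, Set

Table = List[dict]  # illustrative stand-in: a table is a list of row dicts


def keep_needed_rows(
    tables: Dict[str, Table],
    needed: Dict[str, Set[int]],
    main_table: str,
    delete_other_rows: bool = True,  # hypothetical name for the option
) -> Dict[str, Table]:
    """Return new tables that keep only the needed row positions."""
    result = {}
    for name, rows in tables.items():
        if name != main_table and not delete_other_rows:
            # Option off: non-main tables are copied unchanged.
            result[name] = list(rows)
            continue
        keep = needed.get(name, set())
        result[name] = [row for i, row in enumerate(rows) if i in keep]
    return result


tables = {
    "learningData": [{"user": "a"}, {"user": "b"}, {"user": "a"}],
    "users": [{"user": "a", "age": 30}, {"user": "b", "age": 41}],
}
needed = {"learningData": {0, 2}, "users": {0}}
trimmed = keep_needed_rows(tables, needed, main_table="learningData")
untrimmed = keep_needed_rows(
    tables, needed, main_table="learningData", delete_other_rows=False
)
```

With the option on, `trimmed["users"]` holds only the row for user `a`. With it off, the `users` table is left whole while the main table is still split.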
Algorithm for finding the rows in all tables that are needed:
- I run a graph traversal over the rows of the tables. For each needed row in the main table, the traversal starts from that row and follows the relation graph, enqueuing rows in other tables the first time they are visited.
- Because some tables can contain hundreds of millions of rows, I added an optimization that does not change the result of the algorithm: when a list of rows in a table shares the same value, I enqueue the whole list once instead of enqueuing each row separately.
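The traversal and the batching idea can be sketched roughly as follows. This is not the code in the merge request. It assumes tables are lists of row dicts and that each relation is directed from a referencing column to a referenced column. The name `find_needed_rows` and the `Relation` tuple layout are made up for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

Table = List[dict]
# Assumed layout: (source_table, source_column, target_table, target_column)
Relation = Tuple[str, str, str, str]


def find_needed_rows(
    tables: Dict[str, Table],
    relations: List[Relation],
    main_table: str,
    main_rows: List[int],
) -> Dict[str, Set[int]]:
    # Outgoing edges of every table in the relation graph.
    edges = defaultdict(list)
    for src, src_col, dst, dst_col in relations:
        edges[src].append((src_col, dst, dst_col))

    # Group row positions of each referenced column by value, so that all
    # rows sharing a value can be enqueued together as one list.
    groups = {}
    for _, _, dst, dst_col in relations:
        if (dst, dst_col) not in groups:
            by_value = defaultdict(list)
            for i, row in enumerate(tables[dst]):
                by_value[row[dst_col]].append(i)
            groups[(dst, dst_col)] = by_value

    needed = defaultdict(set)
    needed[main_table].update(main_rows)
    queue = deque([(main_table, list(main_rows))])  # each item is a batch
    expanded = set()  # (table, column, value) groups already enqueued

    while queue:
        table, batch = queue.popleft()
        for col, dst, dst_col in edges[table]:
            for value in {tables[table][i][col] for i in batch}:
                key = (dst, dst_col, value)
                if key in expanded:
                    continue
                expanded.add(key)
                rows = groups[(dst, dst_col)].get(value, [])
                new_rows = [i for i in rows if i not in needed[dst]]
                if new_rows:
                    needed[dst].update(new_rows)
                    queue.append((dst, new_rows))  # one enqueue per group
    return dict(needed)
```

Tracking visited `(table, column, value)` groups keeps the queue length proportional to the number of distinct values, not the number of rows, which is where the saving on very large tables would come from in this sketch.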
About the implementation:
- In the `fit` function, I generate the split indexes of the main resource and also build the relation graph for this dataset (by calling the `_build_relation_graph` function).
- The `inputs` parameter of `produce` and `produce_test` is a list of indexes, where each index selects the fold used as the testing set in cross-validation. The `datasets` parameter of `set_training_data` is a dataset instance.
- The graph traversal code is from line 146 to line 192.
- The `_cut_dataset` function is used once we have decided which rows to keep in each table. It takes the needed rows and generates a new dataset instance that keeps only those rows.
- The output of this primitive is a list of datasets with the same length as the input list of indexes, where each dataset corresponds to one index.
- I implemented a new primitive base, `TestPrimitive`, because `GeneratorPrimitiveBase` doesn't allow inputs. `TestPrimitive` was created mainly to add an input to the primitive.
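The calling pattern described in this list can be sketched with a toy class. `FoldSplitter` is a hypothetical stand-in and does not use the real primitive base classes. It treats a dataset as a list of rows from the main table only. It also assumes that `produce` returns the training portion for each requested fold and `produce_test` returns the held-out portion, which the description above does not state explicitly.

```python
import random
from typing import List, Sequence


class FoldSplitter:
    """Toy stand-in for the split primitive; not the real primitive API."""

    def __init__(self, n_folds: int, seed: int = 0) -> None:
        self.n_folds = n_folds
        self.seed = seed
        self.rows: List[dict] = []
        self.folds: List[List[int]] = []

    def set_training_data(self, *, datasets: Sequence[dict]) -> None:
        # Here a "dataset" is just the rows of the main table.
        self.rows = list(datasets)

    def fit(self) -> None:
        # Shuffle row positions once, then deal them into n_folds folds.
        order = list(range(len(self.rows)))
        random.Random(self.seed).shuffle(order)
        self.folds = [sorted(order[k::self.n_folds]) for k in range(self.n_folds)]

    def produce(self, *, inputs: List[int]) -> List[List[dict]]:
        # Assumed behaviour: for each fold index, every row outside the fold.
        out = []
        for k in inputs:
            held_out = set(self.folds[k])
            out.append([r for i, r in enumerate(self.rows) if i not in held_out])
        return out

    def produce_test(self, *, inputs: List[int]) -> List[List[dict]]:
        # Assumed behaviour: for each fold index, only the rows in the fold.
        return [[self.rows[i] for i in self.folds[k]] for k in inputs]


splitter = FoldSplitter(n_folds=3)
splitter.set_training_data(datasets=[{"d3mIndex": i} for i in range(7)])
splitter.fit()
train_sets = splitter.produce(inputs=[0, 2])
test_sets = splitter.produce_test(inputs=[0, 2])
```

Both calls return one dataset per requested fold index, so `train_sets` and `test_sets` each have length 2 here, matching the length of the input list.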
Note:
- The test for this primitive is in `common_primitives/test_split.py`. I put the test there because I made the new primitive base in `common_primitives/TestPrimitive.py`.
Edited by Mingjie Sun