Add split primitive and a new primitive base

Mingjie Sun requested to merge split into master

This primitive splits the input dataset for cross-validation. It comes with a corresponding new primitive base, TestPrimitive.

  • When the data in the main table is split, the primitive detects which rows in the other tables are needed and keeps only those rows. An option controls whether rows in the other tables are deleted.

Algorithm for finding the rows in all tables that are needed:

  1. I run a graph traversal over the rows of the tables. For each needed row in the main table, I start from that row and traverse the relation graph, enqueueing each row in another table the first time it is visited.
  2. Because some tables can contain hundreds of millions of rows, I added an optimization that does not change the result of the algorithm: if a list of rows in a table shares the same value, I enqueue the whole list once instead of enqueueing each row separately.
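
The traversal described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this merge request: the function name find_needed_rows, the representation of tables as lists of dicts, and the directed (from_table, from_column, to_table, to_column) relation tuples are all hypothetical. Rows in a target table are grouped by column value up front, so each group is enqueued once as a list.

```python
from collections import deque


def find_needed_rows(tables, relations, main_table, main_rows):
    """Return {table name: set of needed row indices}.

    tables: dict mapping table name -> list of row dicts (assumed layout).
    relations: list of (from_table, from_col, to_table, to_col) tuples;
        a row in from_table needs every row in to_table whose to_col
        holds the same value (assumed, directed relation model).
    """
    # Group the rows of each target column by value, so that all rows
    # sharing a value can be enqueued together as a single list.
    groups = {}
    for _, _, to_table, to_col in relations:
        if (to_table, to_col) not in groups:
            by_value = {}
            for i, row in enumerate(tables[to_table]):
                by_value.setdefault(row[to_col], []).append(i)
            groups[(to_table, to_col)] = by_value

    # Outgoing relations per table.
    outgoing = {}
    for from_table, from_col, to_table, to_col in relations:
        outgoing.setdefault(from_table, []).append((from_col, to_table, to_col))

    needed = {name: set() for name in tables}
    needed[main_table].update(main_rows)
    queue = deque([(main_table, list(main_rows))])
    followed = set()  # (to_table, to_col, value) groups already enqueued

    while queue:
        table, rows = queue.popleft()
        for from_col, to_table, to_col in outgoing.get(table, []):
            values = {tables[table][i][from_col] for i in rows}
            for value in values:
                key = (to_table, to_col, value)
                if key in followed:
                    continue
                followed.add(key)
                group = groups[(to_table, to_col)].get(value, [])
                new_rows = [i for i in group if i not in needed[to_table]]
                if new_rows:
                    needed[to_table].update(new_rows)
                    # One queue entry for the whole group of rows.
                    queue.append((to_table, new_rows))
    return needed
```

Tracking followed groups rather than individual rows is what keeps the queue short when many rows share a value; the set of rows finally marked as needed is the same as with a row-by-row traversal.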

About the implementation:

  1. In the fit function, I generate the split indexes of the main resource and build the relation graph for the dataset (by calling the _build_relation_graph function).
  2. The inputs parameter of the produce and produce_test functions is a list of indexes, where each index selects the fold to use as the test set in cross-validation. The datasets parameter of the set_training_data function is a dataset instance.
  3. The graph traversal code is from line 146 to line 192.
  4. The _cut_dataset function is used once we have decided which rows to keep in each table. We pass it the rows we need, and it generates a new dataset instance that keeps only those rows.
  5. The output of this primitive is a list of datasets of the same length as the input list of indexes, where each dataset corresponds to one index.
  6. I implemented a new primitive base, TestPrimitive, because GeneratorPrimitiveBase does not allow inputs. TestPrimitive was created mainly to add an input to the primitive.
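
To make the input/output contract in points 2, 4 and 5 concrete, here is a minimal sketch. It does not reproduce the primitive's real interface: the free functions make_folds, cut_dataset, produce and produce_test, the dict-of-lists dataset, and the round-robin fold assignment are hypothetical. It also assumes that produce returns the training part and produce_test the held-out fold, which the description above does not state explicitly.

```python
def make_folds(n_rows, n_folds):
    """Assign main-table row indices to folds round-robin (assumed scheme)."""
    return [list(range(k, n_rows, n_folds)) for k in range(n_folds)]


def cut_dataset(dataset, rows_to_keep):
    """Return a new dataset that keeps only the listed rows.

    dataset: dict mapping table name -> list of rows (assumed layout).
    rows_to_keep: dict mapping table name -> set of row indices; tables
        not mentioned are copied unchanged.
    """
    cut = {}
    for name, rows in dataset.items():
        if name in rows_to_keep:
            keep = rows_to_keep[name]
            cut[name] = [row for i, row in enumerate(rows) if i in keep]
        else:
            cut[name] = list(rows)
    return cut


def produce_test(dataset, folds, fold_indexes, main_table="main"):
    """One test dataset per requested fold index."""
    return [cut_dataset(dataset, {main_table: set(folds[k])}) for k in fold_indexes]


def produce(dataset, folds, fold_indexes, main_table="main"):
    """One training dataset per requested fold index (all other folds)."""
    outputs = []
    for k in fold_indexes:
        train_rows = {i for j, fold in enumerate(folds) if j != k for i in fold}
        outputs.append(cut_dataset(dataset, {main_table: train_rows}))
    return outputs
```

In both functions the returned list has one dataset per entry of fold_indexes, in the same order. In the real primitive, the rows_to_keep mapping would also cover the other tables, using the rows found by the graph traversal.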

Note:

  1. The test for this primitive is in common_primitives/test_split.py. I put it there because I added the new primitive base in common_primitives/TestPrimitive.py.
Edited by Mingjie Sun
