Make the denormalization process recursive and keep tables that are not joined in the denormalize primitive.

Mingjie Sun requested to merge denormalize into master

The primitive now has two additional features:

  1. If the hyperparameter recursive is set to True, the primitive joins tables recursively. For example, if table 1 (the main table) has a foreign key that points to table 2, and table 2 has a foreign key that points to table 3, then after table 2 is joined into table 1, table 1 has a foreign key that points to table 3, and the process continues from there.

  2. It keeps the tables that are not joined during the denormalization process.
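The two features can be illustrated with a minimal sketch. This is not the primitive's code: it uses plain dictionaries instead of a dataset with metadata, and the function name `denormalize_sketch`, the `table.column` naming of joined columns, and the example tables are all assumptions made for illustration.

```python
def denormalize_sketch(tables, foreign_keys, main, recursive=True):
    """Join tables referenced by foreign keys into the main table.

    tables: {table name: list of row dicts}
    foreign_keys: {(table, column): (target table, target column)}
    Returns the joined rows and the tables that were not joined.
    """
    rows = [dict(row) for row in tables[main]]
    # Foreign keys that currently sit on the main table.
    pending = {col: tgt for (tbl, col), tgt in foreign_keys.items() if tbl == main}
    joined = set()
    while pending:
        next_pending = {}
        for col, (target_table, target_col) in pending.items():
            if target_table == main or target_table in joined:
                continue  # skip self references and tables already joined
            index = {r[target_col]: r for r in tables[target_table]}
            for row in rows:
                match = index.get(row.get(col), {})
                for key, value in match.items():
                    if key != target_col:
                        # Assumed naming scheme: prefix with the source table.
                        row[f"{target_table}.{key}"] = value
            joined.add(target_table)
            # Foreign keys of the joined table now sit on the main table.
            for (tbl, fk_col), tgt in foreign_keys.items():
                if tbl == target_table:
                    next_pending[f"{target_table}.{fk_col}"] = tgt
        # Without recursion, stop after the first round of joins.
        pending = next_pending if recursive else {}
    kept = {name: t for name, t in tables.items() if name != main and name not in joined}
    return rows, kept


# Hypothetical example data: learningData -> customers -> cities, plus an unrelated table.
example_tables = {
    "learningData": [{"d3mIndex": 0, "customer_id": 10}, {"d3mIndex": 1, "customer_id": 11}],
    "customers": [{"id": 10, "city_id": 100}, {"id": 11, "city_id": 101}],
    "cities": [{"id": 100, "name": "Oslo"}, {"id": 101, "name": "Lima"}],
    "notes": [{"id": 1, "text": "unused"}],
}
example_foreign_keys = {
    ("learningData", "customer_id"): ("customers", "id"),
    ("customers", "city_id"): ("cities", "id"),
}

recursive_rows, recursive_kept = denormalize_sketch(
    example_tables, example_foreign_keys, "learningData", recursive=True)
single_rows, single_kept = denormalize_sketch(
    example_tables, example_foreign_keys, "learningData", recursive=False)
```

With recursive=True the sketch pulls in both customers and cities and keeps only notes; with recursive=False it joins customers alone and keeps cities and notes.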

Now for the implementation:

  1. To avoid duplicating the join code, I moved it into a separate function, _denormalize. Whichever way the hyperparameter recursive is set, we can call _denormalize instead of repeating the join logic.

  2. The function _prepare_metadata was added because, in recursive denormalization, each round of denormalization starts from fresh metadata and adds the column metadata as the join proceeds.

  3. If a table other than the main resource is not joined, we keep it in the output. The key point is that if such a table contains a foreign key that points to a joined table, we redirect that foreign key to point to the main resource.
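The foreign key redirection in point 3 can be sketched as follows. This is only an illustration under assumptions: foreign keys are modelled as a plain dictionary rather than dataset metadata, the helper name `repoint_foreign_keys` is hypothetical, and joined columns are assumed to keep their original names in the main resource.

```python
def repoint_foreign_keys(foreign_keys, main, joined):
    """Rewrite foreign keys after the tables in `joined` were merged into `main`.

    foreign_keys: {(table, column): (target table, target column)}
    """
    updated = {}
    for (table, column), (target_table, target_column) in foreign_keys.items():
        # Columns of a joined table now live in the main resource.
        source = main if table in joined else table
        if source == main and target_table in joined:
            continue  # this foreign key was consumed by the join itself
        # A pointer to a joined table is moved to the main resource.
        target = main if target_table in joined else target_table
        updated[(source, column)] = (target, target_column)
    return updated


# Hypothetical example: "customers" was joined into "learningData";
# "orders" and "shops" were not joined and stay in the output.
fks_before = {
    ("learningData", "customer_id"): ("customers", "id"),
    ("orders", "customer_id"): ("customers", "id"),
    ("orders", "shop_id"): ("shops", "id"),
}
fks_after = repoint_foreign_keys(fks_before, "learningData", {"customers"})
```

In this example the foreign key from orders to customers ends up pointing at learningData, while the one from orders to shops is left unchanged.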

TODO:

  1. The dataset_to_dataframe primitive currently only allows the input dataset to contain one table. As a result, the CircleCI test for this merge request fails, because the output might contain multiple tables. There are two possible solutions:
    1. Change the dataset_to_dataframe primitive to allow the input dataset to have multiple tables.
    2. Make keeping the tables that are not joined a hyperparameter of the primitive.
  2. Handle the case where the foreign keys form a loop.
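For the foreign key loop case, one possible starting point is to detect a cycle before joining. The sketch below is an assumption about how this could be approached, not something this merge request implements; the function name `find_foreign_key_cycle` and the dictionary representation of foreign keys are hypothetical.

```python
def find_foreign_key_cycle(foreign_keys):
    """Return a list of table names forming a cycle, or None if there is none.

    foreign_keys: {(table, column): (target table, target column)}
    """
    # Build a directed graph: table -> tables it points to.
    graph = {}
    for (table, _), (target, _) in foreign_keys.items():
        graph.setdefault(table, set()).add(target)
        graph.setdefault(target, set())

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = dict.fromkeys(graph, UNVISITED)

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in sorted(graph[node]):
            if state[nxt] == IN_PROGRESS:
                # Back edge: the cycle runs from nxt to the current node and back.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in sorted(graph):
        if state[node] == UNVISITED:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical examples: a -> b -> c -> a is a loop, a -> b -> c is not.
cyclic = {("a", "b_id"): ("b", "id"), ("b", "c_id"): ("c", "id"), ("c", "a_id"): ("a", "id")}
acyclic = {("a", "b_id"): ("b", "id"), ("b", "c_id"): ("c", "id")}
cycle_found = find_foreign_key_cycle(cyclic)
no_cycle = find_foreign_key_cycle(acyclic)
```

A recursive join could then either refuse to run when a cycle is reported or stop following a foreign key once its target table has already been joined.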
Edited by Mingjie Sun