Add primitive for flattening a column of multidimensional ndarrays into multiple columns (!109) · Merge requests · datadrivendiscovery / common-primitives

Mark Poscablo requested to merge mark-poscablo/common-primitives:ndarray-flatten into master Nov 26, 2019

This primitive converts a column of multidimensional ndarrays into multiple flattened columns, one column per array element. Below is a visual illustration, taken from the docstring:

       col_A  col_B  col_C                            ndarrays
    0      9      a      d  [[[9], [8], [7]], [[6], [5], [4]]]
    1     29      b      e  [[[6], [7], [8]], [[9], [0], [1]]]
    2     49      c      f  [[[2], [3], [4]], [[5], [6], [7]]]

    yields:

       col_A  col_B  col_C  0  1  2  3  4  5
    0      9      a      d  9  8  7  6  5  4
    1     29      b      e  6  7  8  9  0  1
    2     49      c      f  2  3  4  5  6  7
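The transformation in the illustration can be approximated in plain pandas and NumPy. This is only a minimal sketch, not the primitive's actual implementation: the helper name `flatten_ndarray_column` is hypothetical, and it assumes every array in the column has the same shape and is flattened in row-major (C) order.

```python
import numpy as np
import pandas as pd


def flatten_ndarray_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: replace `column` with one column per array element."""
    # Flatten each row's array in row-major order and stack into an (n_rows, n_elements) matrix.
    # Assumes all arrays share one shape; np.stack raises ValueError otherwise.
    flat = np.stack([np.asarray(a).ravel() for a in df[column]])
    features = pd.DataFrame(flat, index=df.index)  # new columns are labelled 0..n_elements-1
    return pd.concat([df.drop(columns=[column]), features], axis=1)


arrays = [
    np.array([[[9], [8], [7]], [[6], [5], [4]]]),
    np.array([[[6], [7], [8]], [[9], [0], [1]]]),
    np.array([[[2], [3], [4]], [[5], [6], [7]]]),
]

# Build the ndarray column cell by cell so each cell holds a whole array object.
cells = np.empty(len(arrays), dtype=object)
for i, a in enumerate(arrays):
    cells[i] = a

df = pd.DataFrame({
    "col_A": [9, 29, 49],
    "col_B": ["a", "b", "c"],
    "col_C": ["d", "e", "f"],
})
df["ndarrays"] = cells

result = flatten_ndarray_column(df, "ndarrays")
print(result)
```

Each input array here has shape (2, 3, 1), so six new columns are appended and the number of rows stays at three.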

This differs from the existing Dataframe Flatten primitive, whose output is illustrated below (taken from its docstring):

    [
        a, b, [w, x],
        c, d, [y, z],
    ]

    yields:

    [
        a, b, w,
        a, b, x,
        c, d, y,
        c, d, z
    ]

More specifically, this primitive does not create a duplicate row for each element in the ndarray. Instead, it turns each row's multidimensional array into a feature vector expressed as new columns, one per element of the array, so the number of rows is unchanged.
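The difference between the two behaviors can be seen side by side in plain pandas. In this sketch, `DataFrame.explode` is used only as a stand-in for the row-duplicating behavior of Dataframe Flatten (it is not that primitive's implementation), and the column names `first`, `second`, and `nested` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "first": ["a", "c"],
    "second": ["b", "d"],
    "nested": [["w", "x"], ["y", "z"]],
})

# Row-wise flattening: one output row per nested element, other cells duplicated.
as_rows = df.explode("nested", ignore_index=True)

# Column-wise flattening: one output row per input row, one new column per element.
as_columns = pd.concat(
    [df.drop(columns=["nested"]), pd.DataFrame(df["nested"].tolist(), index=df.index)],
    axis=1,
)

print(as_rows.shape)     # rows grow, column count stays the same
print(as_columns.shape)  # rows stay the same, column count grows
```

The column-wise form is what a classifier expecting one feature per column can consume directly, since each sample remains a single row.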

The need for this arose when the authors of this MR tried to build a pipeline that feeds an image dataset to a classifier primitive (d3m.primitives.classification.logistic_regression.SKlearn, to be specific). The pipeline includes a step that uses the Columns Image Reader primitive (d3m.primitives.data_preprocessing.image_reader.Common), but the authors found no existing primitive that converts the output of this step into a form the classifier step can consume. This primitive was therefore built as the glue between Columns Image Reader and the classifier primitive. More generally, it aims to bridge any primitive that produces multidimensional ndarrays in a single column with any primitive that can only consume features as separate columns (one per feature).

All feedback is welcome. The WIP prefix will be removed when reviewers find that this MR is ready to be merged.

Edited Oct 25, 2021 by Mitar
