Suggestion: tree-shakeable transformations

Created by: luucvanderzee

While our transformations currently have a pretty decent API, there is one big drawback to the current class-based chaining approach: the transformations are not tree-shakeable. Because every transformation is a method on the class, a bundler cannot drop the ones you don't use, so the code for all transformations is always sent to the client.

I think I just came up with an approach to circumvent this problem. Say that we want to do this with the current API:

import DataContainer from '@snlab/florence-datacontainer'

const data = new DataContainer({
  fruit: ['apple', 'apple', 'banana', 'banana'],
  price: [1, 2, 3, 4]
})

const meanPricePerFruit = data
  .groupBy('fruit')
  .summarise({ mean_price: { price: 'mean' } })
  .arrange({ mean_price: 'descending' })

In the proposed API, this would become:

import DataContainer, { groupBy, summarise, arrange } from '@snlab/florence-datacontainer'

const data = new DataContainer({
  fruit: ['apple', 'apple', 'banana', 'banana'],
  price: [1, 2, 3, 4]
})

const meanPricePerFruit = data.pipe(
  groupBy('fruit'),
  summarise({ mean_price: { price: 'mean' } }),
  arrange({ mean_price: 'descending' })
)
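
For this to work, `pipe` only has to pass the data through each function in turn. A minimal sketch of that idea follows. `PipeableContainer` and `scale` are hypothetical names made up for illustration, not part of florence-datacontainer, and the real `pipe` might well look different:

```javascript
// Hypothetical stand-in for DataContainer, only to illustrate pipe().
class PipeableContainer {
  constructor (data) {
    this._data = data
  }

  // Pass the data through each transformation in order,
  // then wrap the result in a new container.
  pipe (...transformations) {
    const result = transformations.reduce(
      (data, transformation) => transformation(data),
      this._data
    )
    return new PipeableContainer(result)
  }

  column (columnName) {
    return this._data[columnName]
  }
}

// Hypothetical transformation: multiply every value in a column by a factor.
const scale = (columnName, factor) => data => ({
  ...data,
  [columnName]: data[columnName].map(value => value * factor)
})

const doubled = new PipeableContainer({ price: [1, 2, 3, 4] })
  .pipe(scale('price', 2))

console.log(doubled.column('price')) // [2, 4, 6, 8]
```

Since `scale` is a plain function that the class never references, a bundler can drop it whenever it is not imported.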

The advantages of this approach are:

  1. Tree-shakeable transformations, as mentioned above
  2. Easy for users to write and use custom transformations (see below)
  3. Transformations can be used without a DataContainer (at least if you are using column-oriented data)
  4. Cleaner separation of code; tests can focus purely on the transformations
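
To illustrate point 3: a transformation in this style is just a function of column-oriented data, so it can be called directly. `filterRows` below is a hypothetical transformation written for this example, not an existing export:

```javascript
// Hypothetical transformation in the proposed curried style:
// keep only the rows where predicate(value) is true for the given column.
const filterRows = (columnName, predicate) => data => {
  const keep = data[columnName].map(value => predicate(value))
  const result = {}
  for (const name of Object.keys(data)) {
    result[name] = data[name].filter((_, i) => keep[i])
  }
  return result
}

const columns = {
  fruit: ['apple', 'apple', 'banana', 'banana'],
  price: [1, 2, 3, 4]
}

// No DataContainer involved: configure the transformation, then apply it.
const expensive = filterRows('price', price => price > 2)(columns)

console.log(expensive) // { fruit: ['banana', 'banana'], price: [3, 4] }
```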

An example of how a user could write a custom toQuantitative transformation to convert categorical data to quantitative data:

import DataContainer from '@snlab/florence-datacontainer'

const toQuantitative = columnName => {
  return data => {
    const categoricalColumn = data[columnName]
    data[columnName] = categoricalColumn.map(value => parseFloat(value))

    return data
  }
}

const dataContainer = new DataContainer({ 
  amount: [1, 2, 3, 4], 
  price: ['1', '2', '3', '4']
}).pipe(toQuantitative('price'))

console.log(dataContainer.column('price')) // [1, 2, 3, 4]
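
One detail worth noting: the example above assigns to `data[columnName]`, so it modifies the object it receives. Whether that matters depends on what `pipe` passes in, which this proposal does not pin down. If mutation turns out to be a problem, a variant that returns a new object could look like this (a sketch, not a decided convention):

```javascript
// Non-mutating variant: copy the columns and replace only the converted one.
const toQuantitative = columnName => data => ({
  ...data,
  [columnName]: data[columnName].map(value => parseFloat(value))
})

const original = { amount: [1, 2, 3, 4], price: ['1', '2', '3', '4'] }
const converted = toQuantitative('price')(original)

console.log(converted.price) // [1, 2, 3, 4]
console.log(original.price) // ['1', '2', '3', '4']
```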

Thoughts?