Discuss supporting Apache Arrow in Utopia

A while ago, I replaced HDF5 with Apache Arrow for data handling in my project. I did that to get access to more modern data formats that support easy compression of what would be vlen data in HDF5, and to get rid of the inherent complexity of HDF5. Arrow supports a number of data formats geared towards tabular, columnar data, most notably Parquet and the Arrow IPC format itself, which lend themselves well to complex data analysis tasks. It also supports CSV, for instance, through a largely unified interface. Additionally, the underlying memory model allows for easy cross-language compatibility without copying or reorganizing data, i.e., without us having to take care of this ourselves. For instance, the polars library uses Arrow under the hood, and to great effect as far as my experience goes. Finally, we gain a link to the Apache big data world this way, which might be interesting for outsiders.

A disadvantage is that we lose the everything-in-a-single-file property we have with HDF5. On the other hand, in my experience data size is reduced considerably with Arrow, among other things because it has better data compression support.

The rather ramshackle code in which I currently use Arrow can be found here. It suffers from a severe lack of comments and a bad case of spaghetti syndrome, but it should give a rough overview of how things work.

I suggest using this issue to collect opinions on adding Arrow as an optional dependency to Utopia. If we decide to adopt it, we can discuss how to integrate support for it in a follow-up.

Edited Jan 02, 2023 by Harald Mack
