fix performance of envelope.concat
On a dataset of ~60,000 units (~2,000 of which exist in both sets), envelope.concatData
takes over 60 minutes on my machine.
This is largely due to the implementation of utils.concatManyWith.
I wrote some horrible code that reduces the time of envelope.concatData
to 2 minutes on the same dataset, just to see if it was possible.
// in envelope.js
// Assumes the lodash/fp helpers used below (curry, keyBy, keys, intersection,
// xor, map, sortBy) are imported, and that loConcat is lodash's concat under an alias.
const crapConcat = curry((merger, xs, ys) => {
  console.log('concat');
  // Index both sides by hash so lookups are O(1).
  const x = keyBy('_lf_id_hash')(xs);
  const y = keyBy('_lf_id_hash')(ys);
  const kx = keys(x);
  const ky = keys(y);
  // Keys present on both sides get merged; keys on only one side pass through.
  const same = intersection(kx, ky);
  const different = xor(kx, ky);
  const merged = map(id => merger(x[id], y[id]))(same);
  const notmerged = map(id => x[id] || y[id])(different);
  return sortBy('_lf_id_hash', loConcat(merged, notmerged));
});

export const concat = curry((e1, e2) => {
  const data = crapConcat(ds.concatOne, e1.data || ds.empty(), e2.data || ds.empty());
  const queries = qs.concat(e1.queries || qs.empty(), e2.queries || qs.empty());
  return envelope(data, queries);
});
This code has a lot of obvious problems:
- I'm not actually sure it behaves the same, though at a cursory glance the output seems OK.
- It fails all the envelope tests:
  - monoid left
  - monoid right
- It relies on _lf_id_hash in the unit, so it will possibly only work in envelope.
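
One possible reason for the monoid failures (a guess, not checked against the actual test suite): the final sortBy reorders the units, so concatenating with an empty set no longer returns the input unchanged. The sketch below uses plain arrays and a hypothetical `sortedConcat` stand-in with an `id` field rather than the real envelope code.

```javascript
// Hypothetical stand-in for the last step of crapConcat: join, then sort by key.
const sortedConcat = (xs, ys) =>
  [...xs, ...ys].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

// Compares two unit lists position by position.
const sameOrder = (xs, ys) =>
  xs.length === ys.length && xs.every((x, i) => x.id === ys[i].id);

const units = [{ id: 'b' }, { id: 'a' }]; // deliberately not sorted
const empty = [];

// Identity laws: concat(empty, x) and concat(x, empty) should both equal x.
console.log(sameOrder(sortedConcat(empty, units), units)); // false
console.log(sameOrder(sortedConcat(units, empty), units)); // false
```

If the tests compare order-sensitively, any input that is not already sorted by hash would fail both identity checks this way.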
It would be great to get some advice on how to get utils.concatManyWith to function that fast, or on whether envelope should just have its own concat function, where we can assume there will be an _lf_id_hash available.
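
As a starting point for discussion, here is a rough sketch of a keyed concat that takes the key function as a parameter, so it does not have to hard-code _lf_id_hash. `concatByKey`, `byHash` and `mergeUnits` are hypothetical names; this is not the existing utils.concatManyWith, and it assumes keys are unique within each input. It runs in roughly linear time and keeps the original order instead of sorting.

```javascript
// Hypothetical helper, not the existing utils.concatManyWith.
// keyOf extracts a unit's identity; merger combines two units sharing a key.
const concatByKey = (keyOf, merger, xs, ys) => {
  // Index the right-hand side once so each lookup is O(1).
  const rightByKey = new Map(ys.map(y => [keyOf(y), y]));
  const seen = new Set();
  const out = [];

  // Walk the left side in order, merging where a match exists.
  for (const x of xs) {
    const k = keyOf(x);
    seen.add(k);
    out.push(rightByKey.has(k) ? merger(x, rightByKey.get(k)) : x);
  }
  // Append right-only units in their original order.
  for (const y of ys) {
    if (!seen.has(keyOf(y))) out.push(y);
  }
  return out;
};

const byHash = u => u._lf_id_hash;
const mergeUnits = (a, b) => ({ ...a, ...b }); // stand-in for ds.concatOne

const left = [{ _lf_id_hash: 'b', n: 1 }, { _lf_id_hash: 'a', n: 2 }];
const right = [{ _lf_id_hash: 'a', m: 3 }, { _lf_id_hash: 'c', m: 4 }];
console.log(concatByKey(byHash, mergeUnits, left, right));
```

Because nothing is reordered, concatenating with an empty list on either side returns the same units in the same order, which is what the identity checks would need. Whether the merged output matches what utils.concatManyWith produces today would still have to be verified.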
Happy to take this on once I've gotten some sage advice.
:)