Performance vs DataFrames.jl
I ran the following quick test:
using DataFrames
using Random # provides shuffle and shuffle!
import FlexiJoins
df1 = DataFrame(a = 1:1_000_000);
df2 = shuffle(df1);
shuffle!(df1);
df1.id1 = axes(df1, 1);
df2.id2 = axes(df2, 1);
@time innerjoin(df1, df2, on=:a); # after compilation: around 0.06 seconds
@time FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a)); # after compilation: around 0.4 seconds, with more allocations
(I also checked joining on string columns and the results are similar.)
This makes me wonder whether the extra cost comes from converting more data than the join actually needs.
I looked at https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L44 and my comments for your consideration are:
- DataFrame in the signature is too restrictive; I recommend AbstractDataFrame (so that views can also be passed to joins);
- I think it would be more efficient to pass only the columns on which you join plus some indicator column (like my id1 and id2 columns above), and later use this indicator column to compose the final data frame (this would probably avoid unnecessary movement of data).
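To make the second suggestion concrete, here is a minimal sketch of the idea, not a proposal for the actual FlexiJoins.jl internals. It uses DataFrames.jl's own innerjoin as a stand-in for the matching step; the function name indexed_innerjoin and the helper columns :__li and :__ri are made up for illustration. It also assumes the two tables share no non-key column names.

```julia
using DataFrames

# Hypothetical helper: match rows using only the key column plus a row-number
# column, then gather the remaining columns by index afterwards.
function indexed_innerjoin(left::AbstractDataFrame, right::AbstractDataFrame, key::Symbol)
    # Narrow tables: the join key and the row number of each row.
    # (:__li and :__ri are illustrative names, assumed not to clash with real columns.)
    l = DataFrame(key => left[!, key], :__li => 1:nrow(left))
    r = DataFrame(key => right[!, key], :__ri => 1:nrow(right))

    # Only the narrow tables take part in the matching step.
    idx = innerjoin(l, r, on = key)

    # Compose the result by indexing the original tables with the matched row numbers.
    out = left[idx.__li, :]
    for name in names(right, Not(key))
        # Assumes non-key column names are unique across both tables.
        out[!, name] = right[idx.__ri, name]
    end
    return out
end
```

Because the arguments are typed as AbstractDataFrame, a call such as indexed_innerjoin(view(df1, 1:1, :), df2, :a) accepts a view as well, which is the point of the first suggestion.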
I also have a question regarding output. Here is an MWE:
julia> df1 = DataFrame(a=1:2, id1=1:2)
2×2 DataFrame
Row │ a id1
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
julia> df2 = DataFrame(a=[2, 1], id2=1:2)
2×2 DataFrame
Row │ a id2
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
julia> innerjoin(df1, df2, on=:a)
2×3 DataFrame
Row │ a id1 id2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 2 1
2 │ 1 1 2
julia> FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a))
2×4 DataFrame
Row │ a id1 a_1 id2
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 1 1 2
2 │ 2 2 2 1
My question is: why is the join column repeated as a_1 in the output? I understand it might be needed in more flexible join models, but in an equality join the values should be the same, right?
Finally: what is the guaranteed row order of the produced output, if there is one (i.e. does it follow the left table)? Related to this, I noticed a significant difference in timing between FlexiJoins.jl and DataFrames.jl when working with unbalanced tables:
julia> using BenchmarkTools
julia> df1 = DataFrame(a = 1:10);
julia> df2 = DataFrame(a=1:10^8);
julia> @benchmark innerjoin($df1, $df2, on=:a)
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 110.431 ms … 123.599 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 113.138 ms ┊ GC (median): 0.00%
Time (mean ± σ): 113.592 ms ± 2.453 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃ ▁█▁ ▁▁
▄▄▁█▄▄▄███▄▁██▁▁▄▇▄▇▇▄▄▄▄▄▁▄▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 124 ms <
Memory estimate: 9.94 KiB, allocs estimate: 148.
julia> @benchmark innerjoin($df2, $df1, on=:a)
BenchmarkTools.Trial: 44 samples with 1 evaluation.
Range (min … max): 110.546 ms … 138.237 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 113.484 ms ┊ GC (median): 0.00%
Time (mean ± σ): 114.483 ms ± 4.388 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▃▆▁█▃▃▁
▄█▇███████▇▄▄▁▄▁▁▄▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
111 ms Histogram: frequency by time 138 ms <
Memory estimate: 9.94 KiB, allocs estimate: 148.
julia> @benchmark FlexiJoins.innerjoin(($df1, $df2), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 2.214 s … 2.245 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.217 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.225 s ± 16.902 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.21 s Histogram: frequency by time 2.24 s <
Memory estimate: 9.75 KiB, allocs estimate: 133.
julia> @benchmark FlexiJoins.innerjoin(($df2, $df1), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 2.257 s … 2.309 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.270 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.278 s ± 27.112 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.26 s Histogram: frequency by time 2.31 s <
Memory estimate: 9.75 KiB, allocs estimate: 133.
Sorry for the long list. I am reporting things I noticed when comparing the packages, and I hope they will be useful.