Performance vs DataFrames.jl

I ran the following quick test:

using DataFrames
using Random  # provides shuffle and shuffle!
import FlexiJoins
df1 = DataFrame(a = 1:1_000_000);
df2 = shuffle(df1);
shuffle!(df1);
df1.id1 = axes(df1, 1);
df2.id2 = axes(df2, 1);
@time innerjoin(df1, df2, on=:a); # after compilation: around 0.06 seconds
@time FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a)); # around 0.4 seconds, with more allocations

(I also checked joining on string columns, and the results were similar.)

This makes me wonder whether the extra cost comes from converting more data than necessary.

I looked at https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L44 and my comments for your consideration are:

  • DataFrame in the signature is too restrictive; I recommend AbstractDataFrame (so that views can also be passed to joins);
  • I think it would be more efficient to pass only the columns you join on plus an indicator column (like my id1 and id2 columns above), and then use this indicator column to compose the final data frame (this would probably avoid unnecessary movement of data).
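
To make the second point concrete, here is a minimal sketch of the idea. It is not how FlexiJoins.jl works internally: the helper name narrow_innerjoin and the temporary column names __lrow and __rrow are hypothetical, and DataFrames' own innerjoin merely stands in for whatever join engine is used. Only the key column and a row-number column go through the join; the remaining columns are gathered by row index afterwards.

```julia
using DataFrames

# Hypothetical helper illustrating the "join on keys + row indicator" idea.
# Accepts any AbstractDataFrame, so views work as well.
function narrow_innerjoin(left::AbstractDataFrame, right::AbstractDataFrame, key::Symbol)
    # Narrow tables: just the key column and the original row number.
    # The names __lrow and __rrow are assumed not to clash with `key`.
    l = DataFrame(key => left[!, key], :__lrow => 1:nrow(left))
    r = DataFrame(key => right[!, key], :__rrow => 1:nrow(right))

    # Stand-in join engine: only the narrow tables are passed to the join.
    idx = innerjoin(l, r, on=key)

    # Compose the final data frame by indexing the original tables once.
    out = left[idx.__lrow, :]
    rest = right[idx.__rrow, Not(key)]
    return hcat(out, rest; makeunique=true)
end
```

With the df1 and df2 from the test above, narrow_innerjoin(df1, df2, :a) should produce the same rows as innerjoin(df1, df2, on=:a), possibly in a different order.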

I also have a question regarding output. Here is an MWE:

julia> df1 = DataFrame(a=1:2, id1=1:2)
2×2 DataFrame
 Row │ a      id1
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2

julia> df2 = DataFrame(a=[2, 1], id2=1:2)
2×2 DataFrame
 Row │ a      id2
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2

julia> innerjoin(df1, df2, on=:a)
2×3 DataFrame
 Row │ a      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      1
   2 │     1      1      2

julia> FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a))
2×4 DataFrame
 Row │ a      id1    a_1    id2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      1      1      2
   2 │     2      2      2      1

My question is: why is the key column repeated as a_1 in the output? I understand it might be needed for more flexible join conditions, but in an equality join the values should be identical, right?

Finally: what is the guaranteed row order of the produced output, if there is one (i.e. does it follow the left table)? Related to this, I noticed a significant difference in timing between FlexiJoins.jl and DataFrames.jl when working with unbalanced tables:

julia> using BenchmarkTools

julia> df1 = DataFrame(a = 1:10);

julia> df2 = DataFrame(a=1:10^8);

julia> @benchmark innerjoin($df1, $df2, on=:a)
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  110.431 ms … 123.599 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.138 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.592 ms ±   2.453 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃   ▁█▁  ▁▁
  ▄▄▁█▄▄▄███▄▁██▁▁▄▇▄▇▇▄▄▄▄▄▁▄▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  110 ms           Histogram: frequency by time          124 ms <

 Memory estimate: 9.94 KiB, allocs estimate: 148.

julia> @benchmark innerjoin($df2, $df1, on=:a)
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  110.546 ms … 138.237 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.484 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.483 ms ±   4.388 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁ ▃▆▁█▃▃▁
  ▄█▇███████▇▄▄▁▄▁▁▄▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  111 ms           Histogram: frequency by time          138 ms <

 Memory estimate: 9.94 KiB, allocs estimate: 148.

julia> @benchmark FlexiJoins.innerjoin(($df1, $df2), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.214 s …   2.245 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.217 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.225 s ± 16.902 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    █                                                  █
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.21 s         Histogram: frequency by time        2.24 s <

 Memory estimate: 9.75 KiB, allocs estimate: 133.

julia> @benchmark FlexiJoins.innerjoin(($df2, $df1), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.257 s …   2.309 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.270 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.278 s ± 27.112 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █             █                                         █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.26 s         Histogram: frequency by time        2.31 s <

 Memory estimate: 9.75 KiB, allocs estimate: 133.

Sorry for the long list. I am reporting things I noticed when comparing the packages, and I hope it will be useful.