Currently join copies all data to a new table (correct me please if I am wrong here).
This can be very expensive for huge DataFrames. E.g. data.table in R allows to update a table with columns from other table in-place.
Any opinion if something like this would be a desirable option for :left and :right joins?
Why not. I imagine we could call this join!?
(Though I think improving the efficiency of join by using optimized hashing algorithms for each combination of column types (in particular CategoricalArray) should be higher priority.)
I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.
Couple of notes on my usage of join, that y'all probably know about but:
It causes repl crashes frequently, as once it eats all my ram, it keeps trying to perform join.
I don't have a great fix, but I have to work around by joining subsets of the original array of dataframes, and then joining these.
When I am able to interrupt the join, it doesn't seem like the memory is freed until you kill the repl.
Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on join - 50GB tests all fail, 5GB tests are very bad).
If no one else works on it probably I will have a look at the code some time in the future as this is the major thing that requires fixing in terms of performance in DataFrames.jl.
Since you probably know a lot more of the implementation, what is causing the problem?
Hopefully I can start work on a PR, join.jl is a pretty long file
I often want to join with splat join(dfs..., on=:x1) but know that it will crash, even for relatively small dataframes, so there is probably optimization possibilities for this case.
edit: join! makes a lot of sense
Actually it is the only part of source code that I never touched before.
For sure innerjoin(dfs..., on=:x1) can be optimized (now we just recursively perform a join), but I would start with a single join.
I'm not sure if this helps, but StructArrays has a utility function to iterate pairs of ranges i0:i1, j0:j1 such that lkeys[sortperm(lkeys)[i0:i1]] and rkeys[sortperm(rkeys)[j0:j1]] are all the entries with a given value of keys. It works best if lkeys, rkeys are StructArrays whose columns are all of isbits elements (i.e. after pooling).
You would call it with:
StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
lperm=sortperm(lkeys), rperm=sortperm(rkeys))
Again, sortperm(lkeys) should be pretty efficient if all cols are isbits, better if integers. For string arrays with many unique entries, probably better to call radixsort I imagine.
If you are going with the sorting direction than hashing, I guess using the galloping search could be beneficial? When I tried it with simple "dot products" sum(x1 .* x2 .* ... .* xn) on sparse vectors (which is like fused reduce-innerjoin), it was much faster than linear search (which is what GroupJoinPerm does?) when there are more than two vectors. I guess it depends on the distribution of the keys, though.
This may be an aside error, but InterruptException on a join prevents the memory from being freed, essentially forcing a repl restart. Would there be anyway to deallocate on interrupt?
If this happens it probably should be reported to Julia as it seems to be a bug with GC.
Most helpful comment
I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.