Dataframes.jl: Memory efficiency of join

Created on 29 Dec 2017 · 10Comments · Source: JuliaData/DataFrames.jl

Currently join copies all data to a new table (correct me please if I am wrong here).

This can be very expensive for huge DataFrames. E.g. data.table in R allows to update a table with columns from other table in-place.

Any opinion if something like this would be a desirable option for :left and :right joins?

non-breaking performance priority

Source

bkamins

👍2

Most helpful comment

I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.

bkamins on 9 May 2020

❤2

All 10 comments

Why not. I imagine we could call this join!?

(Though I think improving the efficiency of join by using optimized hashing algorithms for each combination of column types (in particular CategoricalArray) should be higher priority.)

nalimilan on 30 Dec 2017

👍2

I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.

bkamins on 9 May 2020

❤2

Couple of notes on my usage of join, that y'all probably know about but:

It causes repl crashes frequently, as once it eats all my ram, it keeps trying to perform join.

I don't have a great fix, but I have to work around by joining subsets of the original array of dataframes, and then joining these.

When I am able to interrupt the join, it doesn't seem like the memory is freed until you kill the repl.

anandijain on 12 May 2020

Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on join - 50GB tests all fail, 5GB tests are very bad).

If no one else works on it probably I will have a look at the code some time in the future as this is the major thing that requires fixing in terms of performance in DataFrames.jl.

bkamins on 12 May 2020

Since you probably know a lot more of the implementation, what is causing the problem?
Hopefully I can start work on a PR, join.jl is a pretty long file

I often want to join with splat join(dfs..., on=:x1) but know that it will crash, even for relatively small dataframes, so there is probably optimization possibilities for this case.
edit: join! makes a lot of sense

anandijain on 12 May 2020

Actually it is the only part of source code that I never touched before.
For sure innerjoin(dfs..., on=:x1) can be optimized (now we just recursively perform a join), but I would start with a single join.

bkamins on 13 May 2020

I'm not sure if this helps, but StructArrays has a utility function to iterate pairs of ranges i0:i1, j0:j1 such that lkeys[sortperm(lkeys)[i0:i1]] and rkeys[sortperm(rkeys)[j0:j1]] are all the entries with a given value of keys. It works best if lkeys, rkeys are StructArrays whose columns are all of isbits elements (i.e. after pooling).

You would call it with:

StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
    lperm=sortperm(lkeys), rperm=sortperm(rkeys))

Again, sortperm(lkeys) should be pretty efficient if all cols are isbits, better if integers. For string arrays with many unique entries, probably better to call radixsort I imagine.

piever on 14 May 2020

👍1

If you are going with the sorting direction than hashing, I guess using the galloping search could be beneficial? When I tried it with simple "dot products" sum(x1 .* x2 .* ... .* xn) on sparse vectors (which is like fused reduce-innerjoin), it was much faster than linear search (which is what GroupJoinPerm does?) when there are more than two vectors. I guess it depends on the distribution of the keys, though.

tkf on 22 May 2020

This may be an aside error, but InterruptException on a join prevents the memory from being freed, essentially forcing a repl restart. Would there be anyway to deallocate on interrupt?

anandijain on 22 May 2020

If this happens it probably should be reported to Julia as it seems to be a bug with GC.

bkamins on 23 May 2020

Was this page helpful?

0 / 5 - 0 ratings