Dataframes.jl: Memory efficiency of join

Created on 29 Dec 2017  ยท  10Comments  ยท  Source: JuliaData/DataFrames.jl

Currently join copies all data to a new table (correct me please if I am wrong here).

This can be very expensive for huge DataFrames. E.g. data.table in R allows to update a table with columns from other table in-place.

Any opinion if something like this would be a desirable option for :left and :right joins?

non-breaking performance priority

Most helpful comment

I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.

All 10 comments

Why not. I imagine we could call this join!?

(Though I think improving the efficiency of join by using optimized hashing algorithms for each combination of column types (in particular CategoricalArray) should be higher priority.)

I have just investigated h2o benchmarks - I think improving joins performance should be a priority for the near future as we are really slow here.

Couple of notes on my usage of join, that y'all probably know about but:

It causes repl crashes frequently, as once it eats all my ram, it keeps trying to perform join.

I don't have a great fix, but I have to work around by joining subsets of the original array of dataframes, and then joining these.

When I am able to interrupt the join, it doesn't seem like the memory is freed until you kill the repl.

Yes - this is the reality: https://h2oai.github.io/db-benchmark/ (click on join - 50GB tests all fail, 5GB tests are very bad).

If no one else works on it probably I will have a look at the code some time in the future as this is the major thing that requires fixing in terms of performance in DataFrames.jl.

Since you probably know a lot more of the implementation, what is causing the problem?
Hopefully I can start work on a PR, join.jl is a pretty long file

I often want to join with splat join(dfs..., on=:x1) but know that it will crash, even for relatively small dataframes, so there is probably optimization possibilities for this case.
edit: join! makes a lot of sense

Actually it is the only part of source code that I never touched before.
For sure innerjoin(dfs..., on=:x1) can be optimized (now we just recursively perform a join), but I would start with a single join.

I'm not sure if this helps, but StructArrays has a utility function to iterate pairs of ranges i0:i1, j0:j1 such that lkeys[sortperm(lkeys)[i0:i1]] and rkeys[sortperm(rkeys)[j0:j1]] are all the entries with a given value of keys. It works best if lkeys, rkeys are StructArrays whose columns are all of isbits elements (i.e. after pooling).

You would call it with:

StructArrays.GroupJoinPerm(lkeys::StructArray, rkeys::StructArray,
    lperm=sortperm(lkeys), rperm=sortperm(rkeys))

Again, sortperm(lkeys) should be pretty efficient if all cols are isbits, better if integers. For string arrays with many unique entries, probably better to call radixsort I imagine.

If you are going with the sorting direction than hashing, I guess using the galloping search could be beneficial? When I tried it with simple "dot products" sum(x1 .* x2 .* ... .* xn) on sparse vectors (which is like fused reduce-innerjoin), it was much faster than linear search (which is what GroupJoinPerm does?) when there are more than two vectors. I guess it depends on the distribution of the keys, though.

This may be an aside error, but InterruptException on a join prevents the memory from being freed, essentially forcing a repl restart. Would there be anyway to deallocate on interrupt?

If this happens it probably should be reported to Julia as it seems to be a bug with GC.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bbrunaud picture bbrunaud  ยท  3Comments

garborg picture garborg  ยท  8Comments

cormullion picture cormullion  ยท  6Comments

tlienart picture tlienart  ยท  8Comments

gustafsson picture gustafsson  ยท  6Comments