Dataframes.jl: add a `leftjoin!` (or `match!` or `merge!` or whatever it should be called)

Created on 15 May 2020  ยท  11Comments  ยท  Source: JuliaData/DataFrames.jl

I very frequently find myself needing to do

df = leftjoin(df, rdf, on=cols, validate=(false, true))

It's not that big a deal to have a new dataframe (after all, the underlying columns weren't copied) but I find myself having to do this so often (this is probably one of the most commonly used operations for me) that I would really love a mutating function for this. Maybe something like

match!(df, rdf, on=cols)

I started to look into what would be involved in doing this, but _join is pretty monolithic so it wasn't immediately obvious. I'd think that the biggest subtlety would be whether the ordering of rows in df can change (but I don't think they can even as things are?).

Thoughts? Am I the only one who does this all the time?

feature

Most helpful comment

Ok, interesting to hear we had the same thought about this.

I can't think of any reason why it couldn't be efficient. The only potential issue that I thought of with this approach that concerned me is that it might be less efficient to generate the group keys and then hash them rather than just calling some dead-simple broadcasted hash function on the rows before even getting into the groupby stuff. I would think that the most expensive part of all of this is the actual concatenation (we clearly don't want views for the joins), and I don't see any reason why that couldn't be just as fast using groupby than by going a more direct route. I think there may also be some extra "fluff" in the user-facing groupby function to make it nicer to work with and safer that we might strip out for extra efficiency in joins, but I wouldn't think any of that would actually be very expensive.

It seems to me that a groupby is a subroutine of a join, so even if we find some reason that using the existing groupby would be too slow, it might be worth it to modify groupby (admittedly quite tricky since that operation is already impressively fast) or add an alternative groupby algorithm that can be used by joins. As you implied, it would be really cool to give users the ability to do joins on GroupedDataFrames.

All 11 comments

First we should finalize the API for https://github.com/JuliaData/DataFrames.jl/issues/2243 as this is related. When we have a decision on "update" stuff, then this can be implemented at the same time.

leftjoin! should be relatively easy to add and it should just be leftjoin but in-place for its first argument.

If you're open to doing it before 1.0, I'd be happy to work on it after #2243 is finalized, because, like I said, I really use this quite a lot. If you think of it, feel free to post here again when #2243 is ready and I'll likely jump on it.

Sure! I think 1.0 release will be in the end of 2020 so there is plenty of time for it. Thank you!

Note though that for performance reasons this should be a special implementation anyway (as we will not copy left data frame and we will only need to compute the indices for the right data frame)

Yeah, if and when I do it I might look into breaking up _join into some more convenient and re-usable pieces, if you guys are open to that (don't necessarily need to discuss now or here).

Well - we will probably redesign _join anyway. It has a sub-par performance now, but we are not yet sure what we should do to improve it.

I was really tempted to start messing around with _join, but I told myself I have to stay focused and finish what I was already working on. Maybe I'm just being naive, but I feel like it should be easier than split-apply-combine anyway.

I'd really like to rewrite the joins to use groupby, then perform set operations on the keys and then somehow concatenate the views (into Arrays) given by the groupby. Anyway, I'll open a separate issue if I ever really start digging into rewriting the joins, but more likely one of you guys will get to it before I do.

Actually I was considering to define *join for GroupedDataFrame objects (maybe as an additional definition). However, my initial testing has shown that it was slower to create two GroupedDataFrames and then join them than what we have now (but I might have made some inefficient implementation).

However, if we can make it fast, the additional benefit of such an approach would be that doing several joins using the same data frame would be much faster (groups have to be computed only once).

So in general: this direction is very promising and I would like to jointly explore it more soon!

Ok, interesting to hear we had the same thought about this.

I can't think of any reason why it couldn't be efficient. The only potential issue that I thought of with this approach that concerned me is that it might be less efficient to generate the group keys and then hash them rather than just calling some dead-simple broadcasted hash function on the rows before even getting into the groupby stuff. I would think that the most expensive part of all of this is the actual concatenation (we clearly don't want views for the joins), and I don't see any reason why that couldn't be just as fast using groupby than by going a more direct route. I think there may also be some extra "fluff" in the user-facing groupby function to make it nicer to work with and safer that we might strip out for extra efficiency in joins, but I wouldn't think any of that would actually be very expensive.

It seems to me that a groupby is a subroutine of a join, so even if we find some reason that using the existing groupby would be too slow, it might be worth it to modify groupby (admittedly quite tricky since that operation is already impressively fast) or add an alternative groupby algorithm that can be used by joins. As you implied, it would be really cool to give users the ability to do joins on GroupedDataFrames.

As I think of it I see the following performance optimization:

  1. do groupby on left data frame (possibly keeping some extra metadata)
  2. then when doing groupby on the right data frame, but do not do it anew, but "as if" it were further rows of a bigger data frame. In this way you get proper matching of groups between two data frames "for free". Actually for the right data frame you can store lists of rows matching each group number in the left data frame. In this way you can do everything in one pass.

Hm, interesting, but wouldn't that require large modifications to the groupby code since it's not currently set up to track locations across multiple dataframes? I suppose you could create some sort of lazily concatenated DataFrame type instead (or perhaps you mean horizontally concatenating which is more-or-less free, but I don't think would handle the final concatenations of the joined rows for you). I see that this would allow you to easily get your joins just by calling combine, however I don't really understand what real performance benefits it would give you over calling groupby on the two dataframes separately (though I haven't looked at the groupby code very deeply, so I'm ignorant on this subject).

The benefit is. That by calling two groupby you create two vectors, say:

groups1 = [1,1,2,3,4,5,2,1]
groups2 = [1,2,1,2,3,1]

and after this you still need to do mapping between group numbers in groups1 and groups2, e.g. to determine if 1 in groups1 has some matching number in groups2, and if yes, what it would be.
And even after you have these mappings you still need to perform proper row matching.

On the other hand if you are doing it in a single processing step, you do not need to create groups2 at all, but create an array (probably this is not the best data structure, but it is simplest so I use it to show the idea):

groups1 = [1,1,2,3,4,5,2,1]
rows2groups1 = [[1,2], [5], [3, 4], [], []]

telling you which rows of the second data frame map to which groups of the first data frame. Out of such a data structure it is really easy to properly resize first data frame (you just look at the lengths of rows2groups1 entries and the only special case is if the length is 0 in which case you use length 1 ). For the second data frame you just concatenate the vectors from rows2groups1 matching groups in groups1, with the special case that if the vector has length 0 you instead fill-in with missing values.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

xiaodaigh picture xiaodaigh  ยท  7Comments

cormullion picture cormullion  ยท  6Comments

mattBrzezinski picture mattBrzezinski  ยท  5Comments

gustafsson picture gustafsson  ยท  6Comments

rofinn picture rofinn  ยท  3Comments