Data.table: [R-Forge #2301] Add "merge" or "join" argument to X[Y,...]

Created on 8 Jun 2014  路  8Comments  路  Source: Rdatatable/data.table

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

As highlighted here :
http://stackoverflow.com/questions/12773822/why-does-xy-join-of-data-tables-not-allow-a-full-outer-join-or-a-left-join

Not just for syntax but for the inspection of j advantage of X[Y,j,merge=TRUE].

duplicate feature request joins

Most helpful comment

what about X[Y, nomatch=c(0, NA)].
pydatatable adopted interesting style for defining join, it scales well with number of options as they don't clutter main query arguments.

All 8 comments

Is a "full outer join" (like this FR asks for) already implemented in another way than this FR asks for?

If not I would like to "vote" for this feature to be implemented.
I could imagine naming the parameter "fullouter" (boolean) which is more intuitive than "merge".
Or stick to the merge parameters (all.x, all.y, all)...

I think there is another issue to consider on this topic. Specifically, while we can switch X[Y] to Y[X] in order alter the direction of the outer join, in both cases, only the first data.table can be updated. i.e. X[Y] only allows := style updates to X, while Y[X] only supports updates to Y. So, in this sense, simply reversing the order of X and Y is not always a solution

@ebs238 Fwiw, I typically do Y[, v := X[.SD, on=cols, j, by=.EACHI]$V1] for that case.

you can also do a merge to get the join right, then subset/aggregate the result of that merge and write it back. However, the assertion (in the SO threads that spawned this) that X[Y] and Y[X] are substitutable to accommodate outer join direction is not accurate. They are substitutable from a join direction perspective, but not from an overall DT functionality perspective. Conceptually, the direction of the outer join, and the choice of which DT to update are independent decisions.

Not suggesting this is a critical issue, as it can be worked around, Just suggesting this aspect be considered as part of this issue

If we wanted to implemented left / full joins, perhaps the most "data.table" way would be to add a nomatch.x argument (default = 0L). Then we would have:

X right join Y / X subset on Y (default):
X[Y] # nomatch = NA, nomatch.x = 0L

X inner join Y / X subset on Y, ignoring rows in Y with no match in X:
X[Y, nomatch = 0L] # nomatch.x = 0L

X full join Y:
X[Y, nomatch.x = NA] # nomatch = NA by default

X left join Y:
X[Y, nomatch = 0L, nomatch.x = NA]

what about X[Y, nomatch=c(0, NA)].
pydatatable adopted interesting style for defining join, it scales well with number of options as they don't clutter main query arguments.

@jangorecki I guess it would have to be list(NULL, NA) with the new nomatch argument style? Anyway I like the idea.

After reading linked SO in details it seems that requested feature here is an argument that specify join type (inner/full/left/right) using all, all.x, etc., so different types of join can be supported inside [.data.table. Then the issue is basically the same as #3946. The latter one seems to be more nicely defined, will close this one as duplicate.

Was this page helpful?
0 / 5 - 0 ratings