Data.table: join column naming with `on` syntax

Created on 11 May 2016  Â·  14Comments  Â·  Source: Rdatatable/data.table

I remember seeing a similar issue to this, but am failing to find it at the moment.

I find current naming of columns with on syntax to be confusing and counterintuitive. This becomes especially apparent when using non-equi joins. Right now the column name is the _left_ side in an on = .(LHS = RHS) expression and imo it would make a lot more sense for it to be the _right_ side.

X = data.table(pos = 1:3)
#   pos
#1:   1
#2:   2
#3:   3

Y = data.table(start = c(1, 3), end = c(2, 4))
#   start end
#1:     1   2
#2:     3   4

Y[X, on = .(start <= pos, end >= pos)]
#   start end
#1:     1   1
#2:     2   2
#3:     3   3

X[Y, on = .(pos >= start, pos <= end)]
#   pos pos.1
#1:   1     2
#2:   1     2
#3:   3     4

In both cases above it would make a lot more sense imo to have the column name taken from RHS.

enhancement feature request joins top request

Most helpful comment

Just wanted to weigh in on this topic since it keeps cropping up for me when teaching non-equi and rolling joins (Have I found the root issue? I followed a chain of issues closed as duplicates – it looks like this is the original).

Currently the way I teach both non-equi joins and rolling joins is to recommend students to create duplicates of the join columns, e.g.:

dt.x[, by.x.copy := by.x]
dt.y[, by.y.copy := by.y]
joined <- dt.x[dt.y, on = .(by.x.copy <= by.y.copy)]
joined[, by.x.copy := NULL] # remove meaningless and misleading column

That way the information in the non-equi/rolling join key columns from both data.tables are preserved, and the duplicate column actually used in the on argument can be deleted since it is redundant (and at worst, misleading).

As others have proposed, I think there is a strong rationale for treating join key columns differently in the case of non-equi and rolling joins – that is both columns should be kept in the result of the join. There is precedence for this already in the foverlaps() function, which will return both sets of interval columns in the join, while additional keys are treated the same way as in merge() (i.e. only by.x is kept).

My view is that breaking backwards compatibility with merge() in these cases is fine, since it is not possible to perform non-equi or rolling joins with merge().

This would also simplify learning of these joins, since the current behaviour is unintuitive and the source of many conceptual and runtime errors.

All 14 comments

Agree it makes more sense, but I'm not sure if we should change that default.
This issue overlaps with #1615.
Current behavior, the default when j is not provided, is consistent to base R merge (+ making name unique).

base::merge(X, Y, by.x=c("pos","pos"), by.y=c("start","end"))
#Empty data.table (0 rows) of 2 cols: pos,pos

The good thing is that it is just a matter of using j argument, and since it now handles both x. and i. prefixes, it should handle all cases.

Sticking to a poor decision because base does it that way has never been the philosophy of data.table.

I agree that using x. and i. gives a workaround, but also quickly becomes a huge pain if you have many (non-join) columns (that you'll now have to either type out, or will have to change the names of the join columns later).

This is what the linked issue is about, to drop consistency to base R when joining with [.
I usually produce j as quoted expression in such cases. Not really user friendly but its quite straightforward and works like a charm :)

Thanks - that was the issue I was looking for earlier.

I completely agree with @eantonya -- I just ran into the same issue, and I dealt with it by using setnames after the join:

X = data.table(pos = 1:3)
Y = data.table(start = c(1, 3), end = c(2, 4))
joined.badColNames <- X[Y, on = .(pos >= start, pos <= end)]
joined.goodColNames <- data.table(joined.badColNames)
setnames(joined.goodColNames, c("start", "end"))
> joined.badColNames
   pos pos.1
1:   1     2
2:   1     2
3:   3     4
> joined.goodColNames
   start end
1:     1   2
2:     1   2
3:     3   4

it would be more user-friendly if data.table would rename the columns (and it would be backwards-compatible with the setnames code that I have already written)

simple example of non-equi join column name masking

a <- data.table(foo = c(1:5))
b <- data.table(bar = c(1:5))

> a
   foo
1:   1
2:   2
3:   3
4:   4
5:   5
> b
   bar
1:   1
2:   2
3:   3
4:   4
5:   5

It's very odd that these outputs below are not identical. if you read all the related issues, many people expect them to be and keep getting surprised that they are not

identical(a[b, on = .(foo >= bar), .(foo, bar)], a[b, on = .(foo >= bar), .(x.foo, i.bar)])

Just wanted to weigh in on this topic since it keeps cropping up for me when teaching non-equi and rolling joins (Have I found the root issue? I followed a chain of issues closed as duplicates – it looks like this is the original).

Currently the way I teach both non-equi joins and rolling joins is to recommend students to create duplicates of the join columns, e.g.:

dt.x[, by.x.copy := by.x]
dt.y[, by.y.copy := by.y]
joined <- dt.x[dt.y, on = .(by.x.copy <= by.y.copy)]
joined[, by.x.copy := NULL] # remove meaningless and misleading column

That way the information in the non-equi/rolling join key columns from both data.tables are preserved, and the duplicate column actually used in the on argument can be deleted since it is redundant (and at worst, misleading).

As others have proposed, I think there is a strong rationale for treating join key columns differently in the case of non-equi and rolling joins – that is both columns should be kept in the result of the join. There is precedence for this already in the foverlaps() function, which will return both sets of interval columns in the join, while additional keys are treated the same way as in merge() (i.e. only by.x is kept).

My view is that breaking backwards compatibility with merge() in these cases is fine, since it is not possible to perform non-equi or rolling joins with merge().

This would also simplify learning of these joins, since the current behaviour is unintuitive and the source of many conceptual and runtime errors.

Initially it should be introduced only as an option to minimise potential for breaking any existing code

hi all just wondering if you can prioritize this issue /plan to include in an upcoming release? I also would like to include this in my teaching but it is needlessly complicated with current version (the column names returned are confusing)

I think the solution might be already waiting in PRs, if so we should put it into next release.

There is, but no commits since Oct 2018 -- may take some time to de-conflict & rebase to current master

https://github.com/Rdatatable/data.table/pull/3093/commits

I've been intending to revist that PR, but wasn't really sure where we had
gotten to in terms of implementing the desired functionality.

On Fri, 3 Apr 2020 at 13:53, Michael Chirico notifications@github.com
wrote:

There is, but no commits since Oct 2018 -- may take some time to
de-conflict & rebase to current master

https://github.com/Rdatatable/data.table/pull/3093/commits

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/1700#issuecomment-608416625,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAMYZE2OWZFKEONFVRHF3KTRKXL4XANCNFSM4CDR6EFA
.

I think the important thing is to give option for backward compatibility.

Was this page helpful?
0 / 5 - 0 ratings