Data.table: merge method ignores allow.cartesian

Created on 21 May 2019 · 2 Comments · Source: Rdatatable/data.table

allow.cartesian=FALSE appears to be ignored by merge(): the join below still returns every pairing of the duplicated keys.

library(data.table)
d1 = data.table(a=c(1L,1L), b=2:3)
d2 = data.table(a=c(1L,1L), d=3:2)
merge(d1, d2, by="a", allow.cartesian=FALSE)
#   a b d
#1: 1 2 3
#2: 1 2 2
#3: 1 3 3
#4: 1 3 2

All 2 comments

This is consistent with the documentation, since nrow(result) <= nrow(x) + nrow(i) here (4 <= 2 + 2), right?

From ?data.table:

FALSE prevents joins that would result in more than nrow(x)+nrow(i) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in 'x' over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.

Maybe clearer edit:

FALSE prevents joins that would result in more than nrow(x)+nrow(i) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in 'x' over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense, so FALSE does not always prevent a traditional cartesian join.

Related: https://github.com/Rdatatable/data.table/issues/2837, https://github.com/Rdatatable/data.table/issues/2879#issuecomment-389162440
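
To make the threshold concrete, here is a minimal sketch that predicts the inner-join row count and compares it with nrow(x)+nrow(i). The helper join_size() and its argument names are hypothetical (not part of data.table); it only counts rows per key and multiplies the counts, which is an assumption about how to estimate the size, not how data.table checks it internally.

```r
library(data.table)

# Hypothetical helper, not a data.table function: estimates how many rows
# an inner merge(x, y, by = cols) would produce and compares that with the
# documented limit nrow(x) + nrow(y).
join_size = function(x, y, cols) {
  cx = x[, .(n_x = .N), by = cols]       # rows per key in x
  cy = y[, .(n_y = .N), by = cols]       # rows per key in y
  counts = merge(cx, cy, by = cols)      # keys are unique on both sides here
  total = counts[, sum(as.double(n_x) * n_y)]
  limit = nrow(x) + nrow(y)
  list(rows = total, limit = limit, exceeds = total > limit)
}

# The example from the original report: 4 rows, limit 4, so no error.
d1 = data.table(a = c(1L, 1L), b = 2:3)
d2 = data.table(a = c(1L, 1L), d = 3:2)
small = join_size(d1, d2, "a")

# Three duplicated keys on each side: 9 rows, limit 6.
e1 = data.table(a = c(1L, 1L, 1L), b = 2:4)
e2 = data.table(a = c(1L, 1L, 1L), d = 3:1)
large = join_size(e1, e2, "a")
```

With the first pair the predicted size equals the limit, which is why allow.cartesian=FALSE does not complain; with the second pair the predicted size is above the limit.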

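Correct, thanks @franknarf1. With three duplicated keys on each side the join would have 9 rows, more than nrow(x)+nrow(i) = 6, and the error is raised as documented: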

library(data.table)
d1 = data.table(a=c(1L,1L,1L), b=2:4)
d2 = data.table(a=c(1L,1L,1L), d=3:1)
merge(d1, d2, by="a", allow.cartesian=FALSE)
#Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#  Join results in 9 rows; more than 6 = nrow(x)+nrow(i).
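
For completeness, a short sketch of the opposite case, assuming the large multiplicative result is actually wanted: passing allow.cartesian=TRUE to merge() lets the same join through. The variable names res and failed are only for illustration.

```r
library(data.table)

d1 = data.table(a = c(1L, 1L, 1L), b = 2:4)
d2 = data.table(a = c(1L, 1L, 1L), d = 3:1)

# With allow.cartesian=FALSE this join stops with an error (9 rows > 6).
failed = inherits(
  try(merge(d1, d2, by = "a", allow.cartesian = FALSE), silent = TRUE),
  "try-error"
)

# Explicitly allowing it returns all 3 * 3 pairings for a == 1.
res = merge(d1, d2, by = "a", allow.cartesian = TRUE)
nrow(res)
```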