Data.table: Expand allow.cartesian error hints for non-equi joins

Created on 28 Mar 2017  路  7Comments  路  Source: Rdatatable/data.table

Taken from a good question on SO...

library(data.table)
DT = setDT(structure(list(City = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L), .Label = c("LA", "NY"), class = "factor"), YearFrom = c(2001L, 
2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L), YearTo = c(NA, 
2005L, NA, NA, 2009L, NA, 2008L, NA, NA), quasiYearTo = c(2017L, 
2005L, 2017L, 2017L, 2009L, 2017L, 2008L, 2017L, 2017L)), .Names = c("City", 
"YearFrom", "YearTo", "quasiYearTo"), row.names = c(NA, -9L), class = "data.frame"))


   City YearFrom YearTo quasiYearTo
1:   NY     2001     NA        2017
2:   NY     2003   2005        2005
3:   NY     2002     NA        2017
4:   NY     2006     NA        2017
5:   NY     2008   2009        2009
6:   LA     2004     NA        2017
7:   LA     2005   2008        2008
8:   LA     2005     NA        2017
9:   LA     2002     NA        2017

The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]


    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3

However, if I drop allow.cartesian, I get the usual error message:

DT[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), .N, by=.EACHI]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 47 rows; more than 27 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

So my i has no duplicates:

DT[, CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE)][, 
  anyDuplicated(.SD)] # 0

...and I am using by=.EACHI already. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.

An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and by=.EACHI, since that case seems pretty safe.

enhancement non-equi joins

Most helpful comment

Closed by #4493. Related to #4489. 1.12.9 will make allow.cartesian = TRUE for non-equi joins which is even better than a plain old hint :)

@fedyakov, there's also #3009 for joins that exceed 2^31 but are aggregated for .N. Then separately, there is #3957 which is explicitly about data.table allowing for more than 2^31 rows.

All 7 comments

Fetching indices for non-equi joins is slightly more complicated and requires (at times) a reordering, which requires populating them entirely first. Even though it is rare, there might be cases where this triggers the allow.cartesian threshold. It could be sidestepped by checking for by=.EACHI.

I'm trying to non-equi join two really big tables to count sizes of groups. Cartesian join exceeds the 2^31 limit. So the only possibility for me is to group by .EACHI. Unfortunately, it fails with the mentioned error.

It's a critical bug for me. Please, consider fixing it.

Here is a tiny example to reproduce the error:

dt1 = data.table(a1 = 4:1, b1 = 1:2)
dt2 = data.table(a2 = 5:6, b2 = 3)
dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 8 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

@fedyakov as temporary fix just add , allow.cartesian = TRUE as in

dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI, allow.cartesian = TRUE]
#    a1 b1 N
# 1:  5  3 4
# 2:  6  3 4

Unless you've tried that and it fails?

Unless you've tried that and it fails?

I don't have the reproducible example now, but as far as I remember it did not work. Cartesian join exceeds the 2^31 limit.

Closed by #4493. Related to #4489. 1.12.9 will make allow.cartesian = TRUE for non-equi joins which is even better than a plain old hint :)

@fedyakov, there's also #3009 for joins that exceed 2^31 but are aggregated for .N. Then separately, there is #3957 which is explicitly about data.table allowing for more than 2^31 rows.

@ColeMiller1 My case is about .EACHI aggregation primarily. It doesn't require 64-bit vectors at all since aggregation could be done alongside joining in a single stage and the result fits in 邪 32-bit vector. Could you consider a separate issue to fix this special case?

@fedyakov you are more than welcome to open an issue. However, issue #3009 is the exact use case you referred to earlier in this thread.

Was this page helpful?
0 / 5 - 0 ratings