Taken from a good question on SO...
library(data.table)
DT = setDT(structure(list(City = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L), .Label = c("LA", "NY"), class = "factor"), YearFrom = c(2001L,
2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L), YearTo = c(NA,
2005L, NA, NA, 2009L, NA, 2008L, NA, NA), quasiYearTo = c(2017L,
2005L, 2017L, 2017L, 2009L, 2017L, 2008L, 2017L, 2017L)), .Names = c("City",
"YearFrom", "YearTo", "quasiYearTo"), row.names = c(NA, -9L), class = "data.frame"))
City YearFrom YearTo quasiYearTo
1: NY 2001 NA 2017
2: NY 2003 2005 2005
3: NY 2002 NA 2017
4: NY 2006 NA 2017
5: NY 2008 2009 2009
6: LA 2004 NA 2017
7: LA 2005 2008 2008
8: LA 2005 NA 2017
9: LA 2002 NA 2017
The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:
df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE,
.N
, by=.EACHI][, .(City, Year = YearFrom, N)]
City Year N
1: LA 2001 0
2: LA 2002 1
3: LA 2003 1
4: LA 2004 2
5: LA 2005 4
6: LA 2006 4
7: LA 2007 4
8: LA 2008 3
9: LA 2009 3
10: NY 2001 1
11: NY 2002 2
12: NY 2003 3
13: NY 2004 3
14: NY 2005 2
15: NY 2006 3
16: NY 2007 3
17: NY 2008 4
18: NY 2009 3
However, if I drop allow.cartesian, I get the usual error message:
DT[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), .N, by=.EACHI]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 47 rows; more than 27 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
So my i has no duplicates:
DT[, CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE)][,
anyDuplicated(.SD)] # 0
...and I am using by=.EACHI already. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.
An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and by=.EACHI, since that case seems pretty safe.
Fetching indices for non-equi joins is slightly more complicated and requires (at times) a reordering, which requires populating them entirely first. Even though it is rare, there might be cases where this triggers the allow.cartesian threshold. It could be sidestepped by checking for by=.EACHI.
I'm trying to non-equi join two really big tables to count sizes of groups. Cartesian join exceeds the 2^31 limit. So the only possibility for me is to group by .EACHI. Unfortunately, it fails with the mentioned error.
It's a critical bug for me. Please, consider fixing it.
Here is a tiny example to reproduce the error:
dt1 = data.table(a1 = 4:1, b1 = 1:2)
dt2 = data.table(a2 = 5:6, b2 = 3)
dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 8 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
@fedyakov as temporary fix just add , allow.cartesian = TRUE as in
dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI, allow.cartesian = TRUE]
# a1 b1 N
# 1: 5 3 4
# 2: 6 3 4
Unless you've tried that and it fails?
Unless you've tried that and it fails?
I don't have the reproducible example now, but as far as I remember it did not work. Cartesian join exceeds the 2^31 limit.
Closed by #4493. Related to #4489. 1.12.9 will make allow.cartesian = TRUE for non-equi joins which is even better than a plain old hint :)
@fedyakov, there's also #3009 for joins that exceed 2^31 but are aggregated for .N. Then separately, there is #3957 which is explicitly about data.table allowing for more than 2^31 rows.
@ColeMiller1 My case is about .EACHI aggregation primarily. It doesn't require 64-bit vectors at all since aggregation could be done alongside joining in a single stage and the result fits in 邪 32-bit vector. Could you consider a separate issue to fix this special case?
@fedyakov you are more than welcome to open an issue. However, issue #3009 is the exact use case you referred to earlier in this thread.
Most helpful comment
Closed by #4493. Related to #4489. 1.12.9 will make
allow.cartesian = TRUEfor non-equi joins which is even better than a plain old hint :)@fedyakov, there's also #3009 for joins that exceed 2^31 but are aggregated for .N. Then separately, there is #3957 which is explicitly about
data.tableallowing for more than 2^31 rows.