Hi, I have data with up to 200K rows and about 12K-20K unique values in value.var that need to be flattened out using dcast.data.table. With data of this size I get the following error:
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
#   long vectors not supported yet: bmerge.c:51
# In addition: Warning message:
# In setattr(l, "row.names", .set_row_names(length(l[[1L]]))) :
#   NAs introduced by coercion to integer range
When I reduce the number of rows to 20K, the same syntax as above works fine.
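My rough guess (which may well be wrong, or may point at the current implementation rather than a hard limit) is that the full RowId x Feature grid simply exceeds R's single-vector limit at this size, while the 20K-row case stays well under it:
# Back-of-the-envelope arithmetic only -- an assumption on my part, not a confirmed diagnosis
200000 * 12000        # 2.4e9 cells in the full grid
20000 * 12000         # 2.4e8 cells -- the size that still works
.Machine$integer.max  # 2147483647, the limit for a non-long vector in R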
Below is code to generate the dummy data:
library(data.table)

# Define a function to return a random string
getRandString <- function(len = 12) {
  paste0("T", paste(sample(c(rep(0:9, each = 5), LETTERS, letters), len, replace = TRUE), collapse = ''))
}

# Define parameters for the data.table to generate
nrows   <- 200000
nvalues <- 12000

# Generate unique character values
vrandtext <- c()
for (i in 1:nvalues) {
  vrandtext[i] <- getRandString(4)
}
length(vrandtext)

# Define a function to generate dummy data: each RowId gets 1 or 2 random Features
generatetable <- function(x, vrandtext) {
  subtable <- function(x, row, vrandtext) {
    data.table(row, sample(vrandtext, 1))
  }
  numRepeat <- sample(1:2, 1)
  out <- rbindlist(lapply(1:numRepeat, subtable, row = x, vrandtext = vrandtext))
}

# Generate dummy data
system.time(dt1 <- rbindlist(lapply(1:nrows, generatetable, vrandtext = vrandtext)))
setnames(dt1, c("RowId", "Feature"))
dt1$Feature <- as.factor(dt1$Feature)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
Also, with the flattened table, the object size increases manifold because the single "Feature" column is expanded into thousands of columns. I would like to reduce the size of the resulting table by filling the flattened columns with bit values instead of logical ones. But when I use the following syntax (on a table of 20K rows with ~12K unique values in "Feature"), I get the following error:
library(bit)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) as.bit(1), fill = as.bit(0), value.var = "Feature")
# Error in setDT(ans) :
#   All elements in argument 'x' to 'setDT' must be of same length
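In case it is relevant: a sparse representation would sidestep the memory blow-up entirely, although it means leaving the data.table format. A rough sketch of what I mean (using the Matrix package, with the column names from the dummy data above):
library(Matrix)
# Sketch only: a sparse logical incidence matrix instead of a wide data.table.
# Each TRUE marks that a RowId has a given Feature; absent cells take no memory.
du  <- unique(dt1)                 # drop repeated (RowId, Feature) pairs first
rid <- factor(du$RowId)
m <- sparseMatrix(i = as.integer(rid),
                  j = as.integer(du$Feature),
                  x = TRUE,
                  dims = c(nlevels(rid), nlevels(du$Feature)),
                  dimnames = list(levels(rid), levels(du$Feature)))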
Could you please let me know of the fix for both the above problems?
Thanks!
Hi Arun, did you get a chance to look at the two issues above? Thanks!
Just looked at this issue:
# long vectors not supported yet: bmerge.c:51
This shouldn't be happening for data of these dimensions, and should be fixed. Thanks for spotting this. I'm not sure if I'll be able to invest time on this for this release though :-(. Will see.
Hi, I am also facing the same issue:
dcast(dt, A ~ B, fill = 0, value.var = "col_sum")
Error in dim.data.table(x) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138
This works fine when dt is small but fails when it is large. Can't we classify this as a bug rather than an enhancement?
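A possible stop-gap, sketch only (the A/B/col_sum names are just taken from the call above, and the merge step reintroduces NAs that still need replacing with the fill value): cast the levels of B in chunks and merge the pieces, so no single internal vector has to cover the whole A x B grid.
library(data.table)
# Sketch: cast B in chunks of ~1000 levels and merge, instead of one huge cast
bvals  <- unique(dt$B)
chunks <- split(bvals, ceiling(seq_along(bvals) / 1000))
pieces <- lapply(chunks, function(b)
  dcast(dt[B %in% b], A ~ B, fill = 0, value.var = "col_sum"))
out <- Reduce(function(x, y) merge(x, y, by = "A", all = TRUE), pieces)
# NAs from the merge (A values absent from a chunk) still need to be set to 0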
I have encountered the same error as @niths4u. I am using the development version and it is not fixed there either!
This isn't an enhancement, it's a bug. The "enhancement" label is probably what pushes this issue to the back of the queue. I, like many others, have high-cardinality data that I need to cast. I use data.table for its speed and find that the function is broken.
Please add the "bug" label.
In the fread function, using skip = 2500000000 also raises this error:
NAs introduced by coercion to integer range
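Presumably (this is just my guess, not a confirmed diagnosis) the skip value gets converted to an R integer somewhere internally, and anything above .Machine$integer.max becomes NA with exactly that warning:
# Illustration of the integer-range coercion itself; the file path is hypothetical
as.integer(2500000000)                      # NA, with "NAs introduced by coercion to integer range"
dt <- fread("some_big_file.csv", skip = 2500000000)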
I am also encountering the same error as @niths4u, and I agree with @ljodea: can this be labelled as a bug? data.table is so appealing for its speed on large data sets; if this is a soft limit on the size of data it can handle, it surely makes sense to mark it as something to fix.
The issue is that this cannot be fixed with the way dcast is currently implemented; it will need a rewrite. I've added the bug label (I agree that it is technically a bug, although the current error message is much clearer). I'll have a look at this ASAP, but it will be a lot of work AFAICT.