Hi, I have data with up to 200K rows and about 12K-20K unique values in value.var that need to be flattened out using dcast.data.table. With data of this size I get the following error:
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
#   long vectors not supported yet: bmerge.c:51
# In addition: Warning message:
# In setattr(l, "row.names", .set_row_names(length(l[[1L]]))) :
#   NAs introduced by coercion to integer range
When I reduce the number of rows to 20K, the same syntax as above works fine.
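My rough guess (which may well be wrong, or may point at the current implementation rather than a hard limit) is that the full RowId x Feature grid simply exceeds R's single-vector limit at this size, while the 20K-row case stays well under it:
# Back-of-the-envelope arithmetic only -- an assumption on my part, not a confirmed diagnosis
200000 * 12000        # 2.4e9 cells in the full grid
20000 * 12000         # 2.4e8 cells -- the size that still works
.Machine$integer.max  # 2147483647, the limit for a non-long vector in R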
Below is code to generate the dummy data:
library(data.table)

# Define a function to return a random string
getRandString <- function(len = 12) {
  paste0("T", paste(sample(c(rep(0:9, each = 5), LETTERS, letters), len, replace = TRUE), collapse = ''))
}

# Define parameters for the data.table to generate
nrows   <- 200000
nvalues <- 12000

# Generate unique character values
vrandtext <- c()
for (i in 1:nvalues) {
  vrandtext[i] <- getRandString(4)
}
length(vrandtext)

# Define a function to generate dummy data: each RowId gets 1 or 2 random Features
generatetable <- function(x, vrandtext) {
  subtable <- function(x, row, vrandtext) {
    data.table(row, sample(vrandtext, 1))
  }
  numRepeat <- sample(1:2, 1)
  out <- rbindlist(lapply(1:numRepeat, subtable, row = x, vrandtext = vrandtext))
}

# Generate dummy data
system.time(dt1 <- rbindlist(lapply(1:nrows, generatetable, vrandtext = vrandtext)))
setnames(dt1, c("RowId", "Feature"))
dt1$Feature <- as.factor(dt1$Feature)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) TRUE, fill = FALSE, value.var = "Feature")
Also, with the flattened table, the object size increases manifold because the single "Feature" column is expanded into thousands of columns. I would like to reduce the size of the resulting table by filling the flattened columns with bit values instead of logical ones. But when I use the following syntax (on a table of 20K rows with ~12K unique values in "Feature"), I get the following error:
library(bit)
dt1wide <- dcast.data.table(dt1, RowId ~ Feature, function(x) as.bit(1), fill = as.bit(0), value.var = "Feature")
# Error in setDT(ans) :
#   All elements in argument 'x' to 'setDT' must be of same length
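In case it is relevant: a sparse representation would sidestep the memory blow-up entirely, although it means leaving the data.table format. A rough sketch of what I mean (using the Matrix package, with the column names from the dummy data above):
library(Matrix)
# Sketch only: a sparse logical incidence matrix instead of a wide data.table.
# Each TRUE marks that a RowId has a given Feature; absent cells take no memory.
du  <- unique(dt1)                 # drop repeated (RowId, Feature) pairs first
rid <- factor(du$RowId)
m <- sparseMatrix(i = as.integer(rid),
                  j = as.integer(du$Feature),
                  x = TRUE,
                  dims = c(nlevels(rid), nlevels(du$Feature)),
                  dimnames = list(levels(rid), levels(du$Feature)))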
Could you please let me know of the fix for both the above problems?
Thanks!
Hi Arun, did you get a chance to look at the two issues above? Thanks!
Just looked at this issue:
# long vectors not supported yet: bmerge.c:51
This shouldn't be happening for data of these dimensions, and should be fixed. Thanks for spotting this. I'm not sure if I'll be able to invest time on this for this release though :-(. Will see.
Hi, I am also facing the same issue:
dcast(dt, A ~ B, fill = 0, value.var = "col_sum")
Error in dim.data.table(x) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138
This works fine when dt is small but fails when it is large. Can't we classify this as a bug rather than an enhancement?
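A possible stop-gap, sketch only (the A/B/col_sum names are just taken from the call above, and the merge step reintroduces NAs that still need replacing with the fill value): cast the levels of B in chunks and merge the pieces, so no single internal vector has to cover the whole A x B grid.
library(data.table)
# Sketch: cast B in chunks of ~1000 levels and merge, instead of one huge cast
bvals  <- unique(dt$B)
chunks <- split(bvals, ceiling(seq_along(bvals) / 1000))
pieces <- lapply(chunks, function(b)
  dcast(dt[B %in% b], A ~ B, fill = 0, value.var = "col_sum"))
out <- Reduce(function(x, y) merge(x, y, by = "A", all = TRUE), pieces)
# NAs from the merge (A values absent from a chunk) still need to be set to 0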
I have encountered the same error as @niths4u. I am using the development version and it is not fixed there either!
This isn't an enhancement, it's a bug. The "enhancement" label is probably what pushes this issue to the back of the queue. I, like many others, have high-cardinality data that I need to cast. I use data.table for its speed and find that the function is broken.
Please add the "bug" label.
In the fread function, using skip = 2500000000 also raises this error:
NAs introduced by coercion to integer range
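Presumably (this is just my guess, not a confirmed diagnosis) the skip value gets converted to an R integer somewhere internally, and anything above .Machine$integer.max becomes NA with exactly that warning:
# Illustration of the integer-range coercion itself; the file path is hypothetical
as.integer(2500000000)                      # NA, with "NAs introduced by coercion to integer range"
dt <- fread("some_big_file.csv", skip = 2500000000)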
I am also encountering the same error as @niths4u, and I agree with @ljodea: can this be labelled as a bug? data.table is so appealing for its speed on large data sets; if this is a soft limit on the size of data it can handle, it surely makes sense to mark it as something to fix.
The issue is that this cannot be fixed with the way dcast is currently implemented; it will need a rewrite. I've added the bug label (I agree that it is technically a bug, although the current error message is much clearer). I'll have a look at this ASAP, but it will be a lot of work AFAICT.