Data.table: dcast not working properly with a large dataset

Created on 1 Feb 2019 · 6 Comments · Source: Rdatatable/data.table

Given this sample data:

dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", 
    "a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", 
    "b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
    ), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 
    0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119, 
    0.690660491345867, 0.23378944873769)), class = c("data.table", 
    "data.frame"), row.names = c(NA, -9L))

This code creates several variables from the variable E, as expected:

library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)))
#   A  B ID        E1        E2        E3
#1 a1 b1  1 0.4069439 0.3526222 0.2337894
#2 a1 b2  3 0.6211421 0.3943915 0.5505793
#3 a2 b2  4 0.7421095 0.7796073 0.6906605

However, when I apply the same code to a larger dataset (available here), which is the actual data I want to reshape, data.table does not give the expected output, as illustrated here:

library(readr)
mydata <- read_csv("mydata.csv")  
library(data.table)
myDT<-dcast(mydata, A + B + ID ~ paste0("E", rowid(ID)))
View(myDT) 

Thanks in advance.

All 6 comments

You said "as illustrated here" but the link doesn't illustrate what's wrong. What's wrong with the output?

Without a reproducible example, this sounds like a user error -- are you sure the larger data is as regular as you think it is?

Hi, I just changed the link, so a reproducible example is now provided. Thanks for the feedback. Yes, the larger dataset is regular - I created it using tidybayes. Any solution?

Also, @MichaelChirico, could you please expand on the suggestion "Look at mydata[ , table(rowid(ID))]" made here?
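For context, here is a minimal sketch of what that check reports, using only the ID column of the sample data from the question (the name counts is introduced here for illustration). rowid(ID) numbers the rows within each ID, so table(rowid(ID)) counts how many IDs have a first row, a second row, and so on. If the data is regular, all counts should be equal; a smaller count at a higher position would suggest that some IDs have fewer rows, which would show up as NA cells after dcast.

```r
library(data.table)

# ID column from the sample data in the question
dT <- data.table(ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"))

# Running row number within each ID
dT[, rowid(ID)]
# [1] 1 1 2 1 2 3 2 3 3

# How many IDs reach each position; equal counts mean every ID
# has the same number of rows
counts <- dT[, table(rowid(ID))]
print(counts)

# Dropping one row makes the data irregular: position 3 is now
# reached by only two of the three IDs
irregular <- dT[-9, table(rowid(ID))]
print(irregular)
```

On the larger dataset, the same call would be applied to a data.table version of mydata, since the [ , j] syntax used here is data.table's.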

Is the problem that mydata is not a data.table?

You have loaded the data using readr, which loads it as a data.frame. When you call dcast on a data.frame, it runs reshape2::dcast. The issue has nothing to do with data.table AFAICT. And comparing the results of reshape2::dcast and data.table's dcast after converting the original object to a data.table, I get identical results:

require(readr)
# if reshape2 exists and input to dcast is data.frame, data.table automatically calls reshape2::dcast
require(data.table)
df <- read_csv("mydata.csv")
dt <- fread("mydata.csv")
df_ans <- dcast(df, A + B + ID ~ paste0("E", rowid(ID)))
dt_ans <- dcast(dt, A + B + ID ~ paste0("E", rowid(ID)))
setkey(dt_ans, NULL)
all.equal(dt_ans, as.data.table(df_ans))
# [1] TRUE

Please provide a clear minimal reproducible example where your data is a data.table, and point out what the expected output is and what the difference is with the output you obtain.

Hi, @arunsrinivasan. You are right. I loaded the data using readr which, as you say, loads it as a data.frame. Importantly, I think the behaviour noted in your code comment ("if reshape2 exists and input to dcast is data.frame, data.table automatically calls reshape2::dcast") was what caused the problem. Taking that into account, I have been able to solve the problem. Thanks a lot for this tremendous help.
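The thread does not show the final code, so the following is only a sketch of one way to apply that resolution: convert the object to a data.table before calling dcast, so that data.table's own method is the one that runs. The data.frame df below is a hypothetical stand-in for the result of read_csv, built from rounded values of the sample data, and value.var = "E" is stated explicitly here rather than left for dcast to guess.

```r
library(data.table)

# Hypothetical stand-in for an object loaded with readr::read_csv():
# a data.frame that is not a data.table
df <- data.frame(
  A  = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
  B  = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
  ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
  E  = c(0.62, 0.74, 0.39, 0.41, 0.78, 0.55, 0.35, 0.69, 0.23),
  stringsAsFactors = FALSE
)

is.data.table(df)   # FALSE

# as.data.table() makes a converted copy; setDT(df) would instead
# convert df in place, by reference
dt <- as.data.table(df)
is.data.table(dt)   # TRUE

# dcast now dispatches to data.table's method
wide <- dcast(dt, A + B + ID ~ paste0("E", rowid(ID)), value.var = "E")
print(wide)
```

Reading the file with fread("mydata.csv"), as in the comparison above, avoids the conversion step because fread returns a data.table directly.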
