I get an unprintable data.table object; printing it results in the error
Error in as.character.factor(x) : malformed factor
and in some cases (with a large dataset) it also crashes R.
My code looks like this:
require(data.table)
# generate demo dataset
ids <- letters[1:3]
dates <- 1:2
dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))
dt[, value := rnorm(18)]
# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  print(dt2)
  dt2
}
res <- dt[, f1(.SD), by = date]
#id1 variable value
#1: b b -1.3643635
#2: c b 0.5305674
#3: b c 0.2023935
#4: c c 0.1063894
#id1 variable value
#1: a a -0.35193161
#2: b a -0.65570503
#3: c a 0.01524152
#4: a b -0.07234880
#5: b b -1.16267653
#6: c b -1.41490080
#7: a c 0.10225144
#8: b c -0.84277336
#9: c c -0.23164772
res
#Error in as.character.factor(x) : malformed factor
In my case I had forgotten to pass the usual variable.factor = FALSE to melt.data.table(), which makes it work fine. But this behavior surprises me. It only appears when the sets of factor levels differ between groups (for date=1 and date=2 the ids are different sets), so if I skip the line
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
it works alright.
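For reference, the workaround mentioned above can be sketched as follows. This is a minimal sketch on the same demo data, not a fix for the underlying problem. Calling the melt()/dcast() generics, passing value.var explicitly, and the name f2 are my own choices and not part of the original code.

```r
library(data.table)

set.seed(1)
ids <- letters[1:3]
dt <- CJ(date = 1:2, id1 = ids, id2 = ids)   # named arguments set the column names
dt[, value := rnorm(.N)]
# drop id "a" for date 1 so the two groups see different id sets
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

# f2 is a hypothetical variant of f1: "variable" is returned as character,
# so there are no per-group factor levels that could disagree
f2 <- function(sdt) {
  dt1 <- dcast(sdt, id1 ~ id2, value.var = "value")
  melt(dt1, id.vars = "id1", variable.factor = FALSE)
}

res <- dt[, f2(.SD), by = date]
print(res)
```

Because every group returns a plain character column, the grouped result prints without relying on any factor attributes.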
I am on the latest stable version of data.table.
My sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.10
...
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0
In my opinion, combining vectors with inconsistent attributes is a bad idea, and it would be reasonable to leave the responsibility for handling that with the user.
Some other examples:
data.table(g = 1:2)[, factor(letters[.GRP]), by=g]
c(factor("a"), factor("b"))
On the other hand, rbind/rbindlist apparently has special handling for factors:
rbind(data.table(x = factor("a")), data.table(x = factor("b")))
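Since rbindlist does handle factors, one possible user-side workaround is to run the function per group outside of `[` and stack the pieces with rbindlist. The following is only a sketch under that assumption, reusing the demo data from the original report; value.var is passed explicitly and the names pieces and res2 are made up for illustration.

```r
library(data.table)

set.seed(1)
ids <- letters[1:3]
dt <- CJ(date = 1:2, id1 = ids, id2 = ids)
dt[, value := rnorm(.N)]
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

f1 <- function(sdt) {
  dt1 <- dcast(sdt, id1 ~ id2, value.var = "value")
  melt(dt1, id.vars = "id1")   # "variable" is a factor here
}

# one data.table per date, processed separately
pieces <- lapply(split(dt, by = "date"), f1)

# rbindlist combines the factor levels of all pieces;
# idcol stores the list names ("1", "2") as a character column
res2 <- rbindlist(pieces, idcol = "date")
print(res2)
levels(res2$variable)
```

Note that the date column comes back as character with this approach, so it may need converting if the original type matters.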
I somewhat agree about the responsibility (I usually avoid factors altogether), but for an ordinary R user I think this is unexpected behavior in two ways:
a) the error message about corrupted factors comes only when printing the data.table.
b) with my original dataset (biggish but not really huge), this code always crashes my R session without an error message.
Only after I tried View(res) instead of print(res) did I get the error message, which eventually gave me a clue about what was going on.
So this crashes my R/RStudio session:
require(data.table)
set.seed(2)
# generate demo dataset
ids <- sample(letters, 20)
dates <- 1:40
dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))
dt[, value := rnorm(length(date))]
# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
dt <- dt[!(date == 4 & (id1 == "e" | id2 == "e"))]
# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  # print(dt2)
  dt2
}
res <- dt[, f1(.SD), by = date]
res
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
Matrix products: default
...
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0
I was about to post the same issue. In addition to the arguments above, I think the issue should be addressed because it can lead to silent errors in data analysis without any kind of error or warning message. I discovered this behaviour because it produced strange results in a study of mine.
As factor variables are one of the fundamental data types in R and one of the first things encountered by new R users, many people using them won't know that such things as "attributes" even exist. Therefore they can't be held responsible for possible problems caused by them.
As with the other poster, this issue can also crash R with a fatal error on my computer.
Here is a further example, in addition to Frank's (which showed that it isn't due to melt or dcast), that also produces the malformed factor error. The problem lies in the internal dogroups.
> DT = data.table(A=1:2)
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
A V1
<int> <fctr>
1: 1 a
2: 1 b
3: 2 a # wrong silently
4: 2 b # wrong silently
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("a","b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
Error in as.character.factor(x) : malformed factor
> unclass(ans$V1)
[1] 1 2 1 2 3
attr(,"levels")
[1] "a" "b"
>
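The unclass() output above shows integer codes up to 3 with only two levels, which is what "malformed" means here. A check along those lines could be used to inspect a result before printing it. This is a minimal base R sketch; is_malformed_factor is a hypothetical helper, not a data.table function, and the bad factor is built by hand to mirror the output above.

```r
# Hypothetical helper: TRUE if any integer code has no matching level
is_malformed_factor <- function(x) {
  if (!is.factor(x)) return(FALSE)
  codes <- as.integer(x)   # underlying codes, attributes dropped
  any(codes < 1L | codes > length(levels(x)), na.rm = TRUE)
}

# hand-built factor with the same codes and levels as unclass(ans$V1) above
bad  <- structure(c(1L, 2L, 1L, 2L, 3L), levels = c("a", "b"), class = "factor")
good <- factor(c("a", "b", "a"))

is_malformed_factor(bad)    # TRUE
is_malformed_factor(good)   # FALSE
```

Applied column by column, for example with vapply(ans, is_malformed_factor, logical(1)), this would flag the affected columns without triggering the print error.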
I was facing a similar problem in frollapply, where we call R's C-level eval for every single row (here, for every single group).
Fixing it will require detecting whether factor columns are among the results and synchronising their values and attributes. It is far from trivial.
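To illustrate what synchronising values and attributes means, here is an R-level sketch of combining per-group factors onto one shared level set. It only shows the idea; it is not how dogroups works internally, and combine_factors is a hypothetical helper introduced for this example.

```r
# Hypothetical helper: merge factors from several groups onto a common level set
combine_factors <- function(parts) {
  # union of all levels, in order of first appearance
  all_levels <- unique(unlist(lapply(parts, levels)))
  # go through the labels so that every code is recomputed against all_levels
  labels <- unlist(lapply(parts, as.character))
  factor(labels, levels = all_levels)
}

# the two per-group results from the g() example above
parts <- list(factor(c("a", "b")), factor(c("b", "c")))
combined <- combine_factors(parts)

as.character(combined)   # "a" "b" "b" "c"
as.integer(combined)     # 1 2 2 3
levels(combined)         # "a" "b" "c"
```

Doing the same inside the grouping code is harder because the result column is allocated from the first group, before the levels of later groups are known.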