Data.table: malformed factor resulting from 'by' expression when using melt.data.table in 'j' expression

Created on 11 Jun 2017 · 6 Comments · Source: Rdatatable/data.table

I get an unprintable data.table object; printing it results in the error

Error in as.character.factor(x) : malformed factor

and in some cases (with a large dataset) it also crashes R.

My code looks like this:

require(data.table)

# generate demo dataset
ids   <- letters[1:3]
dates <- 1:2

dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))

dt[, value := rnorm(18)]

# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  print(dt2)

  dt2
}

res <- dt[, f1(.SD), by = date]
#id1 variable      value
#1:   b        b -1.3643635
#2:   c        b  0.5305674
#3:   b        c  0.2023935
#4:   c        c  0.1063894
#id1 variable       value
#1:   a        a -0.35193161
#2:   b        a -0.65570503
#3:   c        a  0.01524152
#4:   a        b -0.07234880
#5:   b        b -1.16267653
#6:   c        b -1.41490080
#7:   a        c  0.10225144
#8:   b        c -0.84277336
#9:   c        c -0.23164772

res
#Error in as.character.factor(x) : malformed factor

In my case I had forgotten to pass the usual variable.factor = FALSE to melt.data.table(); with it, everything works. But this behavior surprises me. It only appears when the sets of factor levels differ between groups (for date=1 and date=2 the ids are different sets), so if I skip the line

dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

it works fine.
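A minimal sketch of the workaround described above. It assumes the generic dcast() and melt() dispatch to the data.table methods, and f2 is a hypothetical name for the adjusted function. With variable.factor = FALSE the melted variable column is character, so there are no per-group factor levels that could disagree.

```r
library(data.table)

set.seed(1)

# Same shape as the demo dataset: 2 dates x 3 ids x 3 ids
dt <- CJ(date = 1:2, id1 = letters[1:3], id2 = letters[1:3])
dt[, value := rnorm(.N)]

# Drop id "a" for date 1 so the two groups see different id sets
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

# f2 is a hypothetical variant of f1 that keeps 'variable' as character
f2 <- function(sdt) {
  wide <- dcast(sdt, id1 ~ id2, value.var = "value")
  melt(wide, id.vars = "id1", variable.factor = FALSE)
}

res <- dt[, f2(.SD), by = date]
print(res)
```

If a factor is needed afterwards, it can be created once on the combined result, for example with res[, variable := factor(variable)].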

I am on the latest stable version of data.table.

My sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.10

...

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0

All 6 comments

In my opinion, combining vectors with inconsistent attributes is a bad idea, and it would be reasonable to leave the responsibility for handling that with the user.

Some other examples:

  • data.table(g = 1:2)[, factor(letters[.GRP]), by=g]
  • c(factor("a"), factor("b"))

On the other hand, rbind/rbindlist apparently has special handling for factors: rbind(data.table(x = factor("a")), data.table(x = factor("b")))
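The rbind case mentioned above can be checked with a short sketch. It only uses the call quoted in the comment and inspects the result; the variable name combined is an illustrative choice.

```r
library(data.table)

# Two one-row tables whose factor columns have different level sets
combined <- rbind(data.table(x = factor("a")), data.table(x = factor("b")))

# The levels of both inputs are merged, so the values survive intact
print(levels(combined$x))
print(as.character(combined$x))
```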

I somewhat agree about the responsibility (I usually avoid factors altogether), but for an ordinary R user I think this is unexpected behavior in two ways:

a) the error message about the corrupted factor appears only when printing the data.table.
b) with my original dataset (biggish but not really huge), this code always crashes my R session without any error message.

Only after I tried View(res) instead of print(res) did I get the error message, which finally gave me a clue about what was going on.
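Since the error only surfaces on printing, one way to check a result without printing it is to compare the integer codes against the number of levels. This is a rough sketch: is_malformed_factor is a hypothetical helper, not part of data.table, and the broken factor here is built by hand with structure() purely for illustration.

```r
# Hypothetical helper: TRUE if any integer code falls outside 1..nlevels
is_malformed_factor <- function(f) {
  if (!is.factor(f)) return(FALSE)
  codes <- as.integer(f)
  any(codes < 1L | codes > nlevels(f), na.rm = TRUE)
}

# Hand-built malformed factor: code 3 has no matching level
bad  <- structure(c(1L, 2L, 1L, 2L, 3L), levels = c("a", "b"), class = "factor")
good <- factor(c("a", "b"))

bad_result  <- is_malformed_factor(bad)
good_result <- is_malformed_factor(good)
```

Applied to a whole table, something like sapply(res, is_malformed_factor) would flag the affected columns.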

So this crashes my R/RStudio session

require(data.table)

set.seed(2)

# generate demo dataset
ids   <- sample(letters, 20)
dates <- 1:40

dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))

dt[, value := rnorm(length(date))]

# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
dt <- dt[!(date == 4 & (id1 == "e" | id2 == "e"))]

# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  # print(dt2)

  dt2
}

res <- dt[, f1(.SD), by = date]

res

My sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

...

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0   

I was about to post the same issue. In addition to the arguments above, I think the issue should be addressed because it can lead to silent errors in data analysis, without any error or warning message. I found this behaviour because it gave me strange results in a study.

As factor variables are one of the fundamental data types in R and one of the first things encountered by new R users, many people using them won't even know that such things as "attributes" even exist. Therefore they can't be held responsible for possible problems caused by them.

Similar to the other poster, this issue can also crash R with a fatal error on my computer.

Here is a further example, in addition to Frank's (which showed that the problem isn't due to melt or dcast), that also reproduces the malformed factor error. The problem lies in the internal dogroups.

> DT = data.table(A=1:2)
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
       A     V1
   <int> <fctr>
1:     1      a
2:     1      b
3:     2      a   # wrong silently
4:     2      b   # wrong silently
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("a","b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
Error in as.character.factor(x) : malformed factor
> unclass(ans$V1)
[1] 1 2 1 2 3
attr(,"levels")
[1] "a" "b"
> 
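A user-side workaround for the grouped case above, sketched under the assumption that the factor can be rebuilt after grouping: return character values from j and convert to a factor once on the combined result. The function g is the second definition from the example; everything else is illustrative.

```r
library(data.table)

DT <- data.table(A = 1:2)

# Per-group factors with different level sets, as in the example above
g <- function(x) {
  if (x == 1L) factor(c("a", "b")) else factor(c("a", "b", "c"))
}

# Return character from each group so no per-group levels are involved
ans <- DT[, .(V1 = as.character(g(.GRP))), by = A]

# Build the factor once, over all groups
ans[, V1 := factor(V1)]
print(ans)
```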

I was facing a similar problem in frollapply, where we call R's C-level eval for every single row (here, for every single group).
Fixing it will require detecting whether factor fields are among the results and synchronising their values and attributes. It is far from trivial.
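To illustrate what synchronising values and attributes means, here is an R-level sketch. It is not how data.table does or would implement it internally; sync_factors is a hypothetical helper that takes the union of all levels and remaps each group's integer codes onto it.

```r
# Hypothetical helper: combine per-group factors onto one shared level set
sync_factors <- function(parts) {
  # Union of levels across all groups, in order of first appearance
  all_levels <- unique(unlist(lapply(parts, levels)))

  # Remap each group's codes from its own levels to the shared levels
  codes <- unlist(lapply(parts, function(f) {
    match(levels(f), all_levels)[as.integer(f)]
  }))

  structure(codes, levels = all_levels, class = "factor")
}

parts <- list(factor(c("a", "b")), factor(c("b", "c")))
combined <- sync_factors(parts)
print(combined)
```

The remapping step is what the first grouped example is missing: without it, the second group's codes 1 and 2 are read against the first group's levels and come out as "a" and "b" instead of "b" and "c".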
