Copying from a discussion chain with @MichaelChirico in #4144:
In src/init.c the type order is laid out as follows:
static void setSizes() {
  for (int i=0; i<100; ++i) { sizes[i]=0; typeorder[i]=0; }
  // only these types are currently allowed as column types :
  sizes[LGLSXP] = sizeof(int);       typeorder[LGLSXP] = 0;
  sizes[RAWSXP] = sizeof(Rbyte);     typeorder[RAWSXP] = 1;
  sizes[INTSXP] = sizeof(int);       typeorder[INTSXP] = 2;   // integer and factor
  sizes[REALSXP] = sizeof(double);   typeorder[REALSXP] = 3;  // numeric and integer64
  sizes[CPLXSXP] = sizeof(Rcomplex); typeorder[CPLXSXP] = 4;
  sizes[STRSXP] = sizeof(SEXP *);    typeorder[STRSXP] = 5;
  sizes[VECSXP] = sizeof(SEXP *);    typeorder[VECSXP] = 6;   // list column
  if (sizeof(char *)>8) error(_("Pointers are %d bytes, greater than 8. We have not tested on any architecture greater than 64bit yet."), sizeof(char *));
  // One place we need the largest sizeof is the working memory malloc in reorder.c
}
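For context, here is how that ordering surfaces at the R level. This is a minimal sketch assuming (as the typeorder values above suggest) that rbindlist upcasts a combined column to the input type with the highest typeorder:

library(data.table)
rbindlist(list(data.table(x = TRUE),           # LGLSXP, typeorder 0
               data.table(x = as.raw(42))))$x  # RAWSXP, typeorder 1
# raw outranks logical in this table, so x is upcast to raw:
# [1] 01 2a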
The order of LGLSXP and RAWSXP is in conflict with R's coercion rules:
c(as.raw(0), TRUE)
# [1] FALSE TRUE
As you can see, the raw value is coerced to logical, not the other way round as the ordering in src/init.c suggests.
I believe the coercion rules work this way because a logical vector may contain NA, whereas raw vectors cannot:
as.raw(NA)
# [1] 00
# Warning message:
# out-of-range values treated as 0 in coercion to raw
Changing the order of LGLSXP and RAWSXP in typeorder has minimal impact, breaking only the following tests: 2006.1, 2006.2, and 2129.
These tests all relate to the rbind/rbindlist functionality. Tests 2006.1 and 2006.2 are a simple fix: they simply expect the wrong output. When the typeorder is changed, rbindlist(..., fill = TRUE) with missing raw columns now fills with NA, in line with the behaviour for all other column types, rather than with 00 as in the current expected output. It's not clear to me why 2129 fails, as the input data.tables do not contain raw columns, but the test is for rbind(), so this would take a little debugging.
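For concreteness, here's a sketch of the behaviour those tests pin down; the current 00 fill matches the existing expected output, while the NA fill under the swap is my expectation rather than verified output:

library(data.table)
rbindlist(list(data.table(a = 1L, b = as.raw(7)),
               data.table(a = 2L)), fill = TRUE)$b
# current behaviour: the missing raw entry is filled with 00
# [1] 07 00
# after swapping LGLSXP and RAWSXP it would be filled with NA,
# matching every other column type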
For 2129, that's related to #546 / #3811. As pointed out there, the issue wasn't closed by the PR, so the test failure may well be related to that issue being unresolved.
That example only uses 0 as the raw value, though.
Consider that in base R, c(as.raw(12), as.raw(42), TRUE) converts 12 and 42 to TRUE and loses the distinction between them, which I don't think makes sense. When folk use raw, they're likely to be using the full range 0:255.
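For the record, here's that coercion verbatim in base R:

c(as.raw(12), as.raw(42), TRUE)
# [1] TRUE TRUE TRUE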
In very many (most?) real-world cases, the logical column doesn't contain any NA anyway, and coercing 0:1 to the 0:255 range makes more sense.
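Coercing in the other direction preserves the information, since logical maps cleanly into 0:255:

as.raw(c(FALSE, TRUE))
# [1] 00 01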
I'm aware of base R's hierarchy and I decided to be different.
Where choices like this have been made I think we should resist changing it for the sole reason of being consistent with base R. If there's an example from a user, or a problem demonstrated with real-world data, then that's different. We certainly shouldn't change it after only considering c(as.raw(0), TRUE). And in terms of priorities, raw is a less used type that tends not to be combined with other types anyway, which is why we don't see this coercion aspect coming up in practice.
Ah, I hadn't realised that base R's coercion rules differed depending on the value of the raw, nor that a design decision had been made here. I'm happy for the issue and corresponding PR to be closed; I agree that it doesn't make sense to downcast the non-zero raw values to TRUE.
Thanks for the explanation Matt; let's add something to that effect as a comment there. I looked around for an explanation of the ordering before recommending the switch, but didn't find one.