Data.table: dim() on 0-column data.table produced in non-data.table-aware package is wrong

Created on 16 Oct 2017 · 14 Comments · Source: Rdatatable/data.table

If a data.table is passed to a function of a non-data.table-aware package and is subset there such that a 0-column data.table/data.frame is produced, dim() on that data.table/data.frame falsely reports 0 rows.

library(data.table)
X <- data.table(a = 1:10)

# imitate subsetting in a function of a non-data.table-aware package
Y <- `[.data.frame`(X, , character(), drop = FALSE)

dim.data.frame(Y)  # returns c(10, 0)
dim(Y)  # returns c(0, 0), also inside a function of a non-data.table-aware package

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-2

loaded via a namespace (and not attached):
[1] tools_3.4.2 yaml_2.1.14

akersting

👍1

Most helpful comment

The bug is not whether or not the "actual" dimension is (0, 0), but whether or not it should be. data.table represents zero-row-nonzero-column tables just fine (as a nonempty list of zero-length vectors), but does not go on to represent nonzero-row-zero-column tables as empty lists of nonempty vectors (think about it). In that sense, Y's internal state is entirely consistent; data.table arguably just chose to interpret it as dimension (0, 0).

The behaviour of data.frame in these cases, which is

> iris[character(0)]
data frame with 0 columns and 150 rows

has very desirable properties, for example one has

all.equal(cbind(df[x], df[y]), df[c(x, y)])

Even though X by 0 and 0 by X data.frames or matrices contain no data, they make edge case behaviour more consistent and are useful for package development (even though they may not help much in interactive sessions).
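
The identity above can be checked in base R for the edge case where one of the selections is empty. This is a minimal sketch, assuming iris as the example data; x and y are illustrative column selections, not anything fixed by the comment:

```r
# Base R only: column-binding two column subsets should equal subsetting
# once with the combined selection, even when one selection is empty.
df <- iris
x <- character(0)                    # empty selection (the edge case)
y <- c("Sepal.Length", "Species")    # illustrative non-empty selection

lhs <- cbind(df[x], df[y])
rhs <- df[c(x, y)]

identity_holds <- isTRUE(all.equal(lhs, rhs))
print(dim(df[x]))        # the zero-column piece keeps its row count
print(identity_holds)
```

The check relies on df[x] keeping its row count, so cbind needs no special handling for the empty piece.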

mb706 on 21 Apr 2018

👍3

All 14 comments

Actually, Y _does_ have dimension (0, 0) (i.e., the error is not dim's fault):

dput(Y)
# structure(list(), .Names = character(0), 
#           class = c("data.table", "data.frame"), row.names = c(NA, -10L))

In fact, Y is not internally consistent, since row.names has retained the 10-row structure (which is what dim.data.frame uses to get its (10,0)), but the table itself is empty.

print(Y)
# Null data.table (0 rows and 0 cols)

row.names(Y)
# [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
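
For comparison, here is a base R sketch of where dim.data.frame gets its row count. It assumes a plain data.frame (no data.table class involved) and the illustrative name Z, so it only shows the data.frame side of the discrepancy:

```r
# A zero-column data.frame built by column subsetting in base R.
Z <- data.frame(a = 1:10)[, character(0), drop = FALSE]

n_cols <- length(Z)                # number of list elements (columns)
n_rows <- .row_names_info(Z, 2L)   # row count taken from the row.names attribute
print(c(n_rows, n_cols))
print(dim(Z))                      # dim.data.frame combines the same two numbers
```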

I don't know what the result of X[ , character(0), with = FALSE] _should_ be, TBH

MichaelChirico on 17 Oct 2017

👍1

The behaviour of data.frame in these cases, which is

> iris[character(0)]
data frame with 0 columns and 150 rows

has very desirable properties, for example one has

all.equal(cbind(df[x], df[y]), df[c(x, y)])

mb706 on 21 Apr 2018

👍3

I understand your point, but I still prefer c(0, 0) as the correct answer. Rows are _children_ of columns; if there are no columns, there should be no rows returned. As Michael pointed out, it looks more like a bug in R. This is a little problematic because there is not much control over how a non-data.table-aware package will process user data. Eventually, a good solution would be to handle this edge case by detecting whether a call like df[character()] (resulting in a 0-column data.table) was made from a non-data.table-aware package and then making an exception.

jangorecki on 21 Apr 2018

Rows are children of columns; if there are no columns, there should be no rows returned

That is just a detail about implementation, though. data.table does handle zero row but nonzero column tables just fine (dt[integer(0), ]). An R matrix can also have zero rows xor zero columns (although in the underlying representation, as a simple vector with additional info of dimensionality, the data is a 0-length vector).

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.
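
Both analogies can be seen in base R. A small sketch (the variable names are illustrative only):

```r
# A matrix can have rows but no columns; its data is a 0-length vector.
m <- matrix(numeric(0), nrow = 3, ncol = 0)
print(dim(m))
print(length(m))

# A zero-length vector still carries a type.
v <- integer(0)
print(typeof(v))
print(length(v))
```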

I changed the confusing === notation.

mb706 on 21 Apr 2018

Having zero column tables with nonzero rows is similar to having zero-length vectors that still have a type.

My understanding is

Having zero rows table with nonzero columns is similar to having zero-length vectors that still have a type.

Also my comment from linked issue:

In columnar storage, a row is a child of a column. Without a column, no rows exist. This makes sense for multidimensional structures like vector/matrix/array but not data.frames.

jangorecki on 6 Feb 2019

A data.frame is a rectangular list of vectors with the same lengths. Every data.frame knows its own height, and we cannot add a new column with a different length. This means that we cannot add a nonzero-length vector to an existing zero-row data.frame:

empty_df = data.frame()
empty_df$newcol = seq_len(nrow(iris))    # Error! Height is incompatible
#> Error in `$<-.data.frame`(`*tmp*`, newcol, value = 1:150): replacement has 150 rows, data has 0

df = iris[,FALSE]
print(df)
#> data frame with 0 columns and 150 rows
attributes(df)                           # row.names are preserved
#> $names
#> character(0)
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
#>  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
#>  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
#>  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
#>  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
#>  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
#> [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
#> [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
#> [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
df$newcol = seq_len(nrow(iris))          # OK, df is derived from iris

dt = data.table::data.table(iris)[,FALSE]
print(dt)
#> Null data.table (0 rows and 0 cols)
attributes(dt)                           # row.names are discarded
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $row.names
#> integer(0)
#> 
#> $names
#> character(0)
#> 
#> $.internal.selfref
#> <pointer: 0x7f943b01c2e0>
dt$newcol = seq_len(nrow(iris))          # Error! even if dt is derived from iris
#> Error in `[<-.data.table`(x, j = name, value = value): Cannot use := to add columns to a null data.table (no columns), currently. You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns), though.

Created on 2019-02-06 by the reprex package (v0.2.1)

Keeping dim()[1L] and row.names seems more reasonable to me.

heavywatal on 6 Feb 2019

In other words,
Row subsetting should keep column number.
Column subsetting should keep row number.

tbl = tibble::as_tibble(iris)
dt = data.table::data.table(iris)

ncol(tbl[seq_len(3L),])
#> [1] 5
ncol(tbl[integer(0L),])
#> [1] 5
ncol(dt[seq_len(3L),])
#> [1] 5
ncol(dt[integer(0L),])
#> [1] 5

nrow(tbl[,"Species"])
#> [1] 150
nrow(tbl[,FALSE])
#> [1] 150
nrow(dt[,"Species"])
#> [1] 150
nrow(dt[,FALSE])        # Surprise!
#> [1] 0

Created on 2019-02-06 by the reprex package (v0.2.1)

heavywatal on 6 Feb 2019

What would also be useful is to show the practical, existing implications of both approaches. For example, if some package breaks, then provide a reproducible example.
Row names are nothing but dimension names, which unfortunately attempt to mimic a matrix, where dimension names are perfectly justified. But data.frame is not a multidimensional data structure of any particular dimension (like a vector, matrix, or array) but a list of independent one-dimensional structures: vectors. The restriction that those vectors have to maintain equal length doesn't change much. Dimension names do not fit into the data.frame concept. Without particular practical implications of that, I am not convinced.

jangorecki on 6 Feb 2019

OK, let's forget about row.names (I don't like it either) and focus on dim(x)[1L]. I am currently working on a thin igraph wrapper with Rcpp. Edges and vertices themselves are stored in an igraph_t object, and their attributes such as names and weights are stored in data.frames, say, Eattr and Vattr. Their row numbers should always remain the same as the edge and vertex numbers, respectively. In this scenario, it is quite natural to start from (and sometimes shrink to) zero-column nonzero-row data.frames. If that were not allowed, I would have to switch between two different methods: one for adding a new column to a non-empty data.frame, and one for adding a first column to a dim c(0, 0) data.frame or null placeholder.
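
The workflow described here can be sketched in base R. This is only an illustration under stated assumptions: the igraph side is left out, and n_vertices plus the column names name and weight are hypothetical stand-ins for the attributes mentioned above.

```r
# Hypothetical attribute table for a graph with 3 vertices; only the
# data.frame bookkeeping is shown, not the igraph object itself.
n_vertices <- 3L
Vattr <- data.frame(id = seq_len(n_vertices))[character(0)]  # 0 columns, 3 rows

# Adding the first column uses the same code path as adding any later one.
Vattr$name <- c("a", "b", "c")
Vattr$weight <- c(0.5, 1.0, 2.0)

# Shrinking back to zero columns keeps the vertex count.
Vattr_empty <- Vattr[character(0)]
print(dim(Vattr))
print(nrow(Vattr_empty))
```

With a base R data.frame, no separate branch is needed for the "first column" case, which is the convenience the comment argues for.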

heavywatal on 6 Feb 2019

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure. That is also why dim returns a vector with two elements.

Anyway, there are two issues with the current behavior of data.table:

It breaks non-data.table-aware packages.
It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

akersting on 23 Feb 2019

@akersting

data.frame is not a multidimensional data structure of any particular dimension

I strongly disagree. A data.frame (and hence also a data.table) is a two-dimensional data structure.

data.frame is a two-dimensional data structure but not a (any particular) case of multidimensional data. Because of that, we can store different data types in different columns. This is not possible for multidimensional data, where a column is no different from a row, a page, or any other name you use in place of the integer sequence that maps data into dimensions. Names like rows, columns, and pages don't really have meaning for multidimensional data; they only map integer dimension indexes in some visual representation. They are used only when you want to format data for output. This is also the reason why applying a transpose function to multidimensional data will never alter the data but only rearrange it along some dimension index, which is not true for data.frames, where transpose can alter data.
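
The transpose point can be illustrated in base R. A small sketch with made-up data (m and df are illustrative names):

```r
# Matrix: transposing only rearranges the values; a round trip is lossless.
m <- matrix(1:6, nrow = 2)
matrix_round_trip <- identical(t(t(m)), m)

# data.frame with mixed column types: t() goes through as.matrix(),
# so the integer column is coerced to character.
df <- data.frame(a = 1:2, b = c("x", "y"))
transposed <- t(df)
print(matrix_round_trip)
print(typeof(transposed))
```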

It is inconsistent in itself, since subsetting on one dimension (columns) might change the other one (rows).

This consistency is exactly what you would expect from multidimensional data, where dimension 1 (let's call it "row") is no different from dimension 2 (let's call it "column"). In data.frames, by contrast, a row is a child of a column.

I am not saying we have to strictly align with the above; we have already made multiple exceptions just for the sake of being consistent with base R.

jangorecki on 24 Feb 2019

jangorecki on 17 May 2019

👍1

I just ran into this issue, and while I understand (and agree there is some merit to) @jangorecki's "nestedness" argument, I still feel the current data.table behavior is counter-intuitive. Out of the big three data frame structures in R (the other two being tibble and base R data.frame), data.table is the only one to interpret a zero-col/nonzero-row data frame in this way. I feel it would make for a smoother user experience if zero-col/nonzero-row data.tables were to become possible.