Data.table: Weird behavior of setcolorder

Created on 2 Sep 2020 · 6 Comments · Source: Rdatatable/data.table

I have this weird behavior on a data.frame that is the result of RSQLite. I have attached it (df.zip) since this is the only way to reproduce the problem.

Please unzip it before use; GitHub requires the file to be zipped.

Here is an annotated script that reproduces the problem with this file:

# load the data.frame
df <- readRDS('df.rds')
class(df)
# [1] "data.frame"

# there are many columns, but the first few are enough to show the problem
df[, 1:5]
#     estate_id mod_timestamp cust_id cust_isOwner cust_publicInfo.name
# 1 -2147474620    1596557895 1575296            1                 <NA>

# now I want to sort the columns (by reference?)
data.table::setcolorder(df, neworder = sort(colnames(df)))

# now let's look at the first few columns
df[, 1:5]
#   estate_id mod_timestamp cust_id cust_isOwner cust_publicInfo.name
# 1   1575296             1       0         <NA>                 <NA>

# (!!) you see here that the fields have changed (i.e. sorted), but not the column names (!!)

# let us take a copy of this table by (!) selecting the columns this time (!)
df2 <- readRDS('df.rds')[, colnames(readRDS('df.rds'))]

# now do the same: order columns
data.table::setcolorder(df2, neworder = sort(colnames(df2)))
df2[, 1:5]
#   cust_id cust_isOwner cust_publicInfo.hasIpiNo cust_publicInfo.ipiNo cust_publicInfo.name
# 1 1575296            1                        0                  <NA>                 <NA>

# (!!) now the columns and fields are ordered as they should (!!)

Any clue on what is going on here? Both readRDS('df.rds') and readRDS('df.rds')[, colnames(readRDS('df.rds'))] are identical, so I don't understand what is causing this column ordering discrepancy.

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Belgium.1252  LC_CTYPE=Dutch_Belgium.1252    LC_MONETARY=Dutch_Belgium.1252 LC_NUMERIC=C                   LC_TIME=Dutch_Belgium.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3    tools_3.6.3       data.table_1.13.0

Note: converting df to a data.table solves the problem, but that raises the question: why doesn't setcolorder() work with this specific data.frame?

DavorJ

All 6 comments

Thanks for reporting.
You may try setDF(setDT(df)) after readRDS or dbGetQuery.
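
A minimal sketch of that workaround, using toy data rather than the attached df.rds (the names DF and DF2 are illustrative assumptions). The idea is to convert to a data.table by reference, reorder, and convert back, so that setcolorder() only ever operates on a data.table:

```r
library(data.table)

# Toy stand-in for the RSQLite result (hypothetical data, not df.rds)
DF  <- data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 <- DF[1L, ]   # row subset of DF, the pattern discussed in this thread

setDT(DF2)                                             # data.frame -> data.table, by reference
setcolorder(DF2, sort(names(DF2), decreasing = TRUE))  # new order: "last", "first"
setDF(DF2)                                             # back to a plain data.frame

names(DF2)   # column names of the reordered copy
names(DF)    # column names of the original, which should be unaffected
```

Whether this is needed after every readRDS or dbGetQuery call depends on how the object is used afterwards; it only matters when a by-reference function such as setcolorder() is applied.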

jangorecki on 2 Sep 2020

👍1

I'm still not sure why df[ , names(df)] works, but setcolorder is intended to work on data.tables, and you should always setDT when loading a data.table from RDS.

MichaelChirico on 2 Sep 2020

Thanks for reporting.
You may try setDF(setDT(df)) after readRDS or dbGetQuery.

I added a note just before you replied. It works correctly with data.table but not with this specific data.frame. As long as setcolorder() doesn't enforce data.table with an assertion, I think this issue is legit and should be investigated. No?

DavorJ on 2 Sep 2020

👍1

?setcolorder suggests that it should only be used for data.tables:

Usage
setcolorder(x, neworder=key(x))
Arguments
x
A data.table.

Based on the example and documentation, it would seem like one potential solution is to error if setcolorder receives something other than a data.table.

While I am unsure of the root cause, it seems to be related to objects sharing the same memory. Note that when we provide j to [.data.frame, a copy of the names is made, which likely explains why the second option works as intended.

library(data.table)
address(names(iris)) == address(names(iris[1L, ]))
#> [1] TRUE
address(names(iris)) == address(names(iris[1L, 1:5]))
#> [1] FALSE

Edit: removed some code that contained rookie mistakes.

ColeMiller1 on 3 Sep 2020

👍1

Erroring on data.frames will likely break some downstreams, as this behavior
has probably been around a while...

Not that we shouldn't do it, but I would rather understand what's happening
first.

On Wed, Sep 2, 2020, 7:10 PM Cole Miller notifications@github.com wrote:

?setcolorder suggests that it should only be used for data.tables:

Usage
setcolorder(x, neworder=key(x))
Arguments
x
A data.table.

Based on the example and documentation, it would seem like one potential
solution is to error if setcolorder receives something other than a
data.table.

While I am unsure the root cause, it seems to be related to objects
sharing the same memory - note when we provide j to [.data.frame, a copy
is made on the names which likely explains the second option works as
intended.

library(data.table)
address(names(iris)) == address(names(iris[1L, ]))
#> [1] TRUE
address(names(iris)) == address(names(iris[1L, 1:5]))
#> [1] FALSE

And finally, here's a more minimal example.

library(data.table)

DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF
setcolorder(DF2, 2L)
DF ##Good
#>   last first
#> 1 last first
#> 2 last first

DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF[1L, ]
setcolorder(DF2, 2L)
DF ##Bad
#>    last first
#> 1 first  last
#> 2 first  last

DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF[1L, names(DF)] ##copy of the names attribute will be made
setcolorder(DF2, 2L)
DF ##Good again!
#>   first last
#> 1 first last
#> 2 first last

This suggests that to allow data.frames, we can only setcolorder if the
memory of the names is not shared with another object.


MichaelChirico on 3 Sep 2020

I think we could error on data.frame here, at least until we can handle that nicely.
A breaking change should not be that bad if it follows the documentation; in any case, a revdeps check should be used to make this decision.
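
To illustrate what such a guard might look like from the user's side, here is a hedged sketch. The wrapper name setcolorder_strict is hypothetical and not part of data.table; it simply refuses anything that is not a data.table before delegating to setcolorder():

```r
library(data.table)

# Hypothetical wrapper (an assumption for illustration, not a data.table function):
# stop early on anything that is not a data.table, as the documentation implies.
setcolorder_strict <- function(x, neworder = key(x)) {
  if (!is.data.table(x)) {
    stop("x must be a data.table; call setDT(x) first", call. = FALSE)
  }
  setcolorder(x, neworder)
}

DT <- data.table(b = 1:2, a = 3:4)
setcolorder_strict(DT, c("a", "b"))   # reorders DT by reference
names(DT)

DF <- data.frame(b = 1:2, a = 3:4)
# On a plain data.frame the wrapper raises an error; capture its message here
res <- tryCatch(
  setcolorder_strict(DF, c("a", "b")),
  error = function(e) conditionMessage(e)
)
res
```

The check runs before any by-reference modification, so a rejected data.frame is left exactly as it was.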

jangorecki on 3 Sep 2020
