I'm seeing odd behavior with a data.frame returned by RSQLite. I have attached it (df.zip), since that is the only way to reproduce the problem.
Please unzip it before use: GitHub requires the file to be zipped.
Here is an annotated script that reproduces the problem with this file:
# load the data.frame
df <- readRDS('df.rds')
class(df)
# [1] "data.frame"
# there are many columns, but the first few are enough to show the problem
df[, 1:5]
# estate_id mod_timestamp cust_id cust_isOwner cust_publicInfo.name
# 1 -2147474620 1596557895 1575296 1 <NA>
# now I want to sort the columns (by reference?)
data.table::setcolorder(df, neworder = sort(colnames(df)))
# now let's look at the first few columns
df[, 1:5]
# estate_id mod_timestamp cust_id cust_isOwner cust_publicInfo.name
# 1 1575296 1 0 <NA> <NA>
# (!!) you see here that the fields have changed (i.e. sorted), but not the column names (!!)
# let us take a copy of this table by (!) selecting the columns this time (!)
df2 <- readRDS('df.rds')[, colnames(readRDS('df.rds'))]
# now do the same: order columns
data.table::setcolorder(df2, neworder = sort(colnames(df2)))
df2[, 1:5]
# cust_id cust_isOwner cust_publicInfo.hasIpiNo cust_publicInfo.ipiNo cust_publicInfo.name
# 1 1575296 1 0 <NA> <NA>
# (!!) now the columns and fields are ordered as they should (!!)
Any clue on what is going on here? Both readRDS('df.rds') and readRDS('df.rds')[, colnames(readRDS('df.rds'))] are identical, so I don't understand what is causing this column ordering discrepancy.
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Belgium.1252 LC_CTYPE=Dutch_Belgium.1252 LC_MONETARY=Dutch_Belgium.1252 LC_NUMERIC=C LC_TIME=Dutch_Belgium.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 tools_3.6.3 data.table_1.13.0
Note: converting df to a data.table solves the problem, but that raises the question: why doesn't setcolorder() work with this specific data.frame?
Thanks for reporting.
You may try setDF(setDT(df)) after readRDS or dbGetQuery.
I'm still not sure why df[ , names(df)] works, but setcolorder is intended to work on data.tables, and you should always setDT when loading a data.table from RDS.
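A minimal sketch of that workaround, assuming a small hypothetical data.frame in place of the attached df.rds (which cannot be reproduced here). The column names are borrowed from the report; the values are made up.

```r
library(data.table)

# Hypothetical stand-in for the RSQLite result; not the real df.rds.
df <- data.frame(estate_id = 1L, mod_timestamp = 2L, cust_id = 3L)

# Round-trip through data.table, then reorder the columns by reference.
setDF(setDT(df))
setcolorder(df, sort(names(df)))

class(df)
names(df)
```

After the round trip, df is a plain data.frame again, so downstream code that expects a data.frame should not need to change.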
I added a note just before you replied. It works correctly with data.table but not with this specific data.frame. As long as setcolorder() doesn't enforce data.table with an assertion, I think this issue is legit and should be investigated. No?
?setcolorder suggests that it should only be used for data.tables:
Usage
setcolorder(x, neworder=key(x))
Arguments
x
A data.table.
Based on the example and documentation, it would seem like one potential solution is to error if setcolorder receives something other than a data.table.
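To illustrate what such a check could look like from the user side, here is a rough sketch. The wrapper name setcolorder_strict is hypothetical and is not part of data.table; it only shows the idea of rejecting non-data.table input before delegating to setcolorder().

```r
library(data.table)

# Hypothetical wrapper, not a data.table function: refuse anything that is
# not a data.table, then delegate to the real setcolorder().
setcolorder_strict <- function(x, neworder) {
  if (!is.data.table(x)) {
    stop("x must be a data.table; convert it with setDT(x) first")
  }
  setcolorder(x, neworder)
}

DT <- data.table(b = 1:2, a = 3:4)
setcolorder_strict(DT, c("a", "b"))  # reorders DT by reference
names(DT)

# A plain data.frame is rejected instead of being silently modified.
res <- try(setcolorder_strict(data.frame(b = 1, a = 2), c("a", "b")), silent = TRUE)
inherits(res, "try-error")
```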
While I am unsure of the root cause, it seems to be related to objects sharing the same memory. Note that when we provide j to [.data.frame, a copy of the names is made, which likely explains why the second option works as intended.
library(data.table)
address(names(iris)) == address(names(iris[1L, ]))
#> [1] TRUE
address(names(iris)) == address(names(iris[1L, 1:5]))
#> [1] FALSE
Edit: removed some code that contained rookie mistakes.
Erroring on data.frames will likely break some downstreams, as this behavior has probably been around a while... Not that we shouldn't do it, but I would rather understand what's happening first.
On Wed, Sep 2, 2020, 7:10 PM Cole Miller notifications@github.com wrote:
?setcolorder suggests that it should only be used for data.tables:
Usage
setcolorder(x, neworder=key(x))
Arguments
x
A data.table.
Based on the example and documentation, it would seem like one potential solution is to error if setcolorder receives something other than a data.table.
While I am unsure the root cause, it seems to be related to objects sharing the same memory - note when we provide j to [.data.frame, a copy is made on the names which likely explains the second option works as intended.
library(data.table)
address(names(iris)) == address(names(iris[1L, ]))
#> [1] TRUE
address(names(iris)) == address(names(iris[1L, 1:5]))
#> [1] FALSE
And finally, here's a more minimal example.
library(data.table)
DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF
setcolorder(DF2, 2L)
DF ## Good
#>   last first
#> 1 last first
#> 2 last first
DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF[1L, ]
setcolorder(DF2, 2L)
DF ## Bad
#>    last first
#> 1 first  last
#> 2 first  last
DF = data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 = DF[1L, names(DF)] ## copy of the names attribute will be made
setcolorder(DF2, 2L)
DF ## Good again!
#>   first last
#> 1 first last
#> 2 first last
This suggests that to allow data.frames, we can only setcolorder if the memory of the names is not shared with another object.
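For plain data.frames, one way to sidestep the shared-names question is to reorder by value in base R instead of by reference. This is only a sketch of that alternative, reusing the DF and DF2 names from the minimal example; it is not a fix for setcolorder() itself.

```r
# Base R only: reorder by value, so DF2 is rebound to a new object
# and DF cannot be affected.
DF <- data.frame(first = rep("first", 2L), last = rep("last", 2L))
DF2 <- DF[1L, ]

DF2 <- DF2[, rev(names(DF2)), drop = FALSE]

names(DF)   # "first" "last"
names(DF2)  # "last" "first"
```

The trade-off is that the data.frame is copied rather than modified in place, which may matter for very wide or very large tables.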
I think we could error on data.frame here, at least until we can handle that nicely.
A breaking change should not be that bad if it follows the documentation; in any case, a revdeps check should be used to make this decision.