Data.table: Invalid .internal.selfref detected and fixed by taking a (shallow) copy

Created on 11 Jan 2019  ·  10 comments  ·  Source: Rdatatable/data.table (issue #3274)

Hi

I get the warning:

Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

when creating a new variable by reference, on a data.table that, at least as far as I can tell, is just a normal data.table:

library(dplyr)
library(data.table)

a1 <- data.table(v1=c(1:10), v2=rep('A'))
a2 <- data.table(v1=c(1:10), v2=rep('B'))
a3 <- Reduce(bind_rows,list(a1,a2))
a3[, n_max:=.N, by=v2]

I do not get the same warning when using the '[, n_max:=.N, by=v2]' syntax on either a1 or a2. This does not make any sense, does it?

Thank you for your time,
Emil

UPDATE: The warning does not appear when using rbind() instead.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8    
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8   
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.8 dplyr_0.7.8      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       crayon_1.3.4     assertthat_0.2.0 R6_2.3.0        
 [5] magrittr_1.5     pillar_1.3.1     rlang_0.3.0.1    bindrcpp_0.2.2  
 [9] glue_1.3.0       purrr_0.2.5      compiler_3.5.1   pkgconfig_2.0.2 
[13] colorspace_1.3-2 bindr_0.1.1      tidyselect_0.2.5 tibble_1.4.2  

All 10 comments

A quick fix may be to use rbindlist instead of bind_rows.

Please make your code reproducible, including calls to attach required libraries, to avoid errors like

Error in match.fun(f) : object 'bind_rows' not found

Oh sorry, yes. I forgot about the library calls. Done.

Michael, yes thank you. It just seems strange that it does not work.

Using bind_rows loses data.table's over-allocation of column slots. That results in the warning later on when using :=. You can check this with the truelength function.

library(dplyr)
library(data.table)
a1 <- data.table(v1=c(1:10), v2=rep('A'))
a2 <- data.table(v1=c(1:10), v2=rep('B'))
a3 <- Reduce(bind_rows,list(a1,a2))
a4 <- Reduce(rbind,list(a1,a2))
a5 <- rbindlist(list(a1,a2))
truelength(a3)
#[1] 0
truelength(a4)
#[1] 1026
truelength(a5)
#[1] 1026

I suggest using rbindlist instead.
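For reference, here is a minimal sketch of the original example rewritten with rbindlist. The names has_spare_slots and add_warning are only illustrative helpers, not part of data.table; they record whether spare column slots exist and whether := raised a warning.

```r
library(data.table)

a1 <- data.table(v1 = 1:10, v2 = "A")
a2 <- data.table(v1 = 1:10, v2 = "B")

# rbindlist() returns a new data.table with over-allocated column slots
a3 <- rbindlist(list(a1, a2))

# truelength() reports allocated column slots, length() the columns in use
has_spare_slots <- truelength(a3) > length(a3)

# Add the column by reference and capture a warning message, if any is raised
add_warning <- tryCatch(
  {
    a3[, n_max := .N, by = v2]
    NULL
  },
  warning = function(w) conditionMessage(w)
)
```

If the over-allocation is intact, add_warning should stay NULL and a3 gains the n_max column in place.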

Thank you for the clarification. Kind of unexpected behaviour, but not on data.table's side, I gather.

@emilBeBri you might raise this over at dplyr, not sure it's something they'll fix though.

You could also use alloc.col or setDT on the result of bind_rows and that should help as well, if you're married to using bind_rows.
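A rough sketch of that repair, assuming dplyr is installed. If it is not, the sketch falls back to a serialize round trip as a stand-in; that fallback is my assumption for illustration, chosen because data.table objects restored from serialized form are also expected to come back without over-allocation. The variable names slots_before and add_warning are hypothetical helpers.

```r
library(data.table)

a1 <- data.table(v1 = 1:10, v2 = "A")
a2 <- data.table(v1 = 1:10, v2 = "B")

a3 <- if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::bind_rows(a1, a2)
} else {
  # Stand-in (assumption): a serialize round trip, used only when dplyr is absent
  unserialize(serialize(rbind(a1, a2), NULL))
}

# Allocated column slots before the repair (the thread reports 0 after bind_rows)
slots_before <- truelength(a3)

# setDT() re-establishes the data.table internals by reference;
# a3 <- alloc.col(a3) is the alternative named in the comment above
setDT(a3)

# Add the column by reference and capture a warning message, if any is raised
add_warning <- tryCatch(
  {
    a3[, n_max := .N, by = v2]
    NULL
  },
  warning = function(w) conditionMessage(w)
)
```

After setDT, truelength(a3) should exceed length(a3) again, so := has room to add n_max without taking a shallow copy.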

But yes, canonical approach is to use rbindlist, and in fact this should be more efficient than bind_rows anyway :)

It is probably more appropriate to raise this on dtplyr: https://github.com/hadley/dtplyr

That project looks a bit dead unfortunately ☠️ no updates in 2 years+

It was just not actively maintained, but issues like this are good reasons to re-activate that project.

All right, I have done so.

https://github.com/hadley/dtplyr/issues/64
