Today data.table returned a duplicated V1 column name after a simple operation:
library(data.table)
dt <- fread("A 2
B 3
C 4
D 5")
dt[, mean(V2), by = V1]
V1 V1
1: A 2
2: B 3
3: C 4
4: D 5
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
locale:
[1] LC_CTYPE=es_CO.UTF-8 LC_NUMERIC=C LC_TIME=es_CO.UTF-8 LC_COLLATE=es_CO.UTF-8
[5] LC_MONETARY=es_CO.UTF-8 LC_MESSAGES=es_CO.UTF-8 LC_PAPER=es_CO.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_CO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.8-5 ggplot2_3.1.1 lubridate_1.7.4 data.table_1.12.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.10 magrittr_1.5 hms_0.4.2 tidyselect_0.2.5 munsell_0.5.0
[7] lattice_0.20-38 colorspace_1.4-1 R6_2.4.0 rlang_0.3.4 plyr_1.8.4 stringr_1.4.0
[13] dplyr_0.8.1 tools_3.6.0 grid_3.6.0 gtable_0.3.0 withr_2.1.2 digest_0.6.19
[19] lazyeval_0.2.2 assertthat_0.2.1 tibble_2.1.1 crayon_1.3.4 purrr_0.3.2 glue_1.3.1
[25] labeling_0.3 stringi_1.4.3 compiler_3.6.0 pillar_1.4.0 scales_1.0.0 pkgconfig_2.0.2
It seems related to one of the entries in https://github.com/Rdatatable/data.table/issues/3193 (the one by @Atrebas)
What behavior would you want to see instead? Something like the check.names=TRUE behavior in data.table() mentioned in your link?
Personally, I sometimes use the fact that the columns in j are always-ish named V1, V2, V3... like https://stackoverflow.com/a/50525294 or https://stackoverflow.com/a/52541768
@franknarf1 Thanks for the quick response. I'm ok with V1, V2 naming of otherwise unnamed columns, I use them myself a lot, just like in the SO examples you shared. The problem I see is (following my code above):
# this is expected behavior
dt[, mean(V2), by = V1][V1 == "A", ]
V1 V1
1: A 2
# this seems unstable
dt[, mean(V2), by = V1][V1 == 2, ]
Empty data.table (0 rows and 2 cols): V1,V1
# even though
str(dt[, mean(V2), by = V1])
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ V1: chr "A" "B" "C" "D"
$ V1: num 2 3 4 5
- attr(*, ".internal.selfref")=<externalptr>
This makes it clear to me that the "real" V1 is the original one. Before assigning V1 to an unnamed, newly created column, data.table should check whether the name already exists and, if so, append something to it, for example V1_1.
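As a stopgap for the renaming proposed above, base R's make.unique() can de-duplicate the names after the fact. This is only a sketch of the idea, not something data.table does on its own. The V1_1 suffix comes from the sep = "_" argument, and the table is rebuilt with data.table() here just to keep the example self-contained.

```r
library(data.table)

# Same data as the fread() example, built directly so the block stands alone
dt <- data.table(V1 = c("A", "B", "C", "D"), V2 = c(2, 3, 4, 5))

res <- dt[, mean(V2), by = V1]
names(res)   # both columns are called "V1"

# Rename in place: make.unique() turns the second "V1" into "V1_1"
setnames(res, make.unique(names(res), sep = "_"))

# The computed column can now be filtered unambiguously
res[V1_1 == 2]
```

With distinct names, the filter on the numeric column no longer silently resolves to the character grouping column.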
That's a tricky area.
IMHO, the ambiguity arises because the new column (resulting from computing mean(V2)) is given the default name V1.
dt[, .(mean(V2))]
# V1
# 1: 3.5
To avoid the name conflict you can either name the result column explicitly, e.g.,
dt[, .(mn = mean(V2))]
# mn
# 1: 3.5
dt[, .(mn = mean(V2)), by = V1]
# V1 mn
# 1: A 2
# 2: B 3
# 3: C 4
# 4: D 5
or name the grouping variable explicitly
dt[, mean(V2), by = .(grp = V1)]
# grp V1
# 1: A 2
# 2: B 3
# 3: C 4
# 4: D 5
or use .SDcols:
dt[, lapply(.SD, mean), .SDcols = "V2"]
# V2
# 1: 3.5
dt[, lapply(.SD, mean), .SDcols = "V2", by = V1]
# V1 V2
# 1: A 2
# 2: B 3
# 3: C 4
# 4: D 5
[Feature request]:
For _simple operations_ (which depend on only one column), a shortcut for the above would be to give new columns the same name as the column used in the expression:
dt[, .(mean(V2))]
# V2
# 1: 3.5
dt[, mean(V2), by = V1]
# V1 V2
# 1: A 2
# 2: B 3
# 3: C 4
# 4: D 5
In case of _multiple_ simple operations this would follow the same naming scheme used for .SDcols, e.g.,
dt[, .(mean(V2), sum(V2)), by = V1]
# V1 V21 V22
# 1: A 2 2
# 2: B 3 3
# 3: C 4 4
# 4: D 5 5
which would be a shortcut for
dt[, unlist(lapply(.SD, function(x) .(mean(x), sum(x))), recursive = FALSE), .SDcols = c("V2"), by = V1]
# V1 V21 V22
# 1: A 2 2
# 2: B 3 3
# 3: C 4 4
# 4: D 5 5
I'm not sure this really belongs here, but it's about duplicated column names after operations, so I thought it could be.
dt1 <- data.table(CHR = 29:34,
BP = seq(0, 30, length.out = 6),
key = "BP")
dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]
setcolorder(dt2, c(1, 3, 2))
dt1[dt2,
.(BP, CHR, gene_id),
on = .(BP >= start_pos, BP <= end_pos),
nomatch = NULL,
by = .EACHI]
results in:
BP BP BP CHR gene_id
1: 0.000000 2.000000 0 29 ABCD01
2: 4.571429 6.571429 6 30 ABCD03
3: 11.428571 13.428571 12 31 ABCD06
4: 16.000000 18.000000 18 32 ABCD08
5: 22.857143 24.857143 24 33 ABCD011
6: 29.714286 31.714286 30 34 ABCD014
with duplicate names. It is easily solved with setnames(), but it's not the behavior I expected. Ideally the names would have been c("start_pos", "end_pos", "BP", "CHR", "gene_id"), but as that seems difficult, at least c("V1", "V2", "BP", "CHR", "gene_id") would do.
Compare this with the names resulting from foverlaps(dt1, dt2) (expected behavior).
Running data.table_1.12.2 on R 3.6.1
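One way to get the names hoped for above, without setnames(), is to name every output column in j and use the x. and i. prefixes to say which table each value comes from. This is a sketch under assumptions: it drops by = .EACHI (which is what prepends the join columns under the name BP) and rebuilds the same two tables so it can run on its own.

```r
library(data.table)

dt1 <- data.table(CHR = 29:34,
                  BP = seq(0, 30, length.out = 6),
                  key = "BP")
dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
                  gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]

# Name each column explicitly: i.* takes values from dt2, x.* from dt1
res <- dt1[dt2,
           .(start_pos = i.start_pos,
             end_pos   = i.end_pos,
             BP        = x.BP,
             CHR,
             gene_id),
           on = .(BP >= start_pos, BP <= end_pos),
           nomatch = NULL]
names(res)
```

Because every column is named in j, the result has no repeated BP columns to clean up afterwards.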
[Feature request]:
For simple operations (which depend on only one column), a shortcut for the above would be to give new columns the same name as the column used in the expression:
The problem is that a simple operation that depends on only one column can still return multiple columns. IMO the proper way is for the user to handle the names.
We could eventually check whether the output of an operation contains duplicated column names and raise a warning. But we would probably need to add such a check to each data.table function, one by one. Moreover, duplicated names can also be used intentionally, for example when formatting a table for output where some of the fields have to be duplicated. Voting for close.
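For anyone who wants that warning today, the check is small enough to do on the user side. The helper below is hypothetical (warn_if_dup_names is not a data.table function) and only sketches what such a check could look like when wrapped around a single result.

```r
library(data.table)

# Hypothetical helper, not part of data.table:
# warn when a result carries repeated column names, then return it unchanged
warn_if_dup_names <- function(x) {
  dup <- unique(names(x)[duplicated(names(x))])
  if (length(dup) > 0L) {
    warning("duplicated column names: ", paste(dup, collapse = ", "))
  }
  invisible(x)
}

dt <- data.table(V1 = c("A", "B"), V2 = c(2, 3))
res <- warn_if_dup_names(dt[, mean(V2), by = V1])   # warns about V1
```

Keeping it opt-in leaves intentional duplicates (for example in tables formatted for output) untouched.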
Agree re closing since it's not clear what the fix should be.
One other idea to address OP's case would be to add prefixes: just as joins use x.* and i.*, grouping could use by.* and j.*
(This would clarify for the user where columns are coming from. Perhaps for all prefixes, an "explicit column names" toggle could be used to turn the prefixes on even when there are no duplicates/conflicts. Related https://github.com/Rdatatable/data.table/issues/1700#issuecomment-371455256)
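Until something like that exists, the same effect can be had by writing the prefixes by hand. In this sketch by.V1 and j.V1 are ordinary user-chosen names, not a data.table feature; they only illustrate how the prefixed result would read.

```r
library(data.table)

dt <- data.table(V1 = c("A", "B", "C", "D"), V2 = c(2, 3, 4, 5))

# by.V1 / j.V1 are plain column names chosen to mimic the proposed prefixes
res <- dt[, .(j.V1 = mean(V2)), by = .(by.V1 = V1)]
names(res)

# Each column can now be addressed without ambiguity
res[j.V1 == 2]
```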