Data.table: duplicate names on simple operation

Created on 14 Jun 2019 · 7 Comments · Source: Rdatatable/data.table

Today data.table returned a duplicated V1 column name after a simple operation:

Toy data

dt <- fread("A 2
B 3
C 4
D 5")

The operation produces two columns named V1:

dt[, mean(V2), by = V1]
   V1 V1
1:  A  2
2:  B  3
3:  C  4
4:  D  5

Session info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=es_CO.UTF-8       LC_NUMERIC=C               LC_TIME=es_CO.UTF-8        LC_COLLATE=es_CO.UTF-8    
 [5] LC_MONETARY=es_CO.UTF-8    LC_MESSAGES=es_CO.UTF-8    LC_PAPER=es_CO.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_CO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] zoo_1.8-5         ggplot2_3.1.1     lubridate_1.7.4   data.table_1.12.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       rstudioapi_0.10  magrittr_1.5     hms_0.4.2        tidyselect_0.2.5 munsell_0.5.0   
 [7] lattice_0.20-38  colorspace_1.4-1 R6_2.4.0         rlang_0.3.4      plyr_1.8.4       stringr_1.4.0   
[13] dplyr_0.8.1      tools_3.6.0      grid_3.6.0       gtable_0.3.0     withr_2.1.2      digest_0.6.19   
[19] lazyeval_0.2.2   assertthat_0.2.1 tibble_2.1.1     crayon_1.3.4     purrr_0.3.2      glue_1.3.1      
[25] labeling_0.3     stringi_1.4.3    compiler_3.6.0   pillar_1.4.0     scales_1.0.0     pkgconfig_2.0.2

It seems related to one of the entries in https://github.com/Rdatatable/data.table/issues/3193 (the one by @Atrebas).

All 7 comments

What behavior would you want to see instead? Something like the check.names=TRUE behavior in data.table() mentioned in your link?

Personally, I sometimes use the fact that the columns in j are always-ish named V1, V2, V3... like https://stackoverflow.com/a/50525294 or https://stackoverflow.com/a/52541768
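For readers unfamiliar with the check.names argument mentioned above, here is a minimal sketch of what it does in the data.table() constructor. The object names dup and uniq are only illustrative, and this shows the constructor only, not the grouped result discussed in this issue.

```r
library(data.table)

# By default data.table() keeps the column names exactly as given.
dup <- data.table(V1 = "A", V1 = 2)
names(dup)

# With check.names = TRUE the names are made unique, as data.frame() does.
uniq <- data.table(V1 = "A", V1 = 2, check.names = TRUE)
names(uniq)
```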

@franknarf1 Thanks for the quick response. I'm ok with V1, V2 naming of otherwise unnamed columns, I use them myself a lot, just like in the SO examples you shared. The problem I see is (following my code above):

# this is expected behavior
dt[, mean(V2), by = V1][V1 == "A", ]
   V1 V1
1:  A  2

# this seems unstable
dt[, mean(V2), by = V1][V1 == 2, ]
Empty data.table (0 rows and 2 cols): V1,V1

# even though
str(dt[, mean(V2), by = V1])
Classes ‘data.table’ and 'data.frame':  4 obs. of  2 variables:
 $ V1: chr  "A" "B" "C" "D"
 $ V1: num  2 3 4 5
 - attr(*, ".internal.selfref")=<externalptr>

This makes it clear to me that the "real" V1 is the original one. Before assigning V1 to an unnamed, newly created column, data.table should check whether that name already exists and, if so, append something to it, for example V1_1.
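Until something like this exists, the suffixing can be done by hand after the fact. The sketch below uses base R's make.unique() together with setnames(); it assumes the grouped result still comes back with two columns named V1, as reported above for data.table 1.12.2, and it rebuilds the toy data with data.table() rather than fread().

```r
library(data.table)

# Same toy data as above, built without fread().
dt <- data.table(V1 = c("A", "B", "C", "D"), V2 = 2:5)

res <- dt[, mean(V2), by = V1]

# make.unique() leaves the first occurrence alone and suffixes later ones,
# so c("V1", "V1") becomes c("V1", "V1_1").
setnames(res, make.unique(names(res), sep = "_"))

# The computed column can now be filtered by name (assumes it was renamed V1_1).
res[V1_1 == 2]
```

Renaming is done by position here, so it does not matter that the old names are ambiguous.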

That's a tricky area.

IMHO, the ambiguity arises because the new column (the result of computing mean(V2)) is given the default name V1.

dt[,  .(mean(V2))]
#    V1
# 1: 3.5

To avoid the name conflict you can either name the result column explicitly, e.g.,

dt[,  .(mn = mean(V2))]
#     mn
# 1: 3.5

dt[,  .(mn = mean(V2)), by = V1]
#    V1 mn
# 1:  A  2
# 2:  B  3
# 3:  C  4
# 4:  D  5

or name the grouping variable explicitly

dt[, mean(V2), by = .(grp = V1)]
#    grp V1
# 1:   A  2
# 2:   B  3
# 3:   C  4
# 4:   D  5

or use .SDcols:

dt[, lapply(.SD, mean), .SDcols = "V2"]
#     V2
# 1: 3.5

dt[, lapply(.SD, mean), .SDcols = "V2", by = V1]
#    V1 V2
# 1:  A  2
# 2:  B  3
# 3:  C  4
# 4:  D  5

[Feature request]:
For _simple operations_ (which depend on only one column), a shortcut for the above would be to give new columns the same name as the column used in the expression:

dt[,  .(mean(V2))]
#     V2
# 1: 3.5

dt[, mean(V2), by = V1]
#    V1 V2
# 1:  A  2
# 2:  B  3
# 3:  C  4
# 4:  D  5

In case of _multiple_ simple operations this would follow the same naming scheme used for .SDcols, e.g.,

dt[, .(mean(V2), sum(V2)), by = V1]
#    V1 V21 V22
# 1:  A   2   2
# 2:  B   3   3
# 3:  C   4   4
# 4:  D   5   5

which would be a shortcut for

dt[, unlist(lapply(.SD, function(x) .(mean(x), sum(x))), recursive = FALSE), .SDcols = c("V2"), by = V1]
#    V1 V21 V22
# 1:  A   2   2
# 2:  B   3   3
# 3:  C   4   4
# 4:  D   5   5

I'm not sure this really belongs here, but it's about duplicated column names after operations, so I thought it could be.

dt1 <- data.table(CHR = 29:34, 
                  BP = seq(0, 30, length.out = 6), 
                  key = "BP")

dt2 <- data.table(start_pos = seq(0, 32, length.out = 15), 
                  gene_id = paste0("ABCD", rep(0, 3), 1:15))

dt2[, end_pos := start_pos + 2]

setcolorder(dt2, c(1, 3, 2))

dt1[dt2, 
    .(BP, CHR, gene_id), 
    on = .(BP >= start_pos, BP <= end_pos), 
    nomatch = NULL, 
    by = .EACHI]

results in :

          BP        BP BP CHR gene_id
1:  0.000000  2.000000  0  29  ABCD01
2:  4.571429  6.571429  6  30  ABCD03
3: 11.428571 13.428571 12  31  ABCD06
4: 16.000000 18.000000 18  32  ABCD08
5: 22.857143 24.857143 24  33 ABCD011
6: 29.714286 31.714286 30  34 ABCD014

with duplicate names. It is easily solved with setnames(), but it's not the behavior I expected. Ideally the names would have been c("start_pos", "end_pos", "BP", "CHR", "gene_id"), but as that seems difficult, at least c("V1", "V2", "BP", "CHR", "gene_id") would do.

Compare this with the names resulting from foverlaps(dt1, dt2) (expected behavior).

Running data.table_1.12.2 on R 3.6.1
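One way to get the names asked for here without a follow-up setnames() is to select every column explicitly in j, using the x. and i. prefixes to say which table each value comes from. This is only a sketch of that workaround, not a change in data.table's behavior; it drops by = .EACHI so that only the columns listed in j are returned, and res is an illustrative name.

```r
library(data.table)

dt1 <- data.table(CHR = 29:34,
                  BP = seq(0, 30, length.out = 6),
                  key = "BP")

dt2 <- data.table(start_pos = seq(0, 32, length.out = 15),
                  gene_id = paste0("ABCD", rep(0, 3), 1:15))
dt2[, end_pos := start_pos + 2]

# x.BP is the BP value from dt1; i.start_pos and i.end_pos come from dt2.
res <- dt1[dt2,
           .(start_pos = i.start_pos, end_pos = i.end_pos,
             BP = x.BP, CHR, gene_id),
           on = .(BP >= start_pos, BP <= end_pos),
           nomatch = NULL]

names(res)
```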

[Feature request]:
For simple operations (which depend on only one column), a shortcut for the above would be to give new columns the same name as the column used in the expression:

The problem is that a simple operation that depends on only one column can still return multiple columns. IMO the proper way is for the user to handle names.

We could eventually check whether the output of an operation contains duplicated column names and raise a warning, but we would probably need to add such a check to each data.table function one by one. Moreover, duplicated names may be used intentionally, for example when formatting a table for output where some fields have to be duplicated. Voting to close.
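As a rough illustration of the kind of check described here, a user-side version could look like the sketch below. warn_dup_names() is a hypothetical helper, not part of data.table, and it inspects a result after the fact rather than hooking into any data.table function.

```r
library(data.table)

# Hypothetical helper, not part of data.table: warn when a table has
# duplicated column names and return the table unchanged.
warn_dup_names <- function(x) {
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups)) {
    warning("duplicated column names: ", paste(dups, collapse = ", "))
  }
  invisible(x)
}

dt <- data.table(V1 = c("A", "B", "C", "D"), V2 = 2:5)

# Warns only if the grouped result carries a duplicated name.
warn_dup_names(dt[, mean(V2), by = V1])
```

Because the helper only warns, tables with intentionally duplicated names still pass through unchanged.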

Agree re closing since it's not clear what the fix should be.

One other idea to address OP's case would be to add prefixes: like x.* and i.*, could use by.*, j.*

(This would clarify for the user where columns are coming from. Perhaps for all prefixes, an "explicit column names" toggle could be used to turn the prefixes on even when there are no duplicates/conflicts. Related https://github.com/Rdatatable/data.table/issues/1700#issuecomment-371455256)
