tested with latest tidyverse
setequal(union(iris,iris),iris)
union(iris,iris) (149) has different number of rows from iris (150)! How can this be?
I often think that data.frame with rownames does not play well with dplyr. It may be related here.
If I pass rownames to columns, it works as expected.
library(dplyr)
#>
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# with rownames
intersect(iris, iris) %>% nrow()
#> [1] 149
union(iris, iris) %>% nrow()
#> [1] 149
setequal(union(iris,iris),iris)
#> FALSE: Different number of rows
# Without rownames
iris <- tibble::rownames_to_column(iris)
intersect(iris, iris) %>% nrow()
#> [1] 150
union(iris, iris) %>% nrow()
#> [1] 150
setequal(union(iris,iris),iris)
#> TRUE
Created on 2017-12-02 by the reprex package (v0.1.1.9000).
I think it is a good practice to not have rownames when working with tidyverse.
From the tibble rownames documentation
Generally, it is best to avoid row names, because they are basically a character column with different semantics to every other column.
It is whytibbleprovides some helpers to delete them or convert back and forth.
However, I am not sure why this behaviour with union and others. it is not consistent with base function. Don't know if it is intended or not.
Hope It helps.
I think some of this is that one of dplyr::union_all() or dplyr::bind_rows() may be the better operator for what I assume is your intended application (union knocks out duplicate rows!). Also definitely do not use rownames with dplyr, tibble::rownames_to_column() is strongly advised.
Thank you all for the insights. How about issuing a warning message then when having a input with rownames?
Just a follow-up:
a=b=iris
rownames(a) <- NULL
rownames(b) <- NULL
union(a,b) %>% nrow # still 149.
Seems rownames() <- NULL would not have an effect here
There is a duplicate row and union() removes duplicates:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>% group_by_at(., colnames(.)) %>% summarize(n = n()) %>% filter(n > 1)
#> # A tibble: 1 x 6
#> # Groups: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width [1]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species n
#> <dbl> <dbl> <dbl> <dbl> <fctr> <int>
#> 1 5.8 2.7 5.1 1.9 virginica 2
Please try union_all:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
union_all(iris, iris) %>% nrow()
#> [1] 300
Thanks. Looks like union() does work as expected. Maybe we should be explicit in the documentation that duplicate rows are removed?
I can help with this, what is the best way to proceed? Add the clarification at the beginning of the docs, add one example, both?
Clarification and example would be great. Does this affect intersect() and other verbs?
@edublancas, are you still interested in submitting a PR for this? If so, awesome, and, if not, that's totally fine, too. If you wouldn't mind letting us know when you get a chance (or if you have any questions), that'd be great so I know whether to add to my TODO list! 馃憤
Thanks.
@batpigandme Yes, I can work on this. I can probably find some time to do it during this week. Will post updates here.
Submitted PR (#3474), let me know what you think.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
Most helpful comment
I often think that data.frame with rownames does not play well with
dplyr. It may be related here.If I pass rownames to columns, it works as expected.
Created on 2017-12-02 by the reprex package (v0.1.1.9000).
I think it is a good practice to not have rownames when working with
tidyverse.From the
tibblerownamesdocumentationHowever, I am not sure why this behaviour with
unionand others. it is not consistent withbasefunction. Don't know if it is intended or not.Hope It helps.