Dplyr: union the exactly same two copies gives different number of rows

Created on 2 Dec 2017 · 12Comments · Source: tidyverse/dplyr

tested with latest tidyverse

setequal(union(iris,iris),iris)
union(iris,iris) (149) has different number of rows from iris (150)! How can this be?

documentation wip

Source

jerryzhujian9

Most helpful comment

I often think that data.frame with rownames does not play well with dplyr. It may be related here.
If I pass rownames to columns, it works as expected.

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# with rownames
intersect(iris, iris) %>% nrow()
#> [1] 149
union(iris, iris) %>% nrow()
#> [1] 149
setequal(union(iris,iris),iris)
#> FALSE: Different number of rows
# Without rownames 
iris <- tibble::rownames_to_column(iris)
intersect(iris, iris) %>% nrow()
#> [1] 150
union(iris, iris) %>% nrow()
#> [1] 150
setequal(union(iris,iris),iris)
#> TRUE

Created on 2017-12-02 by the reprex package (v0.1.1.9000).

I think it is a good practice to not have rownames when working with tidyverse.
From the tibble rownames documentation

Generally, it is best to avoid row names, because they are basically a character column with different semantics to every other column.
It is why tibble provides some helpers to delete them or convert back and forth.

However, I am not sure why this behaviour with union and others. it is not consistent with base function. Don't know if it is intended or not.

Hope It helps.

cderv on 2 Dec 2017

👍3

All 12 comments

I often think that data.frame with rownames does not play well with dplyr. It may be related here.
If I pass rownames to columns, it works as expected.

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# with rownames
intersect(iris, iris) %>% nrow()
#> [1] 149
union(iris, iris) %>% nrow()
#> [1] 149
setequal(union(iris,iris),iris)
#> FALSE: Different number of rows
# Without rownames 
iris <- tibble::rownames_to_column(iris)
intersect(iris, iris) %>% nrow()
#> [1] 150
union(iris, iris) %>% nrow()
#> [1] 150
setequal(union(iris,iris),iris)
#> TRUE

Created on 2017-12-02 by the reprex package (v0.1.1.9000).

I think it is a good practice to not have rownames when working with tidyverse.
From the tibble rownames documentation

Generally, it is best to avoid row names, because they are basically a character column with different semantics to every other column.
It is why tibble provides some helpers to delete them or convert back and forth.

However, I am not sure why this behaviour with union and others. it is not consistent with base function. Don't know if it is intended or not.

Hope It helps.

cderv on 2 Dec 2017

👍3

I think some of this is that one of dplyr::union_all() or dplyr::bind_rows() may be the better operator for what I assume is your intended application (union knocks out duplicate rows!). Also definitely do not use rownames with dplyr, tibble::rownames_to_column() is strongly advised.

JohnMount on 2 Dec 2017

Thank you all for the insights. How about issuing a warning message then when having a input with rownames?

jerryzhujian9 on 4 Dec 2017

Just a follow-up:

a=b=iris
rownames(a) <- NULL
rownames(b) <- NULL
union(a,b) %>% nrow  # still 149.

Seems rownames() <- NULL would not have an effect here

jerryzhujian9 on 7 Dec 2017

There is a duplicate row and union() removes duplicates:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
iris %>% group_by_at(., colnames(.)) %>% summarize(n = n()) %>% filter(n > 1)
#> # A tibble: 1 x 6
#> # Groups:   Sepal.Length, Sepal.Width, Petal.Length, Petal.Width [1]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species     n
#>          <dbl>       <dbl>        <dbl>       <dbl>    <fctr> <int>
#> 1          5.8         2.7          5.1         1.9 virginica     2

Please try union_all:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
union_all(iris, iris) %>% nrow()
#> [1] 300

MZLABS on 7 Dec 2017

Thanks. Looks like union() does work as expected. Maybe we should be explicit in the documentation that duplicate rows are removed?

krlmlr on 12 Dec 2017

👍1

I can help with this, what is the best way to proceed? Add the clarification at the beginning of the docs, add one example, both?

edublancas on 2 Mar 2018

Clarification and example would be great. Does this affect intersect() and other verbs?

krlmlr on 5 Mar 2018

@edublancas, are you still interested in submitting a PR for this? If so, awesome, and, if not, that's totally fine, too. If you wouldn't mind letting us know when you get a chance (or if you have any questions), that'd be great so I know whether to add to my TODO list! 👍
Thanks.

batpigandme on 24 Mar 2018

@batpigandme Yes, I can work on this. I can probably find some time to do it during this week. Will post updates here.

edublancas on 26 Mar 2018

👍1

Submitted PR (#3474), let me know what you think.

edublancas on 30 Mar 2018

👍1

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/