Dplyr: Default behavior of count() seems to have changed in 1.0.0

Created on 4 Jun 2020 · 6 Comments · Source: tidyverse/dplyr

I noticed that the default behavior of count() has changed in 1.0.0, but the change wasn't listed as a breaking change in the release notes.

0.8.5

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.5'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> # A tibble: 3 x 2
#>       x     n
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     1

1.0.0

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> Using `n` as weighting variable
#> ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
#> # A tibble: 3 x 2
#>       x     n
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1

_Originally posted by @nhols in https://github.com/tidyverse/dplyr/issues/5265#issuecomment-638496562_
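For code that has to give the same result under both versions shown above, one option is to avoid the default altogether. The sketch below is only an illustration (the names `row_counts` and `weighted_counts` are made up here); it states the row count and the weighted sum explicitly with group_by() and summarise(), so an existing `n` column is used only when asked for:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:3, n = 3:1)

# Count rows per value of x without going through count()/tally(),
# so the existing column `n` is never treated as a weight.
row_counts <- df %>%
  group_by(x) %>%
  summarise(n = n()) %>%
  ungroup()

# Sum the existing `n` column per value of x when weighting is intended.
weighted_counts <- df %>%
  group_by(x) %>%
  summarise(n = sum(n)) %>%
  ungroup()
```

Here `row_counts$n` matches the 0.8.5 output above and `weighted_counts$n` matches the 1.0.0 output.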

bug verbs

nhols

👍4

Most helpful comment

Seconded - I and others I work with regularly use the x %>% count(a, b) %>% count(a) pattern when exploring datasets, and this has broken a bit of existing code.

That aside, it seems like a really counter-intuitive default to implicitly assume that weighting is intended because there's a variable called n in the data frame. It makes way more sense to me to only weight if it's explicitly specified.
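A possible way to keep that exploration pattern independent of the default, assuming the goal of count(a, b) %>% count(a) is the number of distinct (a, b) combinations per value of a. The data and the names `pairs_per_a` and `pairs_per_a_alt` are hypothetical:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical example data: `a` is a group, `b` a category within it.
df <- tibble(
  a = c("p", "p", "p", "q", "q"),
  b = c("u", "u", "v", "w", "w")
)

# distinct() keeps one row per (a, b) pair and creates no `n` column,
# so the following count() has nothing to pick up as a weight.
pairs_per_a <- df %>%
  distinct(a, b) %>%
  count(a)

# A summary that avoids count() altogether; it gives the same numbers
# for this data, which contains no missing values.
pairs_per_a_alt <- df %>%
  group_by(a) %>%
  summarise(n = n_distinct(b)) %>%
  ungroup()
```

Both results have n = 2 for "p" and n = 1 for "q".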

The reference to counting rows in the message also looks incorrect at the moment: setting wt = 1 returns n = 1 for every returned row, whereas wt = n() replicates the old behaviour:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = c(1, 1, 1, 2, 2, 2), n = 1:6)

df %>% count(x, wt = n)
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <int>
#> 1     1     6
#> 2     2    15

df %>% count(x, wt = 1)
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1

df %>% count(x, wt = n())
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <int>
#> 1     1     3
#> 2     2     3

Created on 2020-06-09 by the reprex package (v0.3.0)
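The three results above are consistent with wt being evaluated once per group and then summed. That reading is an assumption about the internals, not something confirmed in this thread. The sketch below reproduces the same numbers with summarise() (the column names `sum_of_n`, `sum_of_one` and `sum_of_size` are made up for illustration):

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c(1, 1, 1, 2, 2, 2), n = 1:6)

by_x <- df %>%
  group_by(x) %>%
  summarise(
    sum_of_n    = sum(n),   # 6 and 15, like wt = n
    sum_of_one  = sum(1),   # 1 and 1, like wt = 1: a single constant per group
    sum_of_size = sum(n())  # 3 and 3, like wt = n(): the group size
  ) %>%
  ungroup()
```

Under this reading, a constant weight of 1 is a length-one value per group rather than one value per row, which would explain why it does not count rows.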

gorcha on 9 Jun 2020

👍7

All 6 comments

I don't remember count() ever working like this 😞; the new behaviour is definitely what I intended here.

hadley on 5 Jun 2020

I like the wt argument; in fact, I have used it many times when there are weights in the data. But when I weight, it is generally intentional.
I also have code, and I remember having a database with a column called n, on which I wanted to count rows (without wt). I saw that specifying wt = NULL fixes the problem, although I would expect this to be the default behavior of count().

brianmsm on 6 Jun 2020

Hello all,

I have a similar issue, not in count() but in tally():

data2 <- data %>%
  dplyr::group_by(SITEID) %>%
  tally()

I used to use tally() to count how many rows of data I had for each SITEID.

Thanks @gorcha, wt = n() fixes the issue:

data2 <- data %>%
  dplyr::group_by(SITEID) %>%
  tally(wt = n())

Now I just need to remember which dataframes have n in them.
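Rather than remembering, the check can be automated. A minimal base R sketch, assuming the data frames can be collected in a named list (the list `frames` and its contents are hypothetical):

```r
# Hypothetical list of data frames to check.
frames <- list(
  sites  = data.frame(SITEID = c("A", "A", "B")),
  totals = data.frame(SITEID = c("A", "B"), n = c(10, 20))
)

# TRUE for every data frame that has a column named exactly `n`.
has_n <- vapply(frames, function(d) "n" %in% names(d), logical(1))

# Names of the data frames where tally() or count() could pick up `n`.
names(frames)[has_n]
```

For these example inputs only `totals` is flagged.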

lilly-lilly on 16 Jun 2020

The root problem appears to be that an inconsistency crept in between count() and tally() (https://github.com/tidyverse/dplyr/blob/v0.8.5/R/count-tally.R#L25-L33). In my mind, count() and tally() always behaved the same way (automatically weighting by an n variable, if present).

However, it looks like count() and tally() have in fact behaved differently for rather a long time (at least in 0.6.0 and above), although it took a long time for the documentation to be correct (it looks like it was fixed in 0.8.5).

Since this was a fairly major and unplanned change, I think the best option is to roll it back in 1.0.1 so that count() once again doesn't try to use an n variable by default.

hadley on 23 Jun 2020

👍5

Alternatively, maybe it's time to make tally() and count() consistent once more by making _neither_ automatically weight by n if present.

hadley on 23 Jun 2020

👍2
