Noticed that the default behavior of count() has changed in 1.0.0 but wasn't listed as a breaking change in the release notes
0.8.5
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.5'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> # A tibble: 3 x 2
#> x n
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 1
1.0.0
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> Using `n` as weighting variable
#> ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
#> # A tibble: 3 x 2
#> x n
#> <int> <int>
#> 1 1 3
#> 2 2 2
#> 3 3 1
_Originally posted by @nhols in https://github.com/tidyverse/dplyr/issues/5265#issuecomment-638496562_
I don't remember count() ever working like this 😞; the new behaviour is definitely what I intended here.
I like the wt argument, in fact I have used it many times when there are weights in the data. But generally this is intentional behavior.
I also have code and I remember having a database with a column called n, and on which I wanted to count (without wt). I saw that specifying wt = NULL fixes the problem. Although, I would expect this to be the default behavior of count.
Seconded - I and others I work with regularly use the x %>% count(a, b) %>% count(a) pattern when exploring datasets, and this has broken a bit of existing code.
That aside, it seems like a really counter-intuitive default to implicitly assume that weighting is intended because there's a variable called n in the data frame. It makes way more sense to me to only weight if it's explicitly specified.
The reference to counting rows in the message looks incorrect at the moment also - setting wt = 1 returns n = 1 for all returned rows, where wt = n() replicates the old behaviour:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = c(1, 1, 1, 2, 2, 2), n = 1:6)
df %>% count(x, wt = n)
#> # A tibble: 2 x 2
#> x n
#> <dbl> <int>
#> 1 1 6
#> 2 2 15
df %>% count(x, wt = 1)
#> # A tibble: 2 x 2
#> x n
#> <dbl> <dbl>
#> 1 1 1
#> 2 2 1
df %>% count(x, wt = n())
#> # A tibble: 2 x 2
#> x n
#> <dbl> <int>
#> 1 1 3
#> 2 2 3
Created on 2020-06-09 by the reprex package (v0.3.0)
Hello all,
I have a similar issue not in count but in tally.
data2<- data%>%
dplyr::group_by(SITEID) %>%
tally()
I used to use tally to add up how many rows of data I had for each siteID.
Thanks @gorcha that wt = n () fixes the issue:-
data2<- data%>%
dplyr::group_by(SITEID) %>%
tally(wt = n())
Now I just need to remember which dataframes have n in them.
The root problem appears to be that an inconsistency crept in between count and tally (https://github.com/tidyverse/dplyr/blob/v0.8.5/R/count-tally.R#L25-L33) — in my mind count() and tally() always behaved the same way (automatically weighting by n variable, if present.
However, it looks like count() and tally() have in fact behaved differently for rather a long time (at least in 0.6.0 and above), although it took a long time for the documentation to be correct (it looks like it was fixed in 0.8.5).
Since this was a fairly major and unplanned change, I think the best option is to roll it back in 1.0.1 so that count() once again doesn't try use an n variable by default.
Alternatively, maybe it's time to make tally() and count() consistent once more by making _neither_ automatically weight by n if present.
Most helpful comment
Seconded - I and others I work with regularly use the
x %>% count(a, b) %>% count(a)pattern when exploring datasets, and this has broken a bit of existing code.That aside, it seems like a really counter-intuitive default to implicitly assume that weighting is intended because there's a variable called
nin the data frame. It makes way more sense to me to only weight if it's explicitly specified.The reference to counting rows in the message looks incorrect at the moment also - setting
wt = 1returns n = 1 for all returned rows, wherewt = n()replicates the old behaviour:Created on 2020-06-09 by the reprex package (v0.3.0)