Tidyr: Duplicate identifiers for rows in `spread`

Created on 22 Feb 2018 · 10Comments · Source: tidyverse/tidyr

This may be a duplicate of #296, but there's a community discussion about duplicate identifiers in spread. I find it very difficult to explain to newcomers why spread has sometimes difficulties spreading what gather has just gathered.

Lets take example from SO question:

 library(tidyverse)
 df <- structure(list(age = c("21", "17", "32", "29", "15"), 
                        gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")), 
                   row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
 df
#> # A tibble: 5 x 2
#>   age   gender
#>   <chr> <fct> 
#> 1 21    Male  
#> 2 17    Female
#> 3 32    Female
#> 4 29    Male  
#> 5 15    Male  

df %>% 
  spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

Ok, so for some strange reason tidyr needs another index to do the spread. Fine! Why can't it intercept the call and do something along the lines of:

df %>% 
  group_by_at(vars(-age)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key=gender, value=age) %>%    # spread
  select(-row_id)  # drop the index

#> # A tibble: 3 x 2
#>   Female Male 
#>   <chr>  <chr>
#> 1 17     21   
#> 2 32     29   
#> 3 NA     15

Is there any reason why doing it on behalf of user might be against his/her interest?

feature pivoting tidy-dev-day

Source

dmi3kno

👍28

Most helpful comment

I'm going to work on the error message for this. 🙌

juliasilge on 19 Jan 2019

👍6 🚀3 🎉3

All 10 comments

It might be better to include with the error message
Error: Duplicate identifiers for rows (2, 3), (1, 4, 5) to say
Consider adding a unique ID with tibble::rowid_to_column() or
Duplicate identifiers in key, please use a unique key to spread on.

pgensler on 25 Feb 2018

👍5

@hadley this feature request is now the third-most upvoted open issue. I personally encounter this issue when trying to demonstrate how to invert a gather(). While I know you're inundated with tasks, I thought it was worth an @ to see if this one might get onto your queue.

hammer on 28 Aug 2018

👍2

The third-most upvoted issue in tidyr, but only the 9th in the tidyverse as whole 😉 (but that's still pretty high)

The reason that there's been no response is because I'm not sure what to do here. I don't think it's necessarily true that spread() and gather() should be perfectly symmetric. In the motivating example in the initial issue, there isn't a unique identifier for each row, and it doesn't feel including with the tidyverse principles to automatically create one (because there are multiple valid choices).

A better error message would be possible. Using the current error style maybe something like:

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 5 rows:
* 2, 3
* 1, 4, 5
Do you need to create unique ID with tibble::rowid_to_column()?

What do you think?

hadley on 31 Aug 2018

👍2

While I agree that (in the presence of multiple alternatives) spread shouldn't deprive the user of the choice, I find it counter-intuitive that there's no "out of the box" solution to this fairly common problem (at least from beginner UX point of view).

I would expect that an optional parameter in spread would make sure I can spread with spread what gather has gathered. With this optional argument, the text of the error would be shown as warning with explanation of what shortcuts spread has employed to get the spreading job done. In this case the warning would say:

Warning: Each row of output must be identified by a unique combination of keys.
Keys are shared for 5 rows:
* 2, 3
* 1, 4, 5
Unique id has been created for each record within group, defined by unique values in c("gender").

Please, make spread more accessible to beginners.

dmi3kno on 1 Sep 2018

👍3

@dmi3kno that is not an acceptable solution to me because it conflates two tools. IMO that will make things easier for the very beginner only because you're selling them an oversimplified mental model.

hadley on 1 Sep 2018

I am totally in favor of improving error messages. I understand and respect your unwillingness to change your position on this. Let's agree to disagree.

If you could post the final text of the error message here, we will quickly create respective question on StackOverflow, because there will be many people searching for it ;) This is a rare opportunity to rank up reputation points.

dmi3kno on 16 Sep 2018

Reopening so that I remember to improve the error message.

hadley on 17 Sep 2018

@dmi3kno if you have written/are aware of an SO answer illustrating correct usage of tibble::rowid_to_column() in this context, do you mind linking to it here? There are quite a few related questions on SO already, but none offer a concise and 'proper' solution in my opinion. Thanks in advance!

soedr on 13 Nov 2018

👍4

I'm going to work on the error message for this. 🙌

juliasilge on 19 Jan 2019

👍6 🚀3 🎉3

This might need a new issue but its related to the error message. I tried using the advice of the warning message by inserting
Use values_fn = list(eviction_filing_rate = summary_fun) to summarise duplicates into my pivot_wider.
The other 2 tips work but not this one, it says 'summary_fun' not found.

Error in vals_dedup(key = val_id, val = val, value = value, summarize = values_fn[[value]]) : object 'summary_fun' not found