Dplyr: Fix dplyr_reconstruct genericity

Created on 29 Apr 2020 · 5Comments · Source: tidyverse/dplyr

It should be dispatching on template. Can also remove tibble check now that we no longer rely on tibble methods

bug

Source

hadley

Most helpful comment

Random thought that I wanted to quickly capture — maybe the long term solution is to make it easier to write correct [ methods for data frame subclasses. Here we've just carved out the limited pieces that dplyr absolutely needs, but if we knew that [ did the "right" thing, we could probably use it in more places in dplyr (and eliminating a few cases where we have to operations in multiple steps because these existing generics aren't quite rich enough).

Using what we've learned in tibble, we could write a function that takes df, i, j, and ... and returns the indexing requirements in a standardised format (maybe integer vectors for i and j, and then some sentinel representing no change). Could also possibly record whether it creates duplicate rows/cols.

hadley on 30 Apr 2020

👍3

All 5 comments

I think it currently does dispatch on template right?

I was also thinking about letting dplyr_col_modify.data.frame() call dplyr_reconstruct() instead of manually restoring the attributes like it currently does. The manual attribute restoration process is almost the exact same as what dplyr_reconstruct.data.frame() does now. However, it would require removing the is_tibble() check in dplyr_reconstruct().

Similarly, even though in theory vec_slice() retains the right attributes in dplyr_row_slice.data.frame(), we _could_ call vec_slice() and then call dplyr_reconstruct() before returning from dplyr_row_slice.data.frame() as well.

I'd be curious to see if that would reduce the number of customization points that are generally required for data frame subclasses. If a dplyr_reconstruct() method is implemented, and both dplyr_col_modify() and dplyr_row_slice() call it, would methods really be needed for those as well? I guess if there are vectorized attributes, the methods probably would be needed still.

DavisVaughan on 29 Apr 2020

Oh, doh yeah.

Yeah, I think it would be fine to use dplyr_reconstruct() from dplyr_col_modify(). We just need to remember that these methods can generally not use [.

And I think using dplyr_reconstruct in dplyr_row_slice() makes sense too — then you'd only need to implement dplyr_col_modify() and dplyr_row_slice() if there were either special row-wise or column-wise attributes, or there was some performance advantage.

hadley on 29 Apr 2020

I think generally advice to package authors could be:

If you have data frame attributes that don't depend on the rows or columns present (meaning they should unconditionally be preserved), nothing to do.
If you have _scalar_ data frame attributes that depend on the _rows_ present, implement a dplyr_reconstruct() method.
- This way, when dplyr_row_slice() is called, by calling dplyr_reconstruct() at the end you get a chance to check if the important row is present, and modify the scalar attribute accordingly.
If you have _scalar_ data frame attributes that depend on the _columns_ present, then implement a dplyr_reconstruct() method, and a 1D [ method.
- For example, there might be some boolean value that is TRUE if a specific column is present, otherwise FALSE. If dplyr_col_modify() adds the column through mutate(), then by calling dplyr_reconstruct() it gives the developer a chance to flip that boolean attribute.
- The [ method is needed because select() could drop the column, so the developer needs a way to flip the boolean attribute here too.
If you have _vectorized_ data frame attributes that depend on the _rows_, implement a dplyr_row_slice() method. This way you have access to i so you can slice your attributes accordingly.
If you have _vectorized_ data frame attributes that depend on the _columns_, implement a dplyr_col_modify() and 1D [ method.
- For dplyr_col_modify() you get access to cols so you can update your colwise attribute.
- For [ you get access to the column selection index i, so you can update the colwise attribute.