Dplyr: Fix dplyr_reconstruct genericity

Created on 29 Apr 2020  Â·  5Comments  Â·  Source: tidyverse/dplyr

It should be dispatching on template. Can also remove tibble check now that we no longer rely on tibble methods

bug

Most helpful comment

Random thought that I wanted to quickly capture — maybe the long term solution is to make it easier to write correct [ methods for data frame subclasses. Here we've just carved out the limited pieces that dplyr absolutely needs, but if we knew that [ did the "right" thing, we could probably use it in more places in dplyr (and eliminating a few cases where we have to operations in multiple steps because these existing generics aren't quite rich enough).

Using what we've learned in tibble, we could write a function that takes df, i, j, and ... and returns the indexing requirements in a standardised format (maybe integer vectors for i and j, and then some sentinel representing no change). Could also possibly record whether it creates duplicate rows/cols.

All 5 comments

I think it currently does dispatch on template right?

I was also thinking about letting dplyr_col_modify.data.frame() call dplyr_reconstruct() instead of manually restoring the attributes like it currently does. The manual attribute restoration process is almost the exact same as what dplyr_reconstruct.data.frame() does now. However, it would require removing the is_tibble() check in dplyr_reconstruct().

Similarly, even though in theory vec_slice() retains the right attributes in dplyr_row_slice.data.frame(), we _could_ call vec_slice() and then call dplyr_reconstruct() before returning from dplyr_row_slice.data.frame() as well.

I'd be curious to see if that would reduce the number of customization points that are generally required for data frame subclasses. If a dplyr_reconstruct() method is implemented, and both dplyr_col_modify() and dplyr_row_slice() call it, would methods really be needed for those as well? I guess if there are vectorized attributes, the methods probably would be needed still.

Oh, doh yeah.

Yeah, I think it would be fine to use dplyr_reconstruct() from dplyr_col_modify(). We just need to remember that these methods can generally not use [.

And I think using dplyr_reconstruct in dplyr_row_slice() makes sense too — then you'd only need to implement dplyr_col_modify() and dplyr_row_slice() if there were either special row-wise or column-wise attributes, or there was some performance advantage.

I think generally advice to package authors could be:

  • If you have data frame attributes that don't depend on the rows or columns present (meaning they should unconditionally be preserved), nothing to do.

  • If you have _scalar_ data frame attributes that depend on the _rows_ present, implement a dplyr_reconstruct() method.

    • This way, when dplyr_row_slice() is called, by calling dplyr_reconstruct() at the end you get a chance to check if the important row is present, and modify the scalar attribute accordingly.
  • If you have _scalar_ data frame attributes that depend on the _columns_ present, then implement a dplyr_reconstruct() method, and a 1D [ method.

    • For example, there might be some boolean value that is TRUE if a specific column is present, otherwise FALSE. If dplyr_col_modify() adds the column through mutate(), then by calling dplyr_reconstruct() it gives the developer a chance to flip that boolean attribute.

    • The [ method is needed because select() could drop the column, so the developer needs a way to flip the boolean attribute here too.

  • If you have _vectorized_ data frame attributes that depend on the _rows_, implement a dplyr_row_slice() method. This way you have access to i so you can slice your attributes accordingly.

  • If you have _vectorized_ data frame attributes that depend on the _columns_, implement a dplyr_col_modify() and 1D [ method.

    • For dplyr_col_modify() you get access to cols so you can update your colwise attribute.

    • For [ you get access to the column selection index i, so you can update the colwise attribute.

@topepo does that make sense to you?

Random thought that I wanted to quickly capture — maybe the long term solution is to make it easier to write correct [ methods for data frame subclasses. Here we've just carved out the limited pieces that dplyr absolutely needs, but if we knew that [ did the "right" thing, we could probably use it in more places in dplyr (and eliminating a few cases where we have to operations in multiple steps because these existing generics aren't quite rich enough).

Using what we've learned in tibble, we could write a function that takes df, i, j, and ... and returns the indexing requirements in a standardised format (maybe integer vectors for i and j, and then some sentinel representing no change). Could also possibly record whether it creates duplicate rows/cols.

Was this page helpful?
0 / 5 - 0 ratings