If sparklyr had the equivalent or similar to SparkR::dapply then we could run R user defined functions on rows of data. This would allow using package like lubridate at scale and build a very valuable "method of last resort" for solving Spark issues.
The function I want would have a signature like the following:
SparkR::dapply(dataHandle, Rfunction)
The Rfunction source or closure (not sure which is more appropriate) would be shipped to the Spark cluster. Spark would cut the data represented by dataHandle into reasonable sized contiguous row-intervals and call the Rfunction on each piece as an R data.frame. The Rfunction must return a data.frame and then Spark would row-bind the pieces together in the same row-block order they were cut up. This would allow the user to apply an arbitrary row transformation (such as lubridate parsing) very quickly. This should be much cheaper than cutting up the data and sending it back to the controlling R instance as we are sending data to transient R instances on the cluster (as SparkR already does) and we thus also get parallelism.
I have an example of using the technique in SparkR here: https://github.com/WinVector/BigDataRStrata2017/blob/master/Exercises/solutions/06-Spark-Extension.md (please look for the SparkR:dapply section).
This is a huge capability that I think will greatly enhance Sparklyr. It also means a lot of other issues (such as https://github.com/rstudio/sparklyr/issues/648 ) could be solved in terms of this operator.
Closing this as a duplicate -- we are still optimistic that we can implement this in our next feature push on sparklyr. (For now we just want to manage the transition from dplyr 0.5.0 to dplyr 0.6.0)
Most helpful comment
Closing this as a duplicate -- we are still optimistic that we can implement this in our next feature push on
sparklyr. (For now we just want to manage the transition fromdplyr 0.5.0todplyr 0.6.0)