Sparklyr: Sparklyr substitute for "fill" function?

Created on 27 Sep 2017 · 4Comments · Source: sparklyr/sparklyr

Hi, wasn't sure if this is a good place for feature requests, but here goes:

The tidyr package provides a very useful function "fill" which allows you to replace NAs with values from above or below the NA value(s). For example, I have a table with values such as:

> thing = data.frame(values = c(NA,NA,NA,1,NA,NA,NA,NA,NA,6,NA,NA,NA,10), row_num = 1:14)
> thing
   values row_num
1      NA       1
2      NA       2
3      NA       3
4       1       4
5      NA       5
6      NA       6
7      NA       7
8      NA       8
9      NA       9
10      6      10
11     NA      11
12     NA      12
13     NA      13
14     10      14

and I'd like to have it fill in those NAs to get:

> fill(thing,values,.direction="up")
   values row_num
1       1       1
2       1       2
3       1       3
4       1       4
5       6       5
6       6       6
7       6       7
8       6       8
9       6       9
10      6      10
11     10      11
12     10      12
13     10      13
14     10      14

Is there a workaround that would allow me to do this in conjunction with sparklyr (perhaps a Spark SQL statment that might work for example)? Ideally, I'd love to see the fill function implemented within the Sparklyr library!

question

Source

admoseremic

All 4 comments

@admoseremic I would try using dplyr::cummax:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
thing <- data.frame(values = c(NA,NA,NA,1,NA,NA,NA,NA,NA,6,NA,NA,NA,10), row_num = 1:14)
thing_tbl <- copy_to(sc, thing)

thing_tbl %>%
  arrange(row_num) %>%
  mutate(values = cummax(ifelse(is.na(values), 0, values)))

# Source:     lazy query [?? x 2]
# Database:   spark_connection
# Ordered by: row_num
   row_num values
     <int>  <dbl>
 1       1      0
 2       2      0
 3       3      0
 4       4      1
 5       5      1
 6       6      1
 7       7      1
 8       8      1
 9       9      1
10      10      6
# ... with more rows

Notice that without a row_num field, you could add this with thing_tbl %>% sdf_with_sequential_id().

javierluraschi on 28 Sep 2017

Ah, great suggestion - although I really need the "1" value to be included with the 3 initial NA values, hence the ".direction=up" I had in the fill call above. Does that make sense?

Maybe doing a preemptive lag might do the trick. Lemme give that a try when I get a chance later tonight.

admoseremic on 29 Sep 2017

Using lag get's me close enough I think - I just added:

mutate(values = cummax(ifelse(is.na(lag(values)), 0, lag(values))))

which gave me:

# Source:     lazy query [?? x 2]
# Database:   spark_connection
# Ordered by: row_num
   row_num values
     <int>  <dbl>
 1       1      0
 2       2      0
 3       3      0
 4       4      0
 5       5      1
 6       6      1
 7       7      1
 8       8      1
 9       9      1
10      10      1
# ... with more rows

Just had to add that last row back in, and I'm good. Thanks for the tip!

admoseremic on 29 Sep 2017

👍1

@admoseremic I try to call
arrange(row_num) %>% mutate(values = cummin(ifelse(is.na(values),0,values)
but it can get the target result, as you mentioned as fill(thing,values,.direction="up")