Hi, wasn't sure if this is a good place for feature requests, but here goes:
The tidyr package provides a very useful function "fill" which allows you to replace NAs with values from above or below the NA value(s). For example, I have a table with values such as:
> thing = data.frame(values = c(NA,NA,NA,1,NA,NA,NA,NA,NA,6,NA,NA,NA,10), row_num = 1:14)
> thing
values row_num
1 NA 1
2 NA 2
3 NA 3
4 1 4
5 NA 5
6 NA 6
7 NA 7
8 NA 8
9 NA 9
10 6 10
11 NA 11
12 NA 12
13 NA 13
14 10 14
and I'd like to have it fill in those NAs to get:
> fill(thing,values,.direction="up")
values row_num
1 1 1
2 1 2
3 1 3
4 1 4
5 6 5
6 6 6
7 6 7
8 6 8
9 6 9
10 6 10
11 10 11
12 10 12
13 10 13
14 10 14
Is there a workaround that would allow me to do this in conjunction with sparklyr (perhaps a Spark SQL statment that might work for example)? Ideally, I'd love to see the fill function implemented within the Sparklyr library!
@admoseremic I would try using dplyr::cummax
:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
thing <- data.frame(values = c(NA,NA,NA,1,NA,NA,NA,NA,NA,6,NA,NA,NA,10), row_num = 1:14)
thing_tbl <- copy_to(sc, thing)
thing_tbl %>%
arrange(row_num) %>%
mutate(values = cummax(ifelse(is.na(values), 0, values)))
# Source: lazy query [?? x 2]
# Database: spark_connection
# Ordered by: row_num
row_num values
<int> <dbl>
1 1 0
2 2 0
3 3 0
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 6
# ... with more rows
Notice that without a row_num
field, you could add this with thing_tbl %>% sdf_with_sequential_id()
.
Ah, great suggestion - although I really need the "1" value to be included with the 3 initial NA values, hence the ".direction=up" I had in the fill call above. Does that make sense?
Maybe doing a preemptive lag might do the trick. Lemme give that a try when I get a chance later tonight.
Using lag get's me close enough I think - I just added:
mutate(values = cummax(ifelse(is.na(lag(values)), 0, lag(values))))
which gave me:
# Source: lazy query [?? x 2]
# Database: spark_connection
# Ordered by: row_num
row_num values
<int> <dbl>
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
# ... with more rows
Just had to add that last row back in, and I'm good. Thanks for the tip!
@admoseremic I try to call
arrange(row_num) %>% mutate(values = cummin(ifelse(is.na(values),0,values)
but it can get the target result, as you mentioned as fill(thing,values,.direction="up")