The fill() function after a group_by(), especially if the number of groups is large, is more than 10x slower than mutate() with na.locf(), from the zoo package, yet gives identical results. Maybe I'm missing something and there is another way to peform this same operation?
library(dplyr)
library(tidyr)
library(zoo)
library(tibble)
n <- 1e6
df <- tibble(a = sample(paste("id", 1:(n/4)), n, replace = T),
b = sample(c("2012", "2013", "2014"), n, replace = T),
c = sample(c("NA", "A", "B", "C"), n, replace = T))
t1 <- system.time(df1 <-
df %>%
arrange(a, b) %>%
group_by(a) %>%
mutate(c = na.locf(c, na.rm=F))
)
print(t1)
#> user system elapsed
#> 21.45 0.06 21.81
t2 <- system.time(df2 <-
df %>%
arrange(a, b) %>%
group_by(a) %>%
fill(c)
)
print(t2)
#> user system elapsed
#> 313.37 0.47 316.75
print(identical(df1, df2))
#> [1] TRUE
Created on 2018-12-07 by the reprex package (v0.2.1)
There is a typo in the reprex, it should be c = sample(c(NA, "A", "B", "C"), n, replace = T)) but the results are the same (I don't know how to edit the issue)
I can confirm that this does seem rather slow!
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(zoo)
library(tibble)
})
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create some dummy data
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set.seed(2)
n <- 1e2
df <- tibble(
a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>%
arrange(a, b) %>%
group_by(a)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple functions to wrap 'zoo::na.locf()' and 'dplyr::fill'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
f_zoo <- function() { mutate(df, c = na.locf(c, na.rm=FALSE)) }
f_fill <- function() { fill(df, c)}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# benchmark
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identical(f_zoo(), f_fill())
#> [1] TRUE
res <- bench::mark(f_zoo(), f_fill())
res
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 f_zoo() 1.09ms 1.27ms 1.23ms 2.57ms 789. 90.4KB 7
#> 2 f_fill() 29.22ms 31.66ms 32.07ms 34.48ms 31.6 196.8KB 9
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>
plot(res)

Created on 2019-03-05 by the reprex package (v0.2.1)
It's likely slow because of the current implementation:
fill.grouped_df <- function(data, ..., .direction = c("down", "up", "downup", "updown")) {
dplyr::do(data, fill(., ..., .direction = .direction))
}
Probably the easiest experiment to make this faster would be to switch from do() to mutate_at()
A quick experiment suggests that this is likely to considerably improve performance:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
set.seed(2)
n <- 1e4
df <- tibble(
a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>% arrange(a, b)
gf <- df %>% group_by(a)
bench::mark(
ungrouped_fill = fill(df, c),
grouped_fill = fill(gf, c),
mutate_fillDown = mutate(gf, c = tidyr:::fillDown(c)),
mutate_na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 4
#> expression min mean median
#> <chr> <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill 887.19µs 1.1ms 934.43µs
#> 2 grouped_fill 2.85s 2.85s 2.85s
#> 3 mutate_fillDown 36.79ms 42.71ms 42.8ms
#> 4 mutate_na.locf 115.99ms 120.38ms 120.29ms
Created on 2019-03-08 by the reprex package (v0.2.1.9000)
Ooh, and it massively simplifies the implementation.
New performance:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
set.seed(2)
n <- 1e4
df <- tibble(
a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>% arrange(a, b)
gf <- df %>% group_by(a)
bench::mark(
ungrouped_fill = fill(df, c),
grouped_fill = fill(gf, c),
na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 4
#> expression min mean median
#> <chr> <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill 1.24ms 1.5ms 1.37ms
#> 2 grouped_fill 27.54ms 33.8ms 31.14ms
#> 3 na.locf 114.05ms 122.6ms 122.97ms
Great! Thanks for the hard work
Nice timing on this fix! I had a bit o' tidyverse code for a dataframe with 1.1M rows that was running just fine until I added two grouped fills. Then... it didn't seem like it was EVER going to finish. So, I reduced it to 100k rows and still had to wait >2 mins for it to finish. After a quick Google search to find this recently resolved(!) issue and a quick devtools::install_github("tidyverse/tidyr"), I'm back up and running and all 1.1 M rows are now finishing in 15 seconds! Nice work!
Most helpful comment
Ooh, and it massively simplifies the implementation.
New performance: