The fill() function after a group_by(), especially if the number of groups is large, is more than 10x slower than mutate() with na.locf() from the zoo package, yet gives identical results. Maybe I'm missing something and there is another way to perform this same operation?
library(dplyr)
library(tidyr)
library(zoo)
library(tibble)
n <- 1e6
df <- tibble(a = sample(paste("id", 1:(n/4)), n, replace = T),
             b = sample(c("2012", "2013", "2014"), n, replace = T),
             c = sample(c("NA", "A", "B", "C"), n, replace = T))
t1 <- system.time(df1 <-
  df %>%
    arrange(a, b) %>%
    group_by(a) %>%
    mutate(c = na.locf(c, na.rm=F))
)
print(t1)
#> user system elapsed
#> 21.45 0.06 21.81
t2 <- system.time(df2 <-
  df %>%
    arrange(a, b) %>%
    group_by(a) %>%
    fill(c)
)
print(t2)
#> user system elapsed
#> 313.37 0.47 316.75
print(identical(df1, df2))
#> [1] TRUE
Created on 2018-12-07 by the reprex package (v0.2.1)
There is a typo in the reprex: it should be c = sample(c(NA, "A", "B", "C"), n, replace = T)) but the results are the same. (I don't know how to edit the issue.)
I can confirm that this does seem rather slow!
suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(zoo)
  library(tibble)
})
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create some dummy data
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set.seed(2)
n <- 1e2
df <- tibble(
  a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>%
  arrange(a, b) %>%
  group_by(a)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple functions to wrap 'zoo::na.locf()' and 'tidyr::fill()'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
f_zoo <- function() { mutate(df, c = na.locf(c, na.rm=FALSE)) }
f_fill <- function() { fill(df, c)}
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# benchmark
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identical(f_zoo(), f_fill())
#> [1] TRUE
res <- bench::mark(f_zoo(), f_fill())
res
#> # A tibble: 2 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 f_zoo() 1.09ms 1.27ms 1.23ms 2.57ms 789. 90.4KB 7
#> 2 f_fill() 29.22ms 31.66ms 32.07ms 34.48ms 31.6 196.8KB 9
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>
plot(res)
Created on 2019-03-05 by the reprex package (v0.2.1)
It's likely slow because of the current implementation:
fill.grouped_df <- function(data, ..., .direction = c("down", "up", "downup", "updown")) {
  dplyr::do(data, fill(., ..., .direction = .direction))
}
Probably the easiest experiment to make this faster would be to switch from do() to mutate_at().
A quick experiment suggests that this is likely to considerably improve performance:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>% arrange(a, b)
gf <- df %>% group_by(a)
bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  mutate_fillDown = mutate(gf, c = tidyr:::fillDown(c)),
  mutate_na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 4
#> expression min mean median
#> <chr> <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill 887.19µs 1.1ms 934.43µs
#> 2 grouped_fill 2.85s 2.85s 2.85s
#> 3 mutate_fillDown 36.79ms 42.71ms 42.8ms
#> 4 mutate_na.locf 115.99ms 120.38ms 120.29ms
Created on 2019-03-08 by the reprex package (v0.2.1.9000)
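For readers who want to see the shape of the do() to mutate_at() switch without relying on tidyr internals, here is a minimal sketch. locf_sketch() is a hypothetical helper written for this illustration (it is not tidyr's fillDown() and not part of zoo), and the toy data is made up; it only shows how a per-group fill can be expressed as a single grouped mutate_at() call.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of tidyr or zoo): carry the last
# non-missing value forward, leaving leading NAs untouched.
locf_sketch <- function(x) {
  seen <- cumsum(!is.na(x))  # count of non-missing values seen so far
  seen[seen == 0] <- NA      # positions before the first value stay NA
  x[!is.na(x)][seen]         # look up the most recent non-missing value
}

# Made-up toy data: two groups, one of which starts with an NA.
toy <- tibble(
  g = c("x", "x", "x", "y", "y"),
  v = c("A", NA, NA, NA, "B")
)

# A grouped mutate_at() applies the helper within each group,
# so values never leak from group "x" into group "y".
filled <- toy %>%
  group_by(g) %>%
  mutate_at(vars(v), locf_sketch)

filled$v
```

In this sketch the leading NA in group "y" stays NA because nothing precedes it within its own group. Note that mutate_at() is superseded by across() in dplyr 1.0.0 and later, so newer code would likely write mutate(across(v, locf_sketch)) instead.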
Ooh, and it massively simplifies the implementation.
New performance:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4)) , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C") , n, replace = TRUE)
) %>% arrange(a, b)
gf <- df %>% group_by(a)
bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 4
#> expression min mean median
#> <chr> <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill 1.24ms 1.5ms 1.37ms
#> 2 grouped_fill 27.54ms 33.8ms 31.14ms
#> 3 na.locf 114.05ms 122.6ms 122.97ms
Great! Thanks for the hard work
Nice timing on this fix! I had a bit o' tidyverse code for a dataframe with 1.1M rows that was running just fine until I added two grouped fills. Then... it didn't seem like it was EVER going to finish. So, I reduced it to 100k rows and still had to wait >2 mins for it to finish. After a quick Google search to find this recently resolved(!) issue and a quick devtools::install_github("tidyverse/tidyr"), I'm back up and running and all 1.1M rows are now finishing in 15 seconds! Nice work!