Tidyr: Performance of fill() after group_by()

Created on 7 Dec 2018 · 7 Comments · Source: tidyverse/tidyr

Calling fill() after group_by(), especially when the number of groups is large, is more than 10x slower than mutate() with na.locf() from the zoo package, yet it gives identical results. Maybe I'm missing something and there is another way to perform this same operation?

library(dplyr)
library(tidyr)
library(zoo)
library(tibble)

n <- 1e6
df <- tibble(a = sample(paste("id", 1:(n/4)), n, replace = T),
             b = sample(c("2012", "2013", "2014"), n, replace = T),
             c = sample(c("NA", "A", "B", "C"), n, replace = T))


t1 <- system.time(df1 <-
                    df %>% 
                      arrange(a, b) %>% 
                      group_by(a) %>% 
                      mutate(c = na.locf(c, na.rm=F))
                  )

print(t1)
#>    user  system elapsed 
#>   21.45    0.06   21.81

t2 <- system.time(df2 <-
                    df %>%
                      arrange(a, b) %>% 
                      group_by(a) %>%
                      fill(c)
                  )
print(t2)
#>    user  system elapsed 
#>  313.37    0.47  316.75
print(identical(df1, df2))
#> [1] TRUE

Created on 2018-12-07 by the reprex package (v0.2.1)
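
For readers unfamiliar with the operation being timed, here is a minimal sketch of a grouped fill on a toy tibble. The data and variable names are made up for illustration and are not part of the original report. It shows that a grouped fill() carries the last non-missing value forward within each group only, whereas an ungrouped fill() carries it across group boundaries.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Toy data (hypothetical): two ids, each with one missing value
toy <- tibble(
  id    = c("a", "a", "b", "b"),
  value = c("X", NA, NA, "Y")
)

# Grouped: the leading NA in group "b" has nothing before it, so it stays NA
grouped <- toy %>% group_by(id) %>% fill(value)
grouped$value
#> [1] "X" "X" NA  "Y"

# Ungrouped: "X" leaks from group "a" into group "b"
ungrouped <- toy %>% fill(value)
ungrouped$value
#> [1] "X" "X" "X" "Y"
```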

feature pivoting

All 7 comments

There is a typo in the reprex: it should be c = sample(c(NA, "A", "B", "C"), n, replace = T)) but the results are the same. (I don't know how to edit the issue.)
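
A minimal base-R sketch of why the typo matters for the fill itself (the vector names are made up for illustration): the string "NA" is an ordinary value rather than a missing one, so in the original reprex neither function has anything to fill.

```r
# "NA" in quotes is just a two-character string
x_string  <- c("A", "NA", "B")
# Unquoted NA is a real missing value
x_missing <- c("A", NA, "B")

is.na(x_string)
#> [1] FALSE FALSE FALSE
is.na(x_missing)
#> [1] FALSE  TRUE FALSE
```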

I can confirm that this does seem rather slow!

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(zoo)
  library(tibble)
})

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create some dummy data
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set.seed(2)
n <- 1e2
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>%
  arrange(a, b) %>%
  group_by(a)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple functions to wrap 'zoo::na.locf()' and 'tidyr::fill()'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
f_zoo  <- function() { mutate(df, c = na.locf(c, na.rm=FALSE)) }
f_fill <- function() { fill(df, c)}


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# benchmark
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identical(f_zoo(), f_fill())
#> [1] TRUE
res <- bench::mark(f_zoo(), f_fill())

res
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 f_zoo()     1.09ms  1.27ms  1.23ms  2.57ms     789.     90.4KB     7
#> 2 f_fill()   29.22ms 31.66ms 32.07ms 34.48ms      31.6   196.8KB     9
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

plot(res)

Created on 2019-03-05 by the reprex package (v0.2.1)

It's likely slow because of the current implementation:

fill.grouped_df <- function(data, ..., .direction = c("down", "up", "downup", "updown")) {
  dplyr::do(data, fill(., ..., .direction = .direction))
}

Probably the easiest experiment to make this faster would be to switch from do() to mutate_at().
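
As a rough sketch of what that switch might look like (this is not tidyr's actual code): the helpers fill_down_vec() and fill_grouped_sketch() below are hypothetical names introduced only for illustration, and the fill-down logic is a plain base-R stand-in rather than tidyr's internal routine. The idea is that mutate_at() on a grouped data frame applies a vector function to each column per group, instead of do() rebuilding a data frame for every group.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: carry the last non-missing value forward in a vector
fill_down_vec <- function(x) {
  # Position of the most recent non-NA element, 0 if none seen yet
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs have no earlier value, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

# Hypothetical grouped fill: mutate_at() respects the existing grouping,
# so fill_down_vec() runs once per group on each named column
fill_grouped_sketch <- function(data, cols) {
  mutate_at(data, cols, fill_down_vec)
}

toy <- tibble(
  g = c("x", "x", "x", "y", "y"),
  v = c(NA, "A", NA, NA, "B")
)
out <- fill_grouped_sketch(group_by(toy, g), "v")
out$v
#> [1] NA  "A" "A" NA  "B"
```

Note that mutate_at() is superseded in current dplyr in favor of across(); it is used here only because it is the function named in the discussion.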

A quick experiment suggests that this is likely to considerably improve performance:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>% arrange(a, b) 
gf <- df %>% group_by(a)


bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  mutate_fillDown = mutate(gf, c = tidyr:::fillDown(c)),
  mutate_na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 4
#>   expression           min     mean   median
#>   <chr>           <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill  887.19µs    1.1ms 934.43µs
#> 2 grouped_fill       2.85s    2.85s    2.85s
#> 3 mutate_fillDown  36.79ms  42.71ms   42.8ms
#> 4 mutate_na.locf  115.99ms 120.38ms 120.29ms

Created on 2019-03-08 by the reprex package (v0.2.1.9000)

Ooh, and it massively simplifies the implementation.

New performance:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>% arrange(a, b) 
gf <- df %>% group_by(a)

bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 4
#>   expression          min     mean   median
#>   <chr>          <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill   1.24ms    1.5ms   1.37ms
#> 2 grouped_fill    27.54ms   33.8ms  31.14ms
#> 3 na.locf        114.05ms  122.6ms 122.97ms

Great! Thanks for the hard work

Nice timing on this fix! I had a bit o' tidyverse code for a dataframe with 1.1M rows that was running just fine until I added two grouped fills. Then... it didn't seem like it was EVER going to finish. So, I reduced it to 100k rows and still had to wait >2 mins for it to finish. After a quick Google search to find this recently resolved(!) issue and a quick devtools::install_github("tidyverse/tidyr"), I'm back up and running and all 1.1 M rows are now finishing in 15 seconds! Nice work!
