Drake: Feature Request: Tight Interaction with knitr caching

Created on 30 Dec 2019 · 10 Comments · Source: ropensci/drake

Prework

[x] Read and abide by drake's code of conduct.
[x] Search for duplicates among the existing issues, both open and closed.

Proposal

I often generate large-ish reports with knitr (100-3000 pages). Usually while iterating through these reports, I generate the data for making tables and figures within drake and then do the reporting with knitr. Generating hundreds to thousands of tables and figures within knitr/rmarkdown takes a long time.

The request: It would be helpful if drake could have granular control over knitr caching (likely via the knitr::opts_chunk$set(cache=XXX) function). The overall goal would be to avoid re-creating the intermediate .pdf files (e.g. for figures) and so speed up report creation.
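For context, knitr's built-in chunk caching is switched on through chunk options, as in the fragment below. This is only a sketch of the knitr side: the cache directory name is an assumption, and knitr decides whether to reuse a chunk from the chunk's own code and options, not from the state of any drake target.

```r
# Typically placed in the setup chunk of the .Rmd file.
knitr::opts_chunk$set(
  cache = TRUE,                  # reuse chunk results when the chunk is unchanged
  cache.path = "report_cache/",  # prefix for cache files (assumed directory name)
  dev = "pdf"                    # write figures as PDF files
)
```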

I recognize that this becomes rather complex with the interaction of child documents, etc. It may be too much of a rabbit hole, and I'd accept that.

new feature

Source

billdenney

Most helpful comment

Here is a clean way to offload expensive rendering to drake targets without needing to mess around with file_out(). I think you will like it. It would probably make a good chapter in the manual.

```` r
# https://stackoverflow.com/questions/32906566/create-ggplot2-plot-in-memory

# packages
library(drake)
library(magick)
#> Linking to ImageMagick 6.9.7.4
#> Enabled features: fontconfig, freetype, fftw, lcms, pango, x11
#> Disabled features: cairo, ghostscript, rsvg, webp
library(rmarkdown)
library(tidyverse)
library(webshot)

# functions
gen_data <- function(n) {
  data.frame(
    x = rbeta(n = n, shape1 = 0.5, shape2 = 0.5),
    y = runif(n = n)
  )
}

# The bitmap is small, but it takes a long time to render.
plot1 <- function(data) {
  gg <- ggplot(data) +
    geom_histogram(aes(x = x), bins = 8)
  ggmem(gg)
}

# Same for the second plot.
plot2 <- function(data) {
  gg <- ggplot(data) +
    geom_histogram(aes(x = y), bins = 8)
  ggmem(gg)
}

# This function is the key.
# It serializes the image as a bitmap in memory
# so we can save and load the rendered version.
ggmem <- function(gg) {
  fig <- image_graph(width = 400, height = 400, res = 96)
  print(gg)
  dev.off()
  image_data(fig)
}

# report
report_lines <- c(
  "---",
  "title: report",
  "---",
  "",
  "```{r}", "library(drake)", "library(magick)", "```",
  "",
  "```{r}", "image_read(readd(img1))", "```",
  "",
  "```{r}", "image_read(readd(img2))", "```"
)

writeLines(report_lines, "report.Rmd")

cat(readLines("report.Rmd"), sep = "\n")
#> ---
#> title: report
#> ---
#>
#> ```{r}
#> library(drake)
#> library(magick)
#> ```
#>
#> ```{r}
#> image_read(readd(img1))
#> ```
#>
#> ```{r}
#> image_read(readd(img2))
#> ```

# plan
plan <- drake_plan(
  data = target(gen_data(1e7), format = "fst"),
  img1 = plot1(data),
  img2 = plot2(data),
  report = render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

config <- drake_config(plan)
vis_drake_graph(config)
#> Warning: argument config is deprecated. Use ... to supply make() arguments such
#> as the plan instead.
````

```` r
# The qs format is in development;
# it could help for large bitmaps.
make(plan, format = "qs")
#> target data
#> target img1
#> target img2
#> target report

# We now have a report.
webshot("report.html")
````

```` r
# Those images rendered slowly and the report rendered fast!
build_times(type = "command")
#> # A tibble: 4 x 4
#>   target elapsed user    system
#>
#> 1 data   2.281s  2.224s  0.055s
#> 2 img1   23.273s 21.499s 1.759s
#> 3 img2   21.328s 19.641s 1.679s
#> 4 report 1.977s  0.195s  0.048s

# Skip the image rendering next time you
# edit the report.
report_lines <- c(
  "---",
  "title: edited report", # changed title here
  "---",
  "",
  "```{r}", "library(drake)", "library(magick)", "```",
  "",
  "Added some new text here.", # new text here
  "",
  "```{r}", "image_read(readd(img1))", "```",
  "",
  "```{r}", "image_read(readd(img2))", "```"
)

writeLines(report_lines, "report.Rmd")

vis_drake_graph(config)
#> Warning: argument config is deprecated. Use ... to supply make() arguments such
#> as the plan instead.
````

```` r
make(plan, format = "qs")
#> target report
````

Created on 2019-12-30 by the reprex package (v0.3.0)

wlandau on 31 Dec 2019

👍2

All 10 comments

Could we talk about workarounds first? I think there could be some quicker wins if we look at your workflow differently.

A couple things that would help to know:

Are you using knitr_in() in the plan?
How many reports do you have?
How much time does it take to render each report from scratch?

Many experts in our field claim that reproducibility has become synonymous with literate programming. I strongly disagree. Literate programming in large projects is messy. It breaks down because the paradigm opposes modularity, and we need modularity for clean code and optimal caching. Literate programming struggles to scale for the same reasons ordinary script-based workflows struggle to scale, which are the same reasons why I created drake.

So let's see if we can make your reports less script-oriented and more pipeline-oriented. I have been trying to tell users to do as little as possible in knitr and R Markdown reports. Reports are great for summarizing targets you already created, but not for any new heavy computation. In other words, the best kind of report in a drake workflow is one whose code chunks only contain loadd() and readd() to reference previous targets. Rendering times could still get annoyingly large, but not if you pre-render enough plot files in earlier targets and declare them with file_out(). You can still use readd(target_that_produced_the_files) to declare groups of files as dependencies of a knitr report.

wlandau on 30 Dec 2019

One approach:

make_plot1 <- function(data, file) {
  pl <- ggplot(data) +
     ...
  # Pass the plot explicitly instead of relying on last_plot().
  ggsave(file, plot = pl)
}

plan <- drake_plan(
  ...
  targ_plot1 = make_plot1(data, file_out("plot1.pdf")),
  targ_report = render(knitr_in("report.Rmd"), output_file = file_out("report.pdf"))
)

where "report.Rmd" contains loadd(targ_plot1) in an active code chunk.

wlandau on 30 Dec 2019

👍1

I'm all for the best/most efficient solution which could include workflow changes.

I do use knitr_in() in the plan.
I tend to write a new report like this about every 2 weeks. (I'm not sure that's the answer to the question you're asking. Please let me know if that isn't what you meant.)
Rendering a moderate complexity report takes between 5 and 10 minutes. A very long report takes about 20 minutes. If it would take longer than that, I tend to split it into different reports.

I tend toward doing as little as possible within the report itself. Individual chunks in a report end up looking a lot like the following, and I generally try to precompute data_from_target so that it is close to the final dataset needed and loads quickly.

d_plot <-
  readd(data_from_target) %>%
  filter(the_data_of_interest_for_this_figure_or_table)

ggplot(d_plot, aes(x=x, y=y)) + geoms()

d_table <-
  readd(data_from_target) %>%
  filter(the_data_of_interest_for_this_figure_or_table)

pander(d_table)

The actual ggplot object can get a good bit more complex, but the complexity is all in additional geoms and formatting, with not much direct calculation. The table generation is usually almost that simple (just adding captions). Filtering the data takes almost no time.

The optimization that I don't use which you suggest is pre-rendering the figures. I don't do that because it tends to remove some of the helpful automatic formatting that is possible within knitr. (Please tell me if I'm missing a simple fix for that.)

To me (though without profiling), the overall time delay is based on rendering the figures to a file for report inclusion.

billdenney on 30 Dec 2019

> To me (though without profiling), the overall time delay is based on rendering the figures to a file for report inclusion.

This would be unsurprising; rendering is a common bottleneck.

> The optimization that I don't use which you suggest is pre-rendering the figures. I don't do that because it tends to remove some of the helpful automatic formatting that is possible within knitr. (Please tell me if I'm missing a simple fix for that.)

Still, it seems like the most expedient thing to do. Which formatting options do you use? It seems like we might find equivalents in ggsave(). There's a package that makes it easy to change the formal arguments of functions, possibly better than just formals(ggsave) <- ..., but I cannot think of the name.

For what it's worth, if you want all targets to share the same set of custom global options, you can set the prework argument of make(), e.g. make(plan, prework = quote(options(warn = 2))).

Another choice might be to split up a large report into multiple files over multiple targets and then splice the output pdfs together with staplr.

wlandau on 30 Dec 2019

Let me work toward pre-rendering everything via ggsave(). The main things that I modify are width and height which are simple enough to modify. It does make it harder if I want to build both .pdf and .html since they have different defaults for output size in knitr, but there are some work-arounds for that, too.
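One way to handle the differing .pdf and .html sizes mentioned above is to keep the dimensions in a single lookup and pass them to whatever saves the figure. The sketch below is a hypothetical helper, not part of drake or knitr; the function name `figure_size()` and the numbers are placeholders, not knitr's actual defaults.

```r
# Hypothetical lookup of figure sizes (in inches) per output format.
# The values are placeholders chosen for illustration only.
figure_size <- function(format = c("pdf", "html")) {
  format <- match.arg(format)
  switch(format,
    pdf = list(width = 6.5, height = 4.5),
    html = list(width = 7, height = 5)
  )
}

size <- figure_size("pdf")

# With ggplot2, the same sizes could then be passed on as
#   ggsave(file, plot = pl, width = size$width, height = size$height)
```

Keeping the sizes in one function means a target that pre-renders a figure for both formats only differs in the `format` argument.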

I don't think that staplr would work for most of my use cases as I tend to require single tables of contents which can reach within each section.

I don't think that I'll be rewriting the current project, but I'll try prerendering on the next project. I'll close for now, and reopen if prerendering isn't the solution that I'm needing.

billdenney on 30 Dec 2019

👍1

Glad we are aligned. If it is just the width and height, I think the problem is straightforward and someone at https://community.rstudio.com/ will probably know what to do.

As an aside, this is the first time I am thinking about drake as a way to save time rendering plots, so I am willing to explore the use case further. Rather than use file_out() and save every figure to a file, it might be more convenient to return the actual in-memory bitmap as the target for each plot. Maybe webshot can accomplish this already, and maybe the magick package also has a role.

wlandau on 30 Dec 2019


@billdenney, if you use the technique in https://github.com/ropensci/drake/issues/1126#issuecomment-569845228, would you be willing to write about it as an rOpenSci use case? Not only does it solve a ubiquitous frustration in R Markdown reports, it also highlights two rOpenSci packages in one go. cc @stefaniebutland.

wlandau on 31 Dec 2019

😄1

I tend to write my .pdfs with vector graphics, so I'll have to work out a variant of your proposed solution, if possible. That said, it appears that knitr::include_graphics() will work for the ggsave() method. And, I can likely work out some form of efficiency where it returns the .pdf filename.
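A minimal sketch of the file-name variant described above: the target's command writes a vector .pdf and returns the path, and the report chunk hands that path to knitr::include_graphics(). The helper name `save_hist_pdf()`, the file name, and the use of base graphics in place of ggplot2 and ggsave() are assumptions made to keep the sketch free of extra dependencies.

```r
# Hypothetical helper (not from the thread): render a histogram to a vector
# PDF and return the file path, so that the target's value is the path.
save_hist_pdf <- function(data, file, width = 6, height = 4) {
  grDevices::pdf(file, width = width, height = height)
  # Close the graphics device even if plotting fails.
  on.exit(grDevices::dev.off(), add = TRUE)
  graphics::hist(data$x, main = "", xlab = "x")
  invisible(file)
}

set.seed(1)
fig_path <- save_hist_pdf(
  data.frame(x = stats::rbeta(1000, shape1 = 0.5, shape2 = 0.5)),
  file.path(tempdir(), "fig1.pdf")
)

# In a plan, the command could be written as
#   fig1 = save_hist_pdf(data, file_out("fig1.pdf"))
# and a report chunk could then contain
#   knitr::include_graphics(readd(fig1))
```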

I can also imagine a solution that may write the .pdf file to a connection that is backed by a target instead of an actual temporary file during the plan, but that may cut out some of the speed advantage.
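A rough sketch of the idea of letting the target itself hold the rendered .pdf: the command returns the file's bytes as a raw vector, and the report writes them back out before including the figure. Both helper names are hypothetical. Note that this sketch still goes through a temporary file, because the base pdf() device writes to a file path, so it does not remove that cost.

```r
# Hypothetical helpers (names are assumptions, not drake API).

# Render to a temporary PDF and return the file's bytes as a raw vector,
# so the cached target value is the rendered figure itself.
render_pdf_bytes <- function(draw, width = 6, height = 4) {
  tmp <- tempfile(fileext = ".pdf")
  on.exit(unlink(tmp), add = TRUE)
  grDevices::pdf(tmp, width = width, height = height)
  # Always close the device, even if the drawing code fails.
  tryCatch(draw(), finally = grDevices::dev.off())
  readBin(tmp, what = "raw", n = file.size(tmp))
}

# Write cached bytes back to disk, e.g. from inside a report chunk.
write_pdf_bytes <- function(bytes, file = tempfile(fileext = ".pdf")) {
  writeBin(bytes, file)
  file
}

bytes <- render_pdf_bytes(function() graphics::plot(1:10))
restored <- write_pdf_bytes(bytes)
```

In a report chunk, the restored path could be passed to knitr::include_graphics() in the same way as a file written by ggsave().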

I'll play around and report back... (It'll probably be a few weeks.)

billdenney on 31 Dec 2019

👍1

Awesome, thanks! I look forward to learning what you figure out.

wlandau on 31 Dec 2019
