Drake: Questions on the first stage of ``make``

Created on 31 Oct 2017 · 29Comments · Source: ropensci/drake

Hi again!

I have a large project with ~30000 targets. When I run make, it takes several minutes for the first piece of visual feedback to print (I see packages loading after these first few minutes).

drake is doing something in these minutes.

1) Out of curiosity, what is it doing?
2) is there a visual feedback like a progress bar that might make sense for this first stage?
3) Are there any potential gains in efficiency you could think about that wouldn't compromise robustness?

Source

kendonB

All 29 comments

Did some digging and it seems to be the line:

command_deps <- lapply(plan$command, command_dependencies) in build_graph()

kendonB on 31 Oct 2017

One thing might be to first pull out unique functions and files, then process? You might be doing this already.

kendonB on 31 Oct 2017

@kendonB I was hoping for a huge test case like this!

You may be happy to know that that line of code now reads

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

Since you're not using a Windows machine, you can just set jobs to be whatever you want mc.cores to be so that mclapply() speeds things up. Since you're using the "future_lapply" backend for targets, this will not affect how the meat of your workflow is processed.

To speed things up further, I think we would need a C++-accelerated static code analysis tool to parse commands and look for dependencies (targets/imports). That would be a giant undertaking. Drake uses codetools plus some customizations, and the state of the art is CodeDepends. Neither has compiled code.

Do you think drake should log what it's doing to the console? It might be too much to print one line per target, but I was thinking something like resolve dependencies of imports and resolve dependencies of targets.

You should also check out the new sub-graphing functionality I implemented over the weekend. As long as you don't have too many imports, the interactive visNetwork graph should be sane, even with a 30000-target project. You might even speed things up by using the targets argument to plot_graph().

wlandau-lilly on 31 Oct 2017

👍1

Your comment here is an interesting thought. Currently, drake does not try to pull out unique code fragments or unique imports to analyze. Intuitively, I don't think having unique values would noticeably increase speed, and unique could take a long time on 30000 targets. On the other hand, I do not have projects as large as yours, so feel free to prove me wrong.

wlandau-lilly on 31 Oct 2017

I think that printing a line like resolve dependencies of imports or resolve dependencies of targets makes sense.

Regarding pulling out unique functions, I imagine most large projects are, like mine, just many small chunks of few operations. For example, I have 6 user-written functions generating the 30000 targets. On mine, it would greatly increase speed!

kendonB on 31 Oct 2017

I think I see. In a workflow plan with 30000 targets

##    target       command
## 1  target1  rnorm(1000)
## 2  target2  rnorm(1000)
...

"rnorm(1000)" should really only be processed once.

wlandau-lilly on 31 Oct 2017

You could go finer than that though. Even with commands my_fun(obj1), and my_fun(obj2) you could check the body of my_fun once.

kendonB on 31 Oct 2017

Actually, drake already does that last part just fine. It doesn't waste any time processing duplicate imports. Each code analysis only goes one level deep at a time (the body of one function or the code fragment of one command).

wlandau-lilly on 31 Oct 2017

👍1

You could also consider caching results so that future runs of make don't do the same work twice? Kind of like memoise but using the disk?

kendonB on 31 Oct 2017

Another potential speed up for usability could be to implement different levels of checking like remake? This is really nice while testing.

kendonB on 31 Oct 2017

Drake does already cache imports, but it always checks for changes. It always needs to check for changes somehow. I would assume storr does not cache items if their hashes have not changed, but that is a question for @richfitz.

What levels of checking would you like to see? If you have custom file targets/imports like report.md and report.Rmd in the basic example, drake uses file modification times to check whether to bother computing a new hash (#4), but that's about it.

wlandau-lilly on 31 Oct 2017

I quite liked remakes different levels: "exists" (current if the target exists), "depends" (current if exists and dependencies unchanged), "code" (current if exists and code unchanged) or "all" (current if exists and both dependencies and code unchanged).

Regarding caching, I was referring to the output of command_dependencies which seems different from caching imports?

kendonB on 31 Oct 2017

I did not know remake exposed those levels of checking to the user. To be honest, I still do not think I see the full practical use.

For drake, dependency checking is mostly in current.R and hash.R.

I was not actually sure about caching all the individual command dependencies for every target. I opted for a small manageable dependency hash, thinking it would cut down on storage and time spent caching.

wlandau-lilly on 31 Oct 2017

By the way, b18b43c89f91a3b46b37e39d921f0709d62424c1 and beyond has those console messages. Now, more people will be able to guess why build_graph() takes so long.

> load_basic_example()
> make(my_plan)
interconnect 7 items: ld, td, simulate, reg1, my_plan, reg2, dl
interconnect 15 items: 'report.md', small, large, regression1_small, regressi...
import knit
import 'report.Rmd'
...

wlandau-lilly on 31 Oct 2017

See this new Stack Overflow post. This seems like such a common problem, and I hope someone has solved it already.

wlandau-lilly on 31 Oct 2017

@kendonB, in 87de915a67cc21a3c0bc0b77ef6baf56623cf9be, see lightly_parallelize(), which now calls lightly_parallelize_atomic(). Should avoid duplication of effort in the processing of commands. I posted the general solution here.

wlandau-lilly on 31 Oct 2017

I think the last thing on this thread is the possible memoization of processing commands here. memoise with the file system cache is probably the best option. I am a bit reluctant because this memoization cache belongs inside the .drake/ storr cache, and passing the right path to build_graph() would be a pain.

Also, the code to memoise is this:

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)

It depends on both the commands and the number of jobs. I would want to memoize in a way that ignores jobs.

wlandau-lilly on 31 Oct 2017

See r-lib/memoise#54. There is resistance to optionally ignoring function parameters.

wlandau-lilly on 31 Oct 2017

I will memoize only if the memoise package handles this use case reasonably well.

wlandau-lilly on 31 Oct 2017

I think I have done all I can do on this thread at the moment. Will reopen if/when there are new possibilities or suggestions.

wlandau-lilly on 31 Oct 2017

👍1

Ran this again and it runs much faster with the parallelization. I noticed that the interconnect message came up twice for the same set of targets. Is this what you expect? It appears first then I see a bunch of import ..., then I see interconnect again, then check ..., then the making happens.

kendonB on 31 Oct 2017

Glad to hear the parallelism was an improvement. I expect your idea of lightly_parallelize_atomic() may have sped things up too (avoid analyzing the same workflow command twice). I do expect interconnect to show up twice: one for imports only, and another for the targets listed in your workflow plan. There are two calls to console_dependencies(), one here and another here. I suppose I should make different console messages for each. Do you see two interconnects for the same make()?

wlandau-lilly on 31 Oct 2017

FYI: I just made the distinction between the two interconnect (now connect) logs clearer. Now, one says imports and the other says targets.

load_basic_example()
make(my_plan)

## connect 9 imports: ld, lint_summary, gwd, td, simulate, reg1, my_plan, reg2, dl
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
...

Do you still get import ... messages before the connect ones?

wlandau-lilly on 31 Oct 2017

I actually see more than two (names removed):

cache ....
interconnect 42 items: ...
interconnect 24306 items: ...
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
import ... lots
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
check 7 items: ...

Note this was before your most recent change. I'm in the middle of building the whole project so I'm not how long it'll be before I can test the latest change.

kendonB on 31 Oct 2017

I understand.

I actually reproduced this on one of my larger projects. Happens with distributed parallelism only. Will explain tonight or tomorrow.

wlandau-lilly on 31 Oct 2017

I also see the same behavior when running drake::outdated(my_plan, jobs = 8) so it seems to also happen with vanilla parallelism?

kendonB on 1 Nov 2017

Yes, I am seeing redundant work in outdated() through config(). Currently working on refactoring the internals.

wlandau-lilly on 1 Nov 2017

@kendonB you just rooted out major inefficiencies in drake. The fixes are effective as of 29f6d29badd92fc3830f0583c5f274923df591b4. Not only did those extra connect messages go away, but the unit tests became blazing fast! This makes my life so much easier, and I cannot thank you enough! If it sounds good to you, please submit a pull request to add yourself as a contributor above this line in the DESCRIPTION.

wlandau-lilly on 1 Nov 2017

👍1

No need to add me as a contributor. Thanks for the fixes.

kendonB on 1 Nov 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

AppVeyor CI disconnected

wlandau · 9Comments

Interested in trying the rOpenSci pkgdown template?

maelle · 8Comments

How to create a jagged cross() transform

htlin · 4Comments

Prohibit dynamic branching over unbranched dynamic files.

wlandau · 4Comments

rmarkdown::render with transform

htlin · 3Comments