Hi again!
I have a large project with ~30000 targets. When I run make, it takes several minutes for the first piece of visual feedback to print (I see packages loading after these first few minutes).
drake is doing something in these minutes.
1) Out of curiosity, what is it doing?
2) is there a visual feedback like a progress bar that might make sense for this first stage?
3) Are there any potential gains in efficiency you could think about that wouldn't compromise robustness?
Did some digging and it seems to be the line:
command_deps <- lapply(plan$command, command_dependencies) in build_graph()
One thing might be to first pull out unique functions and files, then process? You might be doing this already.
@kendonB I was hoping for a huge test case like this!
You may be happy to know that that line of code now reads
command_deps <- lightly_parallelize(
plan$command, command_dependencies, jobs = jobs)
Since you're not using a Windows machine, you can just set jobs to be whatever you want mc.cores to be so that mclapply() speeds things up. Since you're using the "future_lapply" backend for targets, this will not affect how the meat of your workflow is processed.
To speed things up further, I think we would need a C++-accelerated static code analysis tool to parse commands and look for dependencies (targets/imports). That would be a giant undertaking. Drake uses codetools plus some customizations, and the state of the art is CodeDepends. Neither has compiled code.
Do you think drake should log what it's doing to the console? It might be too much to print one line per target, but I was thinking something like resolve dependencies of imports and resolve dependencies of targets.
You should also check out the new sub-graphing functionality I implemented over the weekend. As long as you don't have too many imports, the interactive visNetwork graph should be sane, even with a 30000-target project. You might even speed things up by using the targets argument to plot_graph().
Your comment here is an interesting thought. Currently, drake does not try to pull out unique code fragments or unique imports to analyze. Intuitively, I don't think having unique values would noticeably increase speed, and unique could take a long time on 30000 targets. On the other hand, I do not have projects as large as yours, so feel free to prove me wrong.
I think that printing a line like resolve dependencies of imports or resolve dependencies of targets makes sense.
Regarding pulling out unique functions, I imagine most large projects are, like mine, just many small chunks of few operations. For example, I have 6 user-written functions generating the 30000 targets. On mine, it would greatly increase speed!
I think I see. In a workflow plan with 30000 targets
## target command
## 1 target1 rnorm(1000)
## 2 target2 rnorm(1000)
...
"rnorm(1000)" should really only be processed once.
You could go finer than that though. Even with commands my_fun(obj1), and my_fun(obj2) you could check the body of my_fun once.
Actually, drake already does that last part just fine. It doesn't waste any time processing duplicate imports. Each code analysis only goes one level deep at a time (the body of one function or the code fragment of one command).
You could also consider caching results so that future runs of make don't do the same work twice? Kind of like memoise but using the disk?
Another potential speed up for usability could be to implement different levels of checking like remake? This is really nice while testing.
Drake does already cache imports, but it always checks for changes. It always needs to check for changes somehow. I would assume storr does not cache items if their hashes have not changed, but that is a question for @richfitz.
What levels of checking would you like to see? If you have custom file targets/imports like report.md and report.Rmd in the basic example, drake uses file modification times to check whether to bother computing a new hash (#4), but that's about it.
I quite liked remakes different levels: "exists" (current if the target exists), "depends" (current if exists and dependencies unchanged), "code" (current if exists and code unchanged) or "all" (current if exists and both dependencies and code unchanged).
Regarding caching, I was referring to the output of command_dependencies which seems different from caching imports?
I did not know remake exposed those levels of checking to the user. To be honest, I still do not think I see the full practical use.
For drake, dependency checking is mostly in current.R and hash.R.
I was not actually sure about caching all the individual command dependencies for every target. I opted for a small manageable dependency hash, thinking it would cut down on storage and time spent caching.
By the way, b18b43c89f91a3b46b37e39d921f0709d62424c1 and beyond has those console messages. Now, more people will be able to guess why build_graph() takes so long.
> load_basic_example()
> make(my_plan)
interconnect 7 items: ld, td, simulate, reg1, my_plan, reg2, dl
interconnect 15 items: 'report.md', small, large, regression1_small, regressi...
import knit
import 'report.Rmd'
...
See this new Stack Overflow post. This seems like such a common problem, and I hope someone has solved it already.
@kendonB, in 87de915a67cc21a3c0bc0b77ef6baf56623cf9be, see lightly_parallelize(), which now calls lightly_parallelize_atomic(). Should avoid duplication of effort in the processing of commands. I posted the general solution here.
I think the last thing on this thread is the possible memoization of processing commands here. memoise with the file system cache is probably the best option. I am a bit reluctant because this memoization cache belongs inside the .drake/ storr cache, and passing the right path to build_graph() would be a pain.
Also, the code to memoise is this:
command_deps <- lightly_parallelize(
plan$command, command_dependencies, jobs = jobs)
It depends on both the commands and the number of jobs. I would want to memoize in a way that ignores jobs.
See r-lib/memoise#54. There is resistance to optionally ignoring function parameters.
I will memoize only if the memoise package handles this use case reasonably well.
I think I have done all I can do on this thread at the moment. Will reopen if/when there are new possibilities or suggestions.
Ran this again and it runs much faster with the parallelization. I noticed that the interconnect message came up twice for the same set of targets. Is this what you expect? It appears first then I see a bunch of import ..., then I see interconnect again, then check ..., then the making happens.
Glad to hear the parallelism was an improvement. I expect your idea of lightly_parallelize_atomic() may have sped things up too (avoid analyzing the same workflow command twice). I do expect interconnect to show up twice: one for imports only, and another for the targets listed in your workflow plan. There are two calls to console_dependencies(), one here and another here. I suppose I should make different console messages for each. Do you see two interconnects for the same make()?
FYI: I just made the distinction between the two interconnect (now connect) logs clearer. Now, one says imports and the other says targets.
load_basic_example()
make(my_plan)
## connect 9 imports: ld, lint_summary, gwd, td, simulate, reg1, my_plan, reg2, dl
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
## check 9 items: 'report.Rmd', knit, summary, suppressWarnings, coefficients, d...
## import 'report.Rmd'
## import knit
## import summary
...
Do you still get import ... messages before the connect ones?
I actually see more than two (names removed):
cache ....
interconnect 42 items: ...
interconnect 24306 items: ...
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
import ... lots
interconnect 42 items: ...
interconnect 24306 items: ...
check 92 items: ...
check 7 items: ...
Note this was before your most recent change. I'm in the middle of building the whole project so I'm not how long it'll be before I can test the latest change.
I understand.
I actually reproduced this on one of my larger projects. Happens with distributed parallelism only. Will explain tonight or tomorrow.
I also see the same behavior when running drake::outdated(my_plan, jobs = 8) so it seems to also happen with vanilla parallelism?
Yes, I am seeing redundant work in outdated() through config(). Currently working on refactoring the internals.
@kendonB you just rooted out major inefficiencies in drake. The fixes are effective as of 29f6d29badd92fc3830f0583c5f274923df591b4. Not only did those extra connect messages go away, but the unit tests became blazing fast! This makes my life so much easier, and I cannot thank you enough! If it sounds good to you, please submit a pull request to add yourself as a contributor above this line in the DESCRIPTION.
No need to add me as a contributor. Thanks for the fixes.