I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL.
The proposal is too long for an issue (and issues aren't version-controlled), I've added it to the wiki: https://github.com/ropensci/drake/wiki/Meta-make. Perhaps a 5 to 10 minute read.
I haven't closely followed previous related discussions, there might be overlap I'm not aware of. In particular, #79 might be a similar approach. Sharing this early for feedback and discussion, CC @AlexAxthelm @dapperjapper @rkrug @kendonB.
I like the idea of having a wiki page to discuss possible verbs.
I am not to clear about your verbs - could you please add the tibble / plan you would expect at the end?
Thanks for your feedback. I updated the wiki and pinged you in the commit message, would be interested to learn if you received a notification. Please let me know if things are clearer now.
Thanks. Makes things much easier.
The pack() and unpack() sound extremely useful.
Looking at meta_plan(), it would address one question I raised at the workshop: the iterative plans. But now there might be one important issue coming in: the uniqueness of target names. Unique to cache, directory, or plan?
meta_plan() would be extremely powerful, but also very error prone due to the uniqueness of the target names. Don't get me wrong - I would love a feature like that, but there needs to be some mechanism to deal with the same target names in the meta_plan - maybe adding a suffix to the target names of the meta-plans?
In the classical make, the target name has to be unique to the makefile - and I think drake should really aim at achieving the same.
Done adding reprexes. Switching to rmarkdown would have been smarter ;-)
I'd suggest that drake just refuses to run plans with duplicate names, just the way it works currently. That feels like the safest option.
Iterative plans will be covered by the DSL. This is just about low-level tooling to make the DSL possible.
@krlmlr OK - so meta_plan() does not know anything about the main plan - makes perfect sense.
But than the creation of the plan should include the check, not the make(). If make() gets a plan with duplicate target names, it should refuse to run it - that is obvious. B
But the error message should come during creation of the plan in the drake_plan() step. It is a helper function anyway, to populate a simple tibble / data.frame.
Yes, meta_plan() will usually receive a tracked target, or something computed from tracked targets, as argument.
Thanks so much, Kirill! I read your proposal, but it will take me more time to come up with feedback.
I'm really liking the modularity of the design here, and I want to see if I'm understanding correctly. If I have a problem along the lines of "take a big data set, split by an unknown set of factors, and run a arbitrart code on each" (something for which I would use dplyr::do), woudl that be something along the lines of:
drake_plan(
data = generic_data_importing_function(file_in("data.csv")),
factor_levels = unique(pull(data, "factor_column") ),
meta_plan = drake_plan(
this_factor = unpack(factor_levels),
data_subset = filter(data, factor_column == this_factor),
analysis = analysis_function(data_subset)
),
results = meta_make(meta_plan)
)
or would this require some sort of 4th verb (along the lines of a drake_foreach)?
Thanks. I don't intend to change the semantics of drake_plan() for this proposal. Your analysis might work with the following plan:
drake_plan(
data = generic_data_importing_function(file_in("data.csv")),
splits = data %>% nest(-factor_column) %>% deframe(),
meta_plan = tibble(
target = paste0("analyzed_", names(splits)),
command = paste0("analysis_function(splits[[\"", names(splits), "\"]])")
),
results = meta_make(meta_plan)
)
This will create targets of the form analyzed_xxx in the meta_plan, where xxx come from the factor levels. If you'd like to test the plan on a real example (iris?), you can double-check the embedded meta_plan. Does this make sense?
The targets in meta_plan will only be available as components of results, they are only brought into the main plan with unpack(), if necessary.
The command to create meta_plan is utterly awkward, that's where a DSL might help.
Should make() just be "meta" natively? I have played around with
plan <- drake::drake_plan(
x = 1,
y = 2,
meta_plan = drake_plan(
a = x,
b = y
),
results = drake::make(meta_plan)
)
make(plan)
It fails because target a could not find its dependency x. Solving this problem might be simpler and cleaner than implementing new meta_make(), especially since meta_make() may need to deal with self-reference anyway.
plan1 <- drake_plan(
x1 = 1,
results1 = make(plan1),
plan1 = drake_plan(
x2 = x1,
results2 = make(plan2),
plan2 = drake_plan(
x3 = x2,
results3 = make(plan3),
plan3 = drake_plan(
...
)
)
)
)
In any case, I do like where this is headed. The "meta" piece will make the hardest parts of the DSL possible.
I think pack() is similar to gather_plan(). pack() and unpack() are great choices of words, but I also wonder if the standard gather() and spread() would be more recognizable.
Thanks. I'd rather give an informative error message if make() is called from within a plan. We can see later if we can let make() converge towards meta_make(), or the other way round.
meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.
The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.
meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.
Ah, so make() and meta_make() are fundamentally different anyway.
The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.
Sure, that's fine. If spread() and gather() are not the right words, I think pack() and unpack() are great.
I think it would be nice to unify the making interface under one function make, but I agree with krlmlr that they serve different purposes for now...
I'm confused about the return value of meta_make(). What does the data structure look like? When do the meta-targets (?) get made? And how might one loadd a meta-target?
Maybe I need some clarity around what declarative means and how this affects working with drake targets both within plans and outside of them.
Going off of your analogy in the wiki, meta_make targets return a list of made targets when loadded. Thus you would access meta-targets like drake::readd(results)$a. However, with the targets I use, it would use wayyy too much memory to load every target within a meta_make just to access one of them.
@dapperjapper: Thanks. I've added an implementation sketch to the wiki (https://github.com/ropensci/drake/wiki/Meta-make#implementation-sketch), and also tweaked a bit -- see differences. Does this help answer your questions?
If we follow the implementation sketch, it feels easiest to start with pack(), because meta_make() will apply very similar ideas (or maybe even call a version of pack() under the hood), and unpack() alone doesn't feel that useful without meta_make().
@krlmlr I am confused a little bit now - if meta_make() does not execute anything and returns a plan, what is than the difference to drake_plan()? If there is a difference (I assume the execution in an isolated environment), why not simply add an argument subplan = TRUE to drake_plan() which is FALSE by default and can be set when used in this context here?
@rkrug: Great question. I think we want to give users a way to create "sub-plans" from home-grown tibbles created outside of drake_plan(), as in
meta_make(tibble(
target = ...
command = ...
))
So meta_make() would take a tibble (which might be homegrown or created by drake_plan() and turns it into a sub-plan - correct? But in then why is it called _make, as it is doing the same as drake_plan() - wouldn't drake_subplan() than be the better word?
Actually - do we really need this? Isnt't it possible to simply use drake_plan() inside drake_plan() which than automatically creates a sub-plan? I mean a n argument subplan could be set automatically when the function is called iteratively?
I want to implement this in a transparent way, using as little magic as possible. I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input.?
I'm open to other names for the meta_make() operation, but it should match the effect, which is that it effectively executes commands defined dynamically when used inside make(). (At least this is the perceived effect; the real effect is that it creates a proxy object that is interpreted differently from normal targets by make(). It feels that we need to start writing documentation or tutorials before we can really come up with a good way to teach/communicate this idea, and the naming choice could be based on that.)
I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input?
True - but the output is identical to a plan, only that it has an attribute sub-plan?
....[quite a bit of thinking in the background ....]
Ahhhh - I think I see what you are aiming at. This would make the assembling of complex plans much easier, as they can be split into sub-plans and assembled at a later stage. OK - that is a different thing (and you are right - we have to talk about the naming at the later stage).
Using these sub-plans is very powerful, as you say, but one question needs to be addressed, which is the naming of the targets.
target names have to be unique within each plan, within each sub-plan (which is obvious), but I think a sub-plan in a plan should be able to have the same names.
plan <- drake::drake_plan(
x = 1,
y = 2,
meta_plan = drake_plan(
x = x * 2,
y = y / x,
z = p(x)
),
results = meta_make(meta_plan)
)
should be possible. The targets would be x, y, results$x, results$y, result$z.
In addition, the x in the subplan should refer to the x in the sub-plan.
While p(x) would refer to the target x in the parent plan.
Thanks. It seems easiest if we impose the restriction that the meta-plan can only access outside targets. We need to think more about scoping issues, and maybe relax that restriction later.
I agree that we can have identical names in different sub-plans, but a subsequent unpack() should throw an error.
It seems easiest if we impose the restriction that the meta-plan can only access outside targets.
I think that would limit the usefulness quite a bit and lead to the combination of many targets into a single one in the meta-plan. To have be able to have inside the meta-plan a tree structure of dependent targets (inside as well as parent plan) would be, as I see it, the main advantage of a meta-plan. One could encapsulate e.g. in one meta-plan the simulation, cleaning and different analysis of the results, while the plan contains the generation of the parameter, which are used by the meta-plan.
So changing the parameter, would only require changing the (relatively simple) plan, and not the (possibly complex) meta-plan.
I hear you, but I'd prefer to keep complexity as low as possible at this point -- even the simplest solution will take some time and careful thought to implement.
Could you please share a concrete example, maybe a plan that would work for one particular parameter value? We can use this to discuss how this can be implemented based on the current proposal.
OK - here comes one (I hope this is what you are looking for - if yes, I could throw in the missing function definitions):
Assuming this one function to create a parameter set:
parms <- function(n0, r) {
return(
list(
n0 = n0,
r = r,
K = 100
)
)
}
to simulate one has to do:
all1 <- drake_plan(
parameter = parms(10, 1.5),
sim = sim(parameter),
analysis1 = analysis1(sim),
analysis2 = analysis2(sim),
compare = comp(analysis1, analysis2)
)
So if one want's to chage the parameter, one has to edit a longish plan.
In contrast:
simAndAnalysis <- drake_plan(
sim = sim(parameter),
analysis1 = analysis1(sim),
analysis2 = analysis2(sim),
compare = comp(analysis1, analysis2)
)
all2 <- drake_plan(
parameter = parms(10, 1.5),
result = meta_make(simAndAnalysis)
)
here one only has to edit the all2 plan, while the logic of the simulation and analysis stays untouched, therefore much less errorprone.
But the requirement is to have access to the all the targets.
No need to provide implementations, we can also use functions from cooking ;-)
With the current proposal, implementing this plan with multiple parameters is likely to involve multiple meta_make() targets constructed in a very ugly way:
drake_plan(
parameters = list(paramA = parms(...), paramB = parms(...), ...),
sim = meta_make(sim_from_parameters(parameters)),
analysis1 = meta_make(analysis1_from_sim(sim)),
analysis2 = meta_make(analysis2_from_sim(sim)),
compare = meta_make(compare_from_analyses(analysis1, analysis2))
)
sim_from_parameters <- function(parameters) {
tibble(
target = paste0("sim_", names(parameters)),
command = paste0("sim(parameters[[\"", names(parameters), "\"]])")
)
}
analysis1_from_sim <- function(sim) {
tibble(
target = paste0("analysis1_", names(sim)),
command = paste0("analysis1(sim[[\"", names(sim), "\"]])")
)
}
...
Only the DSL (#233) aims at providing a nice syntax.
We could also use meta-plans to handle knitr_in()! Exploring the dependencies becomes an intermediate step:
drake_plan(
rmarkdown::render(knitr_in("report.Rmd"), ...)
)
then becomes (through an internal transformation)
`_knitr_deps_report.Rmd` = analyze_knitr_in("report.Rmd"),
{ pack(`_knitr_deps_report.Rmd`); rmarkdown::render(file_in("report.Rmd"), ...) }
Would that simplify or complicate things?
Yes, I think this is one of the problems the DSL is meant to solve. One edge case I have been thinking about is if the user has commands to write the Rmd source.
drake_plan(
write_my_report(); file_out("report.Rmd"),
rmarkdown::render(knitr_in("report.Rmd"), file_out("report.html"))
)
If report.Rmd does not already exist before make() is called, drake cannot analyze the report for dependencies. And once the first make() creates the report and those dependencies are found, the target "report.html" is invalidated. In other words, both the first and second make()s do stuff. If the DSL's internal transformation is delayed, this should no longer be a problem.
I wonder, do we still need this nested plan structure? Could a lazy application of the transformation/grouping interface from https://github.com/ropensci/drake/issues/233#issuecomment-454777508 accomplish the dynamic branching we want?
We may come back to meta-make eventually, but I think https://github.com/ropensci/drake/issues/233#issuecomment-454777508 brings us closer to dynamic branching and job groups.
See #685.
Most helpful comment
Ah, so
make()andmeta_make()are fundamentally different anyway.Sure, that's fine. If
spread()andgather()are not the right words, I thinkpack()andunpack()are great.