Drake: Meta-make

Created on 7 Mar 2018 · 32Comments · Source: ropensci/drake

I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL.

The proposal is too long for an issue (and issues aren't version-controlled), I've added it to the wiki: https://github.com/ropensci/drake/wiki/Meta-make. Perhaps a 5 to 10 minute read.

I haven't closely followed previous related discussions, there might be overlap I'm not aware of. In particular, #79 might be a similar approach. Sharing this early for feedback and discussion, CC @AlexAxthelm @dapperjapper @rkrug @kendonB.

advanced api

Source

krlmlr

👍3

Most helpful comment

meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.

Ah, so make() and meta_make() are fundamentally different anyway.

The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.

Sure, that's fine. If spread() and gather() are not the right words, I think pack() and unpack() are great.

wlandau on 8 Mar 2018

👍2

All 32 comments

I like the idea of having a wiki page to discuss possible verbs.

I am not to clear about your verbs - could you please add the tibble / plan you would expect at the end?

rkrug on 7 Mar 2018

Thanks for your feedback. I updated the wiki and pinged you in the commit message, would be interested to learn if you received a notification. Please let me know if things are clearer now.

krlmlr on 7 Mar 2018

Thanks. Makes things much easier.

The pack() and unpack() sound extremely useful.

Looking at meta_plan(), it would address one question I raised at the workshop: the iterative plans. But now there might be one important issue coming in: the uniqueness of target names. Unique to cache, directory, or plan?

meta_plan() would be extremely powerful, but also very error prone due to the uniqueness of the target names. Don't get me wrong - I would love a feature like that, but there needs to be some mechanism to deal with the same target names in the meta_plan - maybe adding a suffix to the target names of the meta-plans?

rkrug on 7 Mar 2018

In the classical make, the target name has to be unique to the makefile - and I think drake should really aim at achieving the same.

rkrug on 7 Mar 2018

Done adding reprexes. Switching to rmarkdown would have been smarter ;-)

I'd suggest that drake just refuses to run plans with duplicate names, just the way it works currently. That feels like the safest option.

Iterative plans will be covered by the DSL. This is just about low-level tooling to make the DSL possible.

krlmlr on 7 Mar 2018

@krlmlr OK - so meta_plan() does not know anything about the main plan - makes perfect sense.

But than the creation of the plan should include the check, not the make(). If make() gets a plan with duplicate target names, it should refuse to run it - that is obvious. B

But the error message should come during creation of the plan in the drake_plan() step. It is a helper function anyway, to populate a simple tibble / data.frame.

rkrug on 7 Mar 2018

Yes, meta_plan() will usually receive a tracked target, or something computed from tracked targets, as argument.

krlmlr on 7 Mar 2018

Thanks so much, Kirill! I read your proposal, but it will take me more time to come up with feedback.

wlandau-lilly on 7 Mar 2018

👍1

I'm really liking the modularity of the design here, and I want to see if I'm understanding correctly. If I have a problem along the lines of "take a big data set, split by an unknown set of factors, and run a arbitrart code on each" (something for which I would use dplyr::do), woudl that be something along the lines of:

drake_plan(
  data = generic_data_importing_function(file_in("data.csv")),
  factor_levels = unique(pull(data, "factor_column") ),
  meta_plan = drake_plan(
    this_factor = unpack(factor_levels),
    data_subset = filter(data, factor_column == this_factor),
    analysis = analysis_function(data_subset)
  ),
  results = meta_make(meta_plan)
)

or would this require some sort of 4th verb (along the lines of a drake_foreach)?

AlexAxthelm on 7 Mar 2018

👍1

Thanks. I don't intend to change the semantics of drake_plan() for this proposal. Your analysis might work with the following plan:

drake_plan(
  data = generic_data_importing_function(file_in("data.csv")),
  splits = data %>% nest(-factor_column) %>% deframe(),
  meta_plan = tibble(
    target = paste0("analyzed_", names(splits)),
    command = paste0("analysis_function(splits[[\"", names(splits), "\"]])")
  ),
  results = meta_make(meta_plan)
)

This will create targets of the form analyzed_xxx in the meta_plan, where xxx come from the factor levels. If you'd like to test the plan on a real example (iris?), you can double-check the embedded meta_plan. Does this make sense?

The targets in meta_plan will only be available as components of results, they are only brought into the main plan with unpack(), if necessary.

The command to create meta_plan is utterly awkward, that's where a DSL might help.

krlmlr on 7 Mar 2018

👍1

Should make() just be "meta" natively? I have played around with

plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = drake::make(meta_plan)
)
make(plan)

It fails because target a could not find its dependency x. Solving this problem might be simpler and cleaner than implementing new meta_make(), especially since meta_make() may need to deal with self-reference anyway.

plan1 <- drake_plan(
  x1 = 1,
  results1 = make(plan1),
  plan1 = drake_plan(
    x2 = x1,
    results2 = make(plan2),
    plan2 = drake_plan(
      x3 = x2,
      results3 = make(plan3),
      plan3 = drake_plan(
        ...
      )
    )
  )
)

In any case, I do like where this is headed. The "meta" piece will make the hardest parts of the DSL possible.

wlandau-lilly on 8 Mar 2018

I think pack() is similar to gather_plan(). pack() and unpack() are great choices of words, but I also wonder if the standard gather() and spread() would be more recognizable.

wlandau-lilly on 8 Mar 2018

Thanks. I'd rather give an informative error message if make() is called from within a plan. We can see later if we can let make() converge towards meta_make(), or the other way round.

meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.

The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.

krlmlr on 8 Mar 2018

meta_make() is declarative only, it returns a data structure that contains the new plan but doesn't execute anything. Therefore, nesting shouldn't be a problem. There should be only one scheduler, even when some plans are constructed dynamically as part of the execution of the main plan.

Ah, so make() and meta_make() are fundamentally different anyway.

The way spread() and gather() are defined in tidyr seems to be fairly different from the intended semantics of pack() and unpack(). I'm entirely open to other names.

Sure, that's fine. If spread() and gather() are not the right words, I think pack() and unpack() are great.

wlandau on 8 Mar 2018

👍2

I think it would be nice to unify the making interface under one function make, but I agree with krlmlr that they serve different purposes for now...

I'm confused about the return value of meta_make(). What does the data structure look like? When do the meta-targets (?) get made? And how might one loadd a meta-target?

Maybe I need some clarity around what declarative means and how this affects working with drake targets both within plans and outside of them.

Going off of your analogy in the wiki, meta_make targets return a list of made targets when loadded. Thus you would access meta-targets like drake::readd(results)$a. However, with the targets I use, it would use wayyy too much memory to load every target within a meta_make just to access one of them.

dapperjapper on 8 Mar 2018

@dapperjapper: Thanks. I've added an implementation sketch to the wiki (https://github.com/ropensci/drake/wiki/Meta-make#implementation-sketch), and also tweaked a bit -- see differences. Does this help answer your questions?

krlmlr on 8 Mar 2018

👍1

If we follow the implementation sketch, it feels easiest to start with pack(), because meta_make() will apply very similar ideas (or maybe even call a version of pack() under the hood), and unpack() alone doesn't feel that useful without meta_make().

krlmlr on 8 Mar 2018

👍1

@krlmlr I am confused a little bit now - if meta_make() does not execute anything and returns a plan, what is than the difference to drake_plan()? If there is a difference (I assume the execution in an isolated environment), why not simply add an argument subplan = TRUE to drake_plan() which is FALSE by default and can be set when used in this context here?

rkrug on 9 Mar 2018

@rkrug: Great question. I think we want to give users a way to create "sub-plans" from home-grown tibbles created outside of drake_plan(), as in

meta_make(tibble(
  target = ...
  command = ...
))

krlmlr on 9 Mar 2018

So meta_make() would take a tibble (which might be homegrown or created by drake_plan() and turns it into a sub-plan - correct? But in then why is it called _make, as it is doing the same as drake_plan() - wouldn't drake_subplan() than be the better word?

Actually - do we really need this? Isnt't it possible to simply use drake_plan() inside drake_plan() which than automatically creates a sub-plan? I mean a n argument subplan could be set automatically when the function is called iteratively?

rkrug on 9 Mar 2018

I want to implement this in a transparent way, using as little magic as possible. I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input.?

I'm open to other names for the meta_make() operation, but it should match the effect, which is that it effectively executes commands defined dynamically when used inside make(). (At least this is the perceived effect; the real effect is that it creates a proxy object that is interpreted differently from normal targets by make(). It feels that we need to start writing documentation or tutorials before we can really come up with a good way to teach/communicate this idea, and the naming choice could be based on that.)

krlmlr on 9 Mar 2018

I don't see how meta_make() does the same as drake_plan() does, it takes a drake plan as input?

True - but the output is identical to a plan, only that it has an attribute sub-plan?

....[quite a bit of thinking in the background ....]

Ahhhh - I think I see what you are aiming at. This would make the assembling of complex plans much easier, as they can be split into sub-plans and assembled at a later stage. OK - that is a different thing (and you are right - we have to talk about the naming at the later stage).

Using these sub-plans is very powerful, as you say, but one question needs to be addressed, which is the naming of the targets.

target names have to be unique within each plan, within each sub-plan (which is obvious), but I think a sub-plan in a plan should be able to have the same names.

plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    x = x * 2,
    y = y / x,
    z = p(x)
  ),
  results = meta_make(meta_plan)
)

should be possible. The targets would be x, y, results$x, results$y, result$z.
In addition, the x in the subplan should refer to the x in the sub-plan.
While p(x) would refer to the target x in the parent plan.

rkrug on 9 Mar 2018

Thanks. It seems easiest if we impose the restriction that the meta-plan can only access outside targets. We need to think more about scoping issues, and maybe relax that restriction later.

I agree that we can have identical names in different sub-plans, but a subsequent unpack() should throw an error.

krlmlr on 9 Mar 2018

It seems easiest if we impose the restriction that the meta-plan can only access outside targets.

I think that would limit the usefulness quite a bit and lead to the combination of many targets into a single one in the meta-plan. To have be able to have inside the meta-plan a tree structure of dependent targets (inside as well as parent plan) would be, as I see it, the main advantage of a meta-plan. One could encapsulate e.g. in one meta-plan the simulation, cleaning and different analysis of the results, while the plan contains the generation of the parameter, which are used by the meta-plan.
So changing the parameter, would only require changing the (relatively simple) plan, and not the (possibly complex) meta-plan.

rkrug on 9 Mar 2018

I hear you, but I'd prefer to keep complexity as low as possible at this point -- even the simplest solution will take some time and careful thought to implement.

Could you please share a concrete example, maybe a plan that would work for one particular parameter value? We can use this to discuss how this can be implemented based on the current proposal.

krlmlr on 9 Mar 2018

OK - here comes one (I hope this is what you are looking for - if yes, I could throw in the missing function definitions):

Assuming this one function to create a parameter set:

parms <- function(n0,  r) {
  return(
    list(
      n0 = n0,
      r = r,
      K = 100
    )
  )
}

to simulate one has to do:

all1 <- drake_plan(
  parameter = parms(10, 1.5),
  sim = sim(parameter),
  analysis1 = analysis1(sim),
  analysis2 = analysis2(sim),
  compare = comp(analysis1, analysis2)    
)

So if one want's to chage the parameter, one has to edit a longish plan.

In contrast:

simAndAnalysis <- drake_plan(
  sim = sim(parameter),
  analysis1 = analysis1(sim),
  analysis2 = analysis2(sim),
  compare = comp(analysis1, analysis2)    
)

all2 <- drake_plan(
  parameter = parms(10, 1.5),
  result = meta_make(simAndAnalysis)
)

here one only has to edit the all2 plan, while the logic of the simulation and analysis stays untouched, therefore much less errorprone.

But the requirement is to have access to the all the targets.

rkrug on 9 Mar 2018

No need to provide implementations, we can also use functions from cooking ;-)

With the current proposal, implementing this plan with multiple parameters is likely to involve multiple meta_make() targets constructed in a very ugly way:

drake_plan(
  parameters = list(paramA = parms(...), paramB = parms(...), ...),
  sim = meta_make(sim_from_parameters(parameters)),
  analysis1 = meta_make(analysis1_from_sim(sim)),
  analysis2 = meta_make(analysis2_from_sim(sim)),
  compare = meta_make(compare_from_analyses(analysis1, analysis2))
)

sim_from_parameters <- function(parameters) {
  tibble(
    target = paste0("sim_", names(parameters)),
    command = paste0("sim(parameters[[\"", names(parameters), "\"]])")
  )
}

analysis1_from_sim <- function(sim) {
  tibble(
    target = paste0("analysis1_", names(sim)),
    command = paste0("analysis1(sim[[\"", names(sim), "\"]])")
  )
}

...

Only the DSL (#233) aims at providing a nice syntax.

krlmlr on 9 Mar 2018

👍1

We could also use meta-plans to handle knitr_in()! Exploring the dependencies becomes an intermediate step:

drake_plan(
  rmarkdown::render(knitr_in("report.Rmd"), ...)
)

then becomes (through an internal transformation)

  `_knitr_deps_report.Rmd` = analyze_knitr_in("report.Rmd"),
  { pack(`_knitr_deps_report.Rmd`); rmarkdown::render(file_in("report.Rmd"), ...) }

Would that simplify or complicate things?

krlmlr on 22 Mar 2018

Yes, I think this is one of the problems the DSL is meant to solve. One edge case I have been thinking about is if the user has commands to write the Rmd source.

drake_plan(
  write_my_report(); file_out("report.Rmd"),
  rmarkdown::render(knitr_in("report.Rmd"), file_out("report.html"))
)

If report.Rmd does not already exist before make() is called, drake cannot analyze the report for dependencies. And once the first make() creates the report and those dependencies are found, the target "report.html" is invalidated. In other words, both the first and second make()s do stuff. If the DSL's internal transformation is delayed, this should no longer be a problem.

wlandau on 22 Mar 2018

👍1

I wonder, do we still need this nested plan structure? Could a lazy application of the transformation/grouping interface from https://github.com/ropensci/drake/issues/233#issuecomment-454777508 accomplish the dynamic branching we want?

wlandau on 17 Jan 2019

We may come back to meta-make eventually, but I think https://github.com/ropensci/drake/issues/233#issuecomment-454777508 brings us closer to dynamic branching and job groups.

wlandau on 18 Jan 2019

See #685.

wlandau on 18 Jan 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Manual scheduling

krlmlr · 32Comments

Special format for files?

wlandau · 27Comments

Remove all non-clustermq parallel backends?

wlandau · 29Comments

Dynamic branching

wlandau · 61Comments

DSL based on dplyr-like verbs?

krlmlr · 44Comments