Drake: A new interface for triggers

Created on 20 Jul 2018 · 14Comments · Source: ropensci/drake

The problem

The current trigger interface is limiting. The choices "any", "always", "depends", "command", and "file" miss important use cases. For example, what if we want to check some remote data before deciding whether to build a target? For most projects, we would need to create a special upstream target with the fingerprint of the data and then have the real target depend on that fingerprint.

Proposal

It would be better to let users write arbitrary R code for each trigger.

drake_plan(
  target1 = {
    get_data(file_in("data.csv"))
    trigger(
      command = FALSE,
      logical = today() == "Tuesday"
    )
  }
  target2 = trigger(custom = remote_hash()),
  target3 = {
    dplyr::bind_rows(target1, tail(target2))
    trigger(
      file = FALSE,
      condition = is_small_enough(target1)
    )
  },
  target4 = {
    summary(target3)
    trigger(
      depends = TRUE,
      change = head(target3, 1)
    )
  }
)

Proposed trigger() function:

#' @title Set the trigger of a target.
#' @description Use this function inside a target's command
#'   in your [drake_plan()]. The target will rebuild if and only if:
#'   - Any of `command`, `depends`, or `file` is `TRUE`, or
#'   - `condition` evaluates to `TRUE`, or
#'   - `change` evaluates to a value different from last time.
#' @export
#' @seealso [drake_plan()], [make()]
#' @return a list of trigger specification details that
#'   `drake` processes internally when it comes time to decide
#'   whether to build the target.
#' @param command logical, whether to rebuild the target if the
#'   [drake_plan()] command changes.
#' @param depends logical, whether to rebuild if a
#'   non-file dependency changes.
#' @param file logical, whether to rebuild the target
#'   if a [file_in()]/[file_out()]/[knitr_in()] file changes.
#' @param condition R code (expression or language object)
#'   that returns a logical. The target will rebuild
#'   if the code evaluates to `TRUE`.
#' @param change R code (expression or language object)
#'  that returns any value. The target will rebuild
#'   if that value is different from last time
#'   or not already cached.
trigger <- function(
  command = TRUE,
  depends = TRUE,
  file = TRUE,
  condition = NULL,
  change = NULL
){
  stopifnot(is.logical(command))
  stopifnot(is.logical(depends))
  stopifnot(is.logical(file))
  list(
    command = command,
    depends = depends,
    file = file,
    condition = rlang::enexpr(condition),
    change = rlang::enexpr(change)
  )
}

People so far

@noamross brought up a motivating scenario in #252. @AlexAxthelm also has a good idea of what is needed because of his drake-powered work with databases. And of course, @krlmlr had the vision behind some of the best parts of the interface, including #232.

Please let me know if I am forgetting anyone. If possible, it would be nice to have input from affected power users.

advanced priority api new feature

Source

wlandau

👍1

Most helpful comment

I'm not too sure about the syntax. I remember a discussion about a drake_command() function that returns a one-row tibble.

drake_plan(
  target1 = drake_command(
    get_data(file_in("data.csv")),
    trigger(
      command = FALSE,
      logical = today() == "Tuesday"
    )
  ),
  ...
)

krlmlr on 21 Jul 2018

👍2

All 14 comments

Also cc @kendonB.

wlandau on 21 Jul 2018

👍1

I'm not too sure about the syntax. I remember a discussion about a drake_command() function that returns a one-row tibble.

drake_plan(
  target1 = drake_command(
    get_data(file_in("data.csv")),
    trigger(
      command = FALSE,
      logical = today() == "Tuesday"
    )
  ),
  ...
)

krlmlr on 21 Jul 2018

👍2

I like the idea and think @krlmlr's syntax is the tidiest. My only concern would be that introducing a change like this might make large plans very slow to start making.

kendonB on 21 Jul 2018

I like Kirill's approach too, mainly because it will be SO much easier to implement. I won't need to complicate the code analysis or sensitive internal bookkeeping. Also, there is a clear separation between the command and the trigger, which is great for users.

The target() function is the one that returns the 1-row tibble, and it works inside and outside of drake_plan().

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
drake_plan(
  x = target(
    command = 1 + 1,
    trigger = "always"
  )
)
#> # A tibble: 1 x 3
#>   target command trigger
#> * <chr>  <chr>   <chr>  
#> 1 x      1 + 1   always
target(
  command = 1 + 1,
  trigger = "always"
)
#> # A tibble: 1 x 2
#>   command trigger
#>   <chr>   <chr>  
#> 1 1 + 1   always

Question: should we detect dependencies from the trigger column of the drake_plan()? On the one hand, like @AlexAxthelm said in our conversation on Thursday, we may need some targets to be up to date before a custom trigger can be evaluated properly. On the other hand, if triggers can change the dependency structure of the workflow, targets could fall out of date from changes to the condition or change argument to trigger(). At this point, I would prefer to

Not add any more dependency detection, and
In the manual, caution users about using targets in trigger() and encourage to also mention them in the command so they register as dependencies.

wlandau on 21 Jul 2018

If we go Kirill's route (which I suggest we do), here are some implementation to-do's. These are done on the master branch:

[x] As implemented right now, the user needs to explicitly name all arguments of target(). Going forward, we should assume the first unnamed argument is the command unless a command is given later in the argument list.
[x] Evaluate wildcards in all custom columns, including the trigger column, in evaluate_plan().

And the rest will only exist on the i473 branch until it is ready to merge.

[x] Add and test the trigger() function.
[x] Detect dependencies from plan$trigger.
[x] In should_build_target(), evaluate the trigger() call and make the appropriate build decision.
- [x] Ensure the triggers can be resolved whether the trigger column has characters or language objects.
- [x] In triggers.R, implement functions condition_trigger() and value_trigger() to activate those respective triggers.
[x] Deprecate (with a helpful warning) the old character triggers and convert them to the new ones. For example, the "always" trigger becomes trigger(condition = TRUE).
[x] Clean up the internal remnants of the old trigger interface:
- [x] Remove functions in triggers.R that are no longer needed.
- [x] Remove checks about legal triggers.
[x] Write the following unit tests.
- [x] For each trigger, all positive reactions are correct.
- [x] For each trigger, all negative reactions are correct.
- [x] make() brings targets up to date regardless of the trigger.
- [x] Triggers can be expressions.
- [x] Triggers in the plan override the trigger argument of make().
- [x] Old triggers are deprecated and converted to the new ones.

wlandau on 21 Jul 2018

And of course, we'll want to elaborate on triggers, custom columns, and target() in the drake_plan() chapter of the manual.

wlandau on 21 Jul 2018

@kendonB, Kirill's approach will definitely be at least as fast to execute as what I had in mind, probably faster because there is less static code analysis. If we check the condition trigger first and the value trigger last, I do not think there will be a loss in performance during make().

wlandau on 21 Jul 2018

You could change the interface of target(...) to target(.command, ...) ; then the usual R argument matching kicks in.

Just thinking out loud. I'm slightly worried about the added complexity, especially in the light of brittle dependency detection for trigger conditions. I'd be very much interested in a "bare bones" version of _drake_ that is restricted to evaluating commands and scheduling jobs, given the dependency graph and perhaps a run time estimate. Maybe this can be made available as a lower-level interface inside _drake_, or as a separate package? Complex triggers could then be implemented by automatically adding intermediate helper targets as necessary. (Error and progress messages would use artificial target names, though. Pretty sure there are other complications I'm not aware of.)

krlmlr on 21 Jul 2018

A bare bones job scheduler is what I was going for with workers, but I have still not figured out a way to cleverly disentangle the job scheduling internals from everything else. I am continuing thinking about it, but I consider it very long term.

As for triggers, my original proposal would have added clutter. But with your idea, I feel there is a lot of gain and not much added complexity.

wlandau on 22 Jul 2018

Changing my mind about detecting dependencies from triggers. I think we should analyze the code of plan$trigger just like plan$command. Reasons:

The behavior more self-consistent.
It avoids surprises. Users expect drake to magically detect dependencies, and it should do so.
Yes, this could potentially make workflows more brittle, but only in a select few edge cases. There are so few times when users will really need the condition or change argument of trigger().

wlandau on 22 Jul 2018

I'll chime in that having plan$trigger be analyzed the same way as the command would also be preferable in my mind, since I will probably be writing custom trigger functions (sampling from remote databases, to check if contents have updated), and I would want drake to know if I've changed my function.

AlexAxthelm on 23 Jul 2018

I think I've got you covered, Alex. The way development is going, if you have a target x with trigger(condition = f()), drake will make sure f() is checked and processed before building x. By default, it will also rebuild x whenever f() changes because drake does not distinguish between the dependencies of the command and those of the trigger. It is a useful simplifying assumption that greatly reduces the complexity of the implementation, but it may not be exactly what you want. To decrease sensitivity, there are times when you might want trigger(condition = f(), depends = FALSE).

wlandau on 23 Jul 2018

🎉1

Implemented via #478. The tests are very thorough, and it's on the master branch now. Let's keep iterating. I do not plan to submit to CRAN again until around August 10.

wlandau on 24 Jul 2018

@AlexAxthelm, please take a look at #480. I think the dependencies of triggers should be checked and processed beforehand, but I do not think changes to trigger-only dependencies should necessarily cause commands to rerun. Does the interface still do what you want?

wlandau on 24 Jul 2018

Was this page helpful?

0 / 5 - 0 ratings