The current trigger interface is limiting. The choices "any", "always", "depends", "command", and "file" miss important use cases. For example, what if we want to check some remote data before deciding whether to build a target? For most projects, we would need to create a special upstream target with the fingerprint of the data and then have the real target depend on that fingerprint.
It would be better to let users write arbitrary R code for each trigger.
drake_plan(
target1 = {
get_data(file_in("data.csv"))
trigger(
command = FALSE,
logical = today() == "Tuesday"
)
}
target2 = trigger(custom = remote_hash()),
target3 = {
dplyr::bind_rows(target1, tail(target2))
trigger(
file = FALSE,
condition = is_small_enough(target1)
)
},
target4 = {
summary(target3)
trigger(
depends = TRUE,
change = head(target3, 1)
)
}
)
Proposed trigger() function:
#' @title Set the trigger of a target.
#' @description Use this function inside a target's command
#' in your [drake_plan()]. The target will rebuild if and only if:
#' - Any of `command`, `depends`, or `file` is `TRUE`, or
#' - `condition` evaluates to `TRUE`, or
#' - `change` evaluates to a value different from last time.
#' @export
#' @seealso [drake_plan()], [make()]
#' @return a list of trigger specification details that
#' `drake` processes internally when it comes time to decide
#' whether to build the target.
#' @param command logical, whether to rebuild the target if the
#' [drake_plan()] command changes.
#' @param depends logical, whether to rebuild if a
#' non-file dependency changes.
#' @param file logical, whether to rebuild the target
#' if a [file_in()]/[file_out()]/[knitr_in()] file changes.
#' @param condition R code (expression or language object)
#' that returns a logical. The target will rebuild
#' if the code evaluates to `TRUE`.
#' @param change R code (expression or language object)
#' that returns any value. The target will rebuild
#' if that value is different from last time
#' or not already cached.
trigger <- function(
command = TRUE,
depends = TRUE,
file = TRUE,
condition = NULL,
change = NULL
){
stopifnot(is.logical(command))
stopifnot(is.logical(depends))
stopifnot(is.logical(file))
list(
command = command,
depends = depends,
file = file,
condition = rlang::enexpr(condition),
change = rlang::enexpr(change)
)
}
@noamross brought up a motivating scenario in #252. @AlexAxthelm also has a good idea of what is needed because of his drake-powered work with databases. And of course, @krlmlr had the vision behind some of the best parts of the interface, including #232.
Please let me know if I am forgetting anyone. If possible, it would be nice to have input from affected power users.
Also cc @kendonB.
I'm not too sure about the syntax. I remember a discussion about a drake_command() function that returns a one-row tibble.
drake_plan(
target1 = drake_command(
get_data(file_in("data.csv")),
trigger(
command = FALSE,
logical = today() == "Tuesday"
)
),
...
)
I like the idea and think @krlmlr's syntax is the tidiest. My only concern would be that introducing a change like this might make large plans very slow to start making.
I like Kirill's approach too, mainly because it will be SO much easier to implement. I won't need to complicate the code analysis or sensitive internal bookkeeping. Also, there is a clear separation between the command and the trigger, which is great for users.
The target() function is the one that returns the 1-row tibble, and it works inside and outside of drake_plan().
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
drake_plan(
x = target(
command = 1 + 1,
trigger = "always"
)
)
#> # A tibble: 1 x 3
#> target command trigger
#> * <chr> <chr> <chr>
#> 1 x 1 + 1 always
target(
command = 1 + 1,
trigger = "always"
)
#> # A tibble: 1 x 2
#> command trigger
#> <chr> <chr>
#> 1 1 + 1 always
Question: should we detect dependencies from the trigger column of the drake_plan()? On the one hand, like @AlexAxthelm said in our conversation on Thursday, we may need some targets to be up to date before a custom trigger can be evaluated properly. On the other hand, if triggers can change the dependency structure of the workflow, targets could fall out of date from changes to the condition or change argument to trigger(). At this point, I would prefer to
trigger() and encourage to also mention them in the command so they register as dependencies.If we go Kirill's route (which I suggest we do), here are some implementation to-do's. These are done on the master branch:
target(). Going forward, we should assume the first unnamed argument is the command unless a command is given later in the argument list.trigger column, in evaluate_plan().And the rest will only exist on the i473 branch until it is ready to merge.
trigger() function.plan$trigger.should_build_target(), evaluate the trigger() call and make the appropriate build decision.trigger column has characters or language objects.triggers.R, implement functions condition_trigger() and value_trigger() to activate those respective triggers.trigger(condition = TRUE).triggers.R that are no longer needed.make() brings targets up to date regardless of the trigger.trigger argument of make().And of course, we'll want to elaborate on triggers, custom columns, and target() in the drake_plan() chapter of the manual.
@kendonB, Kirill's approach will definitely be at least as fast to execute as what I had in mind, probably faster because there is less static code analysis. If we check the condition trigger first and the value trigger last, I do not think there will be a loss in performance during make().
You could change the interface of target(...) to target(.command, ...) ; then the usual R argument matching kicks in.
Just thinking out loud. I'm slightly worried about the added complexity, especially in the light of brittle dependency detection for trigger conditions. I'd be very much interested in a "bare bones" version of _drake_ that is restricted to evaluating commands and scheduling jobs, given the dependency graph and perhaps a run time estimate. Maybe this can be made available as a lower-level interface inside _drake_, or as a separate package? Complex triggers could then be implemented by automatically adding intermediate helper targets as necessary. (Error and progress messages would use artificial target names, though. Pretty sure there are other complications I'm not aware of.)
A bare bones job scheduler is what I was going for with workers, but I have still not figured out a way to cleverly disentangle the job scheduling internals from everything else. I am continuing thinking about it, but I consider it very long term.
As for triggers, my original proposal would have added clutter. But with your idea, I feel there is a lot of gain and not much added complexity.
Changing my mind about detecting dependencies from triggers. I think we should analyze the code of plan$trigger just like plan$command. Reasons:
drake to magically detect dependencies, and it should do so.condition or change argument of trigger().I'll chime in that having plan$trigger be analyzed the same way as the command would also be preferable in my mind, since I will probably be writing custom trigger functions (sampling from remote databases, to check if contents have updated), and I would want drake to know if I've changed my function.
I think I've got you covered, Alex. The way development is going, if you have a target x with trigger(condition = f()), drake will make sure f() is checked and processed before building x. By default, it will also rebuild x whenever f() changes because drake does not distinguish between the dependencies of the command and those of the trigger. It is a useful simplifying assumption that greatly reduces the complexity of the implementation, but it may not be exactly what you want. To decrease sensitivity, there are times when you might want trigger(condition = f(), depends = FALSE).
Implemented via #478. The tests are very thorough, and it's on the master branch now. Let's keep iterating. I do not plan to submit to CRAN again until around August 10.
@AlexAxthelm, please take a look at #480. I think the dependencies of triggers should be checked and processed beforehand, but I do not think changes to trigger-only dependencies should necessarily cause commands to rerun. Does the interface still do what you want?
Most helpful comment
I'm not too sure about the syntax. I remember a discussion about a
drake_command()function that returns a one-row tibble.