This one got lost in a previous thread.
I quite liked remake's different levels of checking: "exists" (current if the target exists), "depends" (current if exists and dependencies unchanged), "code" (current if exists and code unchanged) or "all" (current if exists and both dependencies and code unchanged).
The main use case I see is while testing or initially developing a project. My current project takes around 10 minutes to determine if it's up to date, but all I'm really trying to determine as I develop is if all the targets have been built at all, since I know there haven't been any upstream changes to code or data inputs. Having an option to lighten the level of checking would help development a lot for projects that have either large data or a large number of targets.
For now, does built() cover the main use case? (See also cached() and imported()). That will tell you if the targets are built at all, and all it does is read from the cache.
I think the best way to solve this one is through #129. When we spread out the levels of the cache, we will also refactor how targets are checked. In my opinion, that is the time to expose more user-side functionality. Does that sound reasonable? Thank you for speaking up.
I can see the distinction between the "depends" and "code" checking. You may want to know why your targets are out of date. If it is because of an accidental change to a workflow plan command, you have the chance to correct it before a target is incorrectly built (and possibly mangled).
Regarding whether built solves this, I think there's still a use for exists checking when you want to quickly build remaining targets and you're sure that inputs haven't changed.
There are also cases where users can rule out certain inputs ever changing and it would be nice to be able to allow drake to skip checking those inputs.
Ah, I think I get it now. These are different build criteria for make(), not just user-side inspection tools. I am fully on board now, but it will have to wait until #129 (which I have also decided I will also fix after the next CRAN release).
On a further look, accommodating exists will be harder than I thought. Early on for each parallelizable stage, before any build decisions are made, drake retrieves or computes all the hashes of the targets and imports in that stage. For speed, it is important to get the hashes for the stage all together, and it is important that they all stay in memory. For exists, we would need to skip hashing and avoid storing the dependency hashes.
@kendonB what about an optional TRUE/FALSE "rush" argument to make()? It was thinking it could remove imports and existing targets from the igraph network and then build the rest normally. Otherwise, I think we should leave the hashing alone because targets built with make(..., rush = TRUE) should count as up to date. After #129 is solved, we can think about a user-side function that explains exactly why a given target is listed as out of date. (Is it the command? If not, which one of the dependencies?) I sorely needed something like this in the early days of debugging, and I will probably need it again to debug any work on #129.
The above would be equivalent to setting check = exists for all the targets? That would cover the main use case for this feature, yes. However, since the user can rule out certain types of changes in many cases, it would still be useful to have target-level flexibility in the level of checking.
Looking at an example remake YAML file using the check field, I am not sure about this feature. Given drake's existing internals, the most time-saving changes would probably be hacks to hash_list() and supporting functions. If we're setting the check level on a per-target basis, it would also probably belong as another column in the workflow plan data frame. But the workflow plan has targets only, which means we would be ignoring the large imported files that would benefit the most from check equal to "exists". I am not saying no, but I am saying this feature needs a clean implementation, and we should think about the implementation details together if it is going to become part of drake. I may even need to leave it to you to submit a pull request.
I don't see why you say that a target-level flag ignores large imported files? Any target that depends on the large input could be set to check = exists and the large imported file would be ignored. This seems to be the behaviour of remake. Perhaps you were interpreting check = "exists" to mean that the target still updates if dependencies change?
Regardless, I agree that this is a difficult change and should be done carefully. I won't have time to dig into this for several months but may revisit it then.
Also, I think the rush = TRUE idea is excellent and could easily be made backward compatible with a finer system in the future.
I see. It does not seem too hard to make sure any check = exists targets have no imported dependencies. And I think we are in agreement: the exists mode should treat dependencies only as a means to arrange the jobs, not a way to decide whether to build.
For the rush option: maybe a function called rush() that calls make() from within with_mock(). That way, we can substitute in versions of hash_list() and dependencies() that cut out all the imports and pre-build hashing. Maybe we could use with_mock() to help with remake-style levels of checking.
For compatibility with regular make()s later on, we might want to go back and cache the "depends" hashes, and that may not be possible without imported objects. Maybe rush() will not truly bring targets up to date as far as drake is concerned.
Hmm... maybe with_mock() is not so appropriate. But I think we could have different versions of hashes() for different check levels. It is already a successful design pattern: we have different run_*() functions, and the appropriate one is invoked with run_parallel_backend():
run_parallel_backend <- function(config){
get(
paste0("run_", config$parallelism),
envir = getNamespace("drake")
)(config)
}
I would also add a "file" level of checking to just react to any output target files outside the cache that have changed. Unlike remake, drake checks for modifications in file targets (such as 'report.md' in the basic example) and repairs them if applicable. It would be difficult to accommodate imported files this way, however.
Also, I like the term "trigger" more than "check level". It seems a bit clearer.
Not sure I agree with that. "check" pairs with "exists" while "trigger" pairs with "does not exist". But there's probably some pairing of terms which is clearer.
I agree that the terms need to be consistent. Maybe trigger or build could be paired with "missing", etc. Specific word choice aside, I find it easier to think about conditions for building than conditions for skipping. Also, drake already has a function called check() that is totally unrelated to this.
Here are the design specs so far. For now, I am going with the term "trigger" and choosing slightly different trigger names that make sense. The result will not look exactly like remake, but with a "trigger" column next to the "command" column in the workflow plan data frame, I think the meaning will be clear.
sanitize_plan() to check that all triggers are valid or the "trigger" column is missing.rush argument to make() will overwrite the trigger column with a character vector with all entries "missing".any trigger will apply everywhere in that case (see below).missing:file:depends: At least one dependency target/import changed (dependency hash), or the target itself is missing.command: workflow plan command changed or the target itself is missing.any: any of the above triggersigraphdepends or any.build_graph() after the following:command_deps <- lightly_parallelize(
plan$command, command_dependencies, jobs = jobs)
names(command_deps) <- plan$target
meta() function (previously hashes()) should depend on the trigger type. Skip some metadata/hashes for some trigger types, especially missing. Otherwise, time is wasted checking hashes, etc.build_in_hook() (after the decision to build has actually been made).target_current() to should_build_target().should_build_target() will walk though the target metadata (from meta()) and evaluate the triggers for which metadata is available.build_in_hook().dbug() example.con <- make() with the missing trigger. Throughout, use the testing scenario from get_testing_scenario().outdated(config = con).intermediatefile.rds target and test that all triggers detect the move.intermediatefile.rds and check that everything is up to date with all triggers.intermediatefile.rds. The depends, command, and missing triggers should ignore it. Only the file and any triggers should show it out of date.make(..., rush = TRUE) and verify that nothing was built.make(..., rush = TRUE). Verify that intermediatefile.rds was built.intermediatefile.rds. Verify that only the command and all triggers notice. Restore the original command.intermediatefile.rds with clean(). Verify that only the depends and any triggers notice.clean() everything and make(rush = TRUE). Verify that no imports were cached.This all sounds superb. I don't suppose it's also possible to have a "code" trigger which would check all the upstream code dependencies of the command? Or is that your vision of the command option?
@kendonB unless I misunderstand, I think you are describing the depends trigger. Or are you looking for a trigger that only bothers to check upstream dependencies when the command has changed? Such a trigger would require both the command and depends triggers to fire, and it would only look one level back in the DAG.
I am open to new triggers, but I would like to keep the existing command and depends triggers orthogonal.
Implementation went much faster than I expected (see the issue131 branch). Remaining in this issue:
As a side note, drake is accumulating a large number of vignettes, which makes me concerned about eventually reaching the 5MB limit for packages. I would like to publish a website, but my company's policies would almost certainly not allow me to be the one do it. (It was a long, hard fight just to change the rules so we could do ordinary open source.)
Just merged to master, but still needs a vignette at the very least.
By the way: if you pick triggers, drake will remind you what you picked in the console messages:
make(my_plan)
...
target small: trigger "any"
target large: trigger "command"
...
Wait: @kendonB, did you originally suggest a trigger that watches the commands of upstream dependencies? I am less sure about that one. Useful, yes, but it would require searching the entire incoming subgraph of targets for each target triggered this way, and that would be extremely slow to execute. Better to rely on drake's existing way to check dependencies so you only have to search adjacent incoming nodes.
See #137: will subsume triggers into a new vignette on debugging.
We can contemplate new triggers, but the heart of this issue is solved. Closing for now. The remaining documentation will be covered in #137. In the meantime, see ?triggers.
FYI: I changed the logical rush argument to a trigger argument where you can specify a global trigger (overridden by the trigger column of the workflow plan if present). It's all documented in a section of the new "debug" vignette.
@kendonB, I am really glad you pushed me to make this happen. There is now a lower barrier of entry for developing and testing reproducible workflows.
It just occurred to me: with the inception of make(..., trigger = "missing"), maybe we can advertise drake as a powerful job scheduler without always getting into the reproducibility.
@wlandau-lilly, drake still seems to be engaging in lots of unnecessary checking with trigger = "missing". I see no speed benefit from using this option:
library(drake)
load_basic_example()
my_graph <- build_graph(my_plan)
#> connect 4 imports: my_plan, reg2, reg1, simulate
#> connect 15 targets: 'report.md', small, large, regression1_small, regression1...
make(my_plan, targets = my_plan$target[1], trigger = "missing", graph = my_graph)
#> cache C:/Users/kendonb/AppData/Local/Temp/RtmpAl3Oce/.drake
#> check 7 items: 'report.Rmd', knit, coefficients, lm, data.frame, rpois, stats...
#> import 'report.Rmd'
#> import knit
#> import coefficients
#> import lm
#> import data.frame
#> import rpois
#> import stats::rnorm
#> check 2 items: reg2, simulate
#> import reg2
#> import simulate
#> check 2 items: small, large
#> check 1 item: regression2_small
#> check 1 item: coef_regression2_small
#> check 1 item: 'report.md'
#> Used non-default triggers. Some targets may be not be up to date.
system.time(make(my_plan, targets = my_plan$target[1], trigger = "missing", graph = my_graph))
#> cache C:/Users/kendonb/AppData/Local/Temp/RtmpAl3Oce/.drake
#> check 7 items: 'report.Rmd', knit, coefficients, lm, data.frame, rpois, stats...
#> import 'report.Rmd'
#> import knit
#> import coefficients
#> import lm
#> import data.frame
#> import rpois
#> import stats::rnorm
#> check 2 items: reg2, simulate
#> import reg2
#> import simulate
#> check 2 items: small, large
#> check 1 item: regression2_small
#> check 1 item: coef_regression2_small
#> check 1 item: 'report.md'
#> Used non-default triggers. Some targets may be not be up to date.
#> user system elapsed
#> 1.05 0.31 1.81
system.time(make(my_plan, targets = my_plan$target[1], graph = my_graph))
#> cache C:/Users/kendonb/AppData/Local/Temp/RtmpAl3Oce/.drake
#> check 7 items: 'report.Rmd', knit, coefficients, lm, data.frame, rpois, stats...
#> import 'report.Rmd'
#> import knit
#> import coefficients
#> import lm
#> import data.frame
#> import rpois
#> import stats::rnorm
#> check 2 items: reg2, simulate
#> import reg2
#> import simulate
#> check 2 items: small, large
#> check 1 item: regression2_small
#> check 1 item: coef_regression2_small
#> check 1 item: 'report.md'
#> All targets are already up to date.
#> user system elapsed
#> 0.89 0.28 1.59
For the missing trigger, drake really is only checking to see that the target exists (with hashing postponed until the target actually needs to be built). This in itself does not boost speed unless you have large corrupted file targets or a large number of dependencies. What really boosts speed is the ability to not refresh targets that already exist.
A dependency hash is really a hash of hashes, which in turn are provided by the storr cache for free. And file modification times allow drake to skip rehashing import files a lot of the time. So in ordinary cases, dependency hashing/checking is not the bottleneck.
I see. Is there a way you can think of to get drake to truly "rush" for the case of building a small number of missing targets with a precalculated graph?
It's always a tricky tradeoff between reproducibility and speed. Drake was optimized more for reproducibility, so clean modifications in the other direction are hard.
In an upcoming wave of commits, the following should not sacrifice reproducibility if you already have all the imports cached.
make(..., skip_imports = TRUE, trigger = "missing", graph = YOUR_GRAPH)
Previously, skip_imports = TRUE put any built targets out of date because I was pruning the graph early. But in the code I am about to commit, I prune config$graph_remaining_targets at the last minute, which does not affect the graph used for dependency hashes.
Stay tuned for when #153 is merged. Then, make(..., skip_imports = TRUE, trigger = "missing", graph = YOUR_GRAPH) will be very close to just rushing, and targets will be up to date under the right conditions.
FYI: I just started a project with 1463 targets, and I really like having triggers!
Make that 8761 targets...