Drake: Consider different levels of checks

Created on 2 Nov 2017 · 35Comments · Source: ropensci/drake

This one got lost in a previous thread.

I quite liked remake's different levels of checking: "exists" (current if the target exists), "depends" (current if exists and dependencies unchanged), "code" (current if exists and code unchanged) or "all" (current if exists and both dependencies and code unchanged).

The main use case I see is while testing or initially developing a project. My current project takes around 10 minutes to determine if it's up to date, but all I'm really trying to determine as I develop is if all the targets have been built at all, since I know there haven't been any upstream changes to code or data inputs. Having an option to lighten the level of checking would help development a lot for projects that have either large data or a large number of targets.

advanced new feature

Source

kendonB

👍1

All 35 comments

For now, does built() cover the main use case? (See also cached() and imported()). That will tell you if the targets are built at all, and all it does is read from the cache.

I think the best way to solve this one is through #129. When we spread out the levels of the cache, we will also refactor how targets are checked. In my opinion, that is the time to expose more user-side functionality. Does that sound reasonable? Thank you for speaking up.

wlandau-lilly on 2 Nov 2017

I can see the distinction between the "depends" and "code" checking. You may want to know why your targets are out of date. If it is because of an accidental change to a workflow plan command, you have the chance to correct it before a target is incorrectly built (and possibly mangled).

wlandau-lilly on 2 Nov 2017

Regarding whether built solves this, I think there's still a use for exists checking when you want to quickly build remaining targets and you're sure that inputs haven't changed.

There are also cases where users can rule out certain inputs ever changing and it would be nice to be able to allow drake to skip checking those inputs.

kendonB on 2 Nov 2017

Ah, I think I get it now. These are different build criteria for make(), not just user-side inspection tools. I am fully on board now, but it will have to wait until #129 (which I have also decided I will also fix after the next CRAN release).

wlandau-lilly on 2 Nov 2017

On a further look, accommodating exists will be harder than I thought. Early on for each parallelizable stage, before any build decisions are made, drake retrieves or computes all the hashes of the targets and imports in that stage. For speed, it is important to get the hashes for the stage all together, and it is important that they all stay in memory. For exists, we would need to skip hashing and avoid storing the dependency hashes.

wlandau-lilly on 2 Nov 2017

@kendonB what about an optional TRUE/FALSE "rush" argument to make()? It was thinking it could remove imports and existing targets from the igraph network and then build the rest normally. Otherwise, I think we should leave the hashing alone because targets built with make(..., rush = TRUE) should count as up to date. After #129 is solved, we can think about a user-side function that explains exactly why a given target is listed as out of date. (Is it the command? If not, which one of the dependencies?) I sorely needed something like this in the early days of debugging, and I will probably need it again to debug any work on #129.

wlandau-lilly on 3 Nov 2017

The above would be equivalent to setting check = exists for all the targets? That would cover the main use case for this feature, yes. However, since the user can rule out certain types of changes in many cases, it would still be useful to have target-level flexibility in the level of checking.

kendonB on 3 Nov 2017

Looking at an example remake YAML file using the check field, I am not sure about this feature. Given drake's existing internals, the most time-saving changes would probably be hacks to hash_list() and supporting functions. If we're setting the check level on a per-target basis, it would also probably belong as another column in the workflow plan data frame. But the workflow plan has targets only, which means we would be ignoring the large imported files that would benefit the most from check equal to "exists". I am not saying no, but I am saying this feature needs a clean implementation, and we should think about the implementation details together if it is going to become part of drake. I may even need to leave it to you to submit a pull request.

wlandau-lilly on 3 Nov 2017

I don't see why you say that a target-level flag ignores large imported files? Any target that depends on the large input could be set to check = exists and the large imported file would be ignored. This seems to be the behaviour of remake. Perhaps you were interpreting check = "exists" to mean that the target still updates if dependencies change?

Regardless, I agree that this is a difficult change and should be done carefully. I won't have time to dig into this for several months but may revisit it then.

kendonB on 3 Nov 2017

Also, I think the rush = TRUE idea is excellent and could easily be made backward compatible with a finer system in the future.

kendonB on 3 Nov 2017

👍1

I see. It does not seem too hard to make sure any check = exists targets have no imported dependencies. And I think we are in agreement: the exists mode should treat dependencies only as a means to arrange the jobs, not a way to decide whether to build.

For the rush option: maybe a function called rush() that calls make() from within with_mock(). That way, we can substitute in versions of hash_list() and dependencies() that cut out all the imports and pre-build hashing. Maybe we could use with_mock() to help with remake-style levels of checking.

For compatibility with regular make()s later on, we might want to go back and cache the "depends" hashes, and that may not be possible without imported objects. Maybe rush() will not truly bring targets up to date as far as drake is concerned.

wlandau-lilly on 3 Nov 2017

Hmm... maybe with_mock() is not so appropriate. But I think we could have different versions of hashes() for different check levels. It is already a successful design pattern: we have different run_*() functions, and the appropriate one is invoked with run_parallel_backend():

run_parallel_backend <- function(config){
  get(
    paste0("run_", config$parallelism),
    envir = getNamespace("drake")
  )(config)
}

wlandau-lilly on 4 Nov 2017

I would also add a "file" level of checking to just react to any output target files outside the cache that have changed. Unlike remake, drake checks for modifications in file targets (such as 'report.md' in the basic example) and repairs them if applicable. It would be difficult to accommodate imported files this way, however.

wlandau-lilly on 4 Nov 2017

Also, I like the term "trigger" more than "check level". It seems a bit clearer.

wlandau-lilly on 5 Nov 2017

Not sure I agree with that. "check" pairs with "exists" while "trigger" pairs with "does not exist". But there's probably some pairing of terms which is clearer.

kendonB on 5 Nov 2017

I agree that the terms need to be consistent. Maybe trigger or build could be paired with "missing", etc. Specific word choice aside, I find it easier to think about conditions for building than conditions for skipping. Also, drake already has a function called check() that is totally unrelated to this.

wlandau-lilly on 5 Nov 2017

Here are the design specs so far. For now, I am going with the term "trigger" and choosing slightly different trigger names that make sense. The result will not look exactly like remake, but with a "trigger" column next to the "command" column in the workflow plan data frame, I think the meaning will be clear.

User input

A "trigger" column in the workflow plan data frame with the condition that triggers each command.
Use sanitize_plan() to check that all triggers are valid or the "trigger" column is missing.
The new rush argument to make() will overwrite the trigger column with a character vector with all entries "missing".
Everything should work if the trigger column is absent. The any trigger will apply everywhere in that case (see below).

Trigger types

missing:
- The target is missing from the "kernels" namespace (config$inventory$kernels), or
- the target is a file target and the external output file is missing.
file:
- The target is a file target and the actual file is missing or has changed.
depends: At least one dependency target/import changed (dependency hash), or the target itself is missing.
command: workflow plan command changed or the target itself is missing.
any: any of the above triggers

Building the `igraph`

For each target, strip away the imports from the command dependencies unless the trigger is depends or any.
Can be done post-hoc in build_graph() after the following:

command_deps <- lightly_parallelize(
    plan$command, command_dependencies, jobs = jobs)
  names(command_deps) <- plan$target

Metadata

The meta() function (previously hashes()) should depend on the trigger type. Skip some metadata/hashes for some trigger types, especially missing. Otherwise, time is wasted checking hashes, etc.
Any missing hashes will be filled at the beginning of build_in_hook() (after the decision to build has actually been made).

Checking the triggers and deciding to build

Change target_current() to should_build_target().
should_build_target() will walk though the target metadata (from meta()) and evaluate the triggers for which metadata is available.
Decide to build if at least one of the triggers activates.

During builds

Once the build decision has been made, any skipped hashes must be restored at the beginning of build_in_hook().

Testing

Use the dbug() example.
Call con <- make() with the missing trigger. Throughout, use the testing scenario from get_testing_scenario().
All triggers should show everything up to date. Use outdated(config = con).
Move the intermediatefile.rds target and test that all triggers detect the move.
Restore intermediatefile.rds and check that everything is up to date with all triggers.
Corrupt intermediatefile.rds. The depends, command, and missing triggers should ignore it. Only the file and any triggers should show it out of date.
Run make(..., rush = TRUE) and verify that nothing was built.
Delete the file and run make(..., rush = TRUE). Verify that intermediatefile.rds was built.
Change the command of intermediatefile.rds. Verify that only the command and all triggers notice. Restore the original command.
Remove a dependency of intermediatefile.rds with clean(). Verify that only the depends and any triggers notice.
clean() everything and make(rush = TRUE). Verify that no imports were cached.

wlandau-lilly on 5 Nov 2017

This all sounds superb. I don't suppose it's also possible to have a "code" trigger which would check all the upstream code dependencies of the command? Or is that your vision of the command option?

kendonB on 5 Nov 2017

👍1

@kendonB unless I misunderstand, I think you are describing the depends trigger. Or are you looking for a trigger that only bothers to check upstream dependencies when the command has changed? Such a trigger would require both the command and depends triggers to fire, and it would only look one level back in the DAG.

I am open to new triggers, but I would like to keep the existing command and depends triggers orthogonal.