Drake: If a dynamic target fails, can I avoid remaking those subtargets that succeeded?

Created on 9 Mar 2020 · 42 Comments · Source: ropensci/drake

Prework

[X] Read and abide by drake's code of conduct.
[X] Search for duplicates among the existing issues, both open and closed.
[X] If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Question

I'm working on a plan in which a dynamic target sometimes doesn't finish (e.g., because of an error in a subtarget). This dynamic target has many subtargets, and drake often makes many of those subtargets successfully. When rerunning the plan, is it possible to avoid remaking the already made subtargets?

The example below was generated with reprex, but -- since I didn't figure out how to automatically stop the second subtarget -- getting the outputs requires manually running the code and then stopping it while it's processing the second subtarget. Please let me know if that's confusing.


library(drake)

foo <- function(num) {
  print("I'm running...")
  out <- num
  if(num > 1) {
    print("if first run, user should cancel!")
    Sys.sleep(5)
  } 
  return(out)
} 

plan <- drake_plan(
  numbers = seq_len(2),
  result = target(
    foo(numbers), 
    dynamic = map(numbers))
)

make(plan, seed = 123)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "I'm running..."
#> > subtarget result_b2a5c9b8
#> [1] "I'm running..."
#> [1] "if first run, user should cancel!"

# but now cancel (e.g., Ctrl+C, the stop button in RStudio, or restarting the computer)

# the first subtarget is in the cache
cached()
#> [1] "numbers"         "result_0b3474bd"
# and can be read back with readd()
readd(cached()[2], character_only = TRUE)
#> [1] 1 

# but it looks like it's remade when trying again
make(plan, seed = 123)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "I'm running..."
#> > subtarget result_b2a5c9b8
#> [1] "I'm running..."
#> [1] "if first run, user should cancel!"
#> ■ finalize result

Created on 2020-03-09 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Ubuntu 18.04.4 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2020-03-09                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                         
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.0)                 
#>  backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)                 
#>  base64url     1.4         2018-05-14 [1] CRAN (R 3.6.0)                 
#>  callr         3.4.2       2020-02-12 [1] CRAN (R 3.6.1)                 
#>  cli           2.0.2       2020-02-28 [1] CRAN (R 3.6.1)                 
#>  crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.0)                 
#>  desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.0)                 
#>  devtools      2.2.2       2020-02-17 [1] CRAN (R 3.6.1)                 
#>  digest        0.6.25      2020-02-23 [1] CRAN (R 3.6.1)                 
#>  drake       * 7.11.0.9000 2020-03-05 [1] Github (ropensci/drake@a628f6c)
#>  ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)                 
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.0)                 
#>  fansi         0.4.1       2020-01-08 [1] CRAN (R 3.6.1)                 
#>  filelock      1.0.2       2018-10-05 [1] CRAN (R 3.6.1)                 
#>  fs            1.3.2       2020-03-05 [1] CRAN (R 3.6.1)                 
#>  glue          1.3.1       2019-03-12 [1] CRAN (R 3.6.0)                 
#>  highr         0.8         2019-03-20 [1] CRAN (R 3.6.0)                 
#>  hms           0.5.3       2020-01-08 [1] CRAN (R 3.6.1)                 
#>  htmltools     0.4.0       2019-10-04 [1] CRAN (R 3.6.1)                 
#>  igraph        1.2.4.2     2019-11-27 [1] CRAN (R 3.6.1)                 
#>  knitr         1.28        2020-02-06 [1] CRAN (R 3.6.1)                 
#>  magrittr      1.5         2014-11-22 [1] CRAN (R 3.6.0)                 
#>  memoise       1.1.0       2017-04-21 [1] CRAN (R 3.6.0)                 
#>  pillar        1.4.3       2019-12-20 [1] CRAN (R 3.6.1)                 
#>  pkgbuild      1.0.6       2019-10-09 [1] standard (@1.0.6)              
#>  pkgconfig     2.0.3       2019-09-22 [1] standard (@2.0.3)              
#>  pkgload       1.0.2       2018-10-29 [1] CRAN (R 3.6.0)                 
#>  prettyunits   1.1.1       2020-01-24 [1] CRAN (R 3.6.1)                 
#>  processx      3.4.2       2020-02-09 [1] CRAN (R 3.6.1)                 
#>  progress      1.2.2       2019-05-16 [1] CRAN (R 3.6.0)                 
#>  ps            1.3.2       2020-02-13 [1] CRAN (R 3.6.1)                 
#>  R6            2.4.1       2019-11-12 [1] CRAN (R 3.6.1)                 
#>  Rcpp          1.0.3       2019-11-08 [1] standard (@1.0.3)              
#>  remotes       2.1.1       2020-02-15 [1] CRAN (R 3.6.1)                 
#>  rlang         0.4.5       2020-03-01 [1] CRAN (R 3.6.1)                 
#>  rmarkdown     2.1         2020-01-20 [1] CRAN (R 3.6.1)                 
#>  rprojroot     1.3-2       2018-01-03 [1] CRAN (R 3.6.0)                 
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.6.0)                 
#>  storr         1.2.1       2018-10-18 [1] CRAN (R 3.6.0)                 
#>  stringi       1.4.6       2020-02-17 [1] CRAN (R 3.6.1)                 
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 3.6.0)                 
#>  testthat      2.3.2       2020-03-02 [1] CRAN (R 3.6.1)                 
#>  tibble        2.1.3       2019-06-06 [1] CRAN (R 3.6.0)                 
#>  txtq          0.2.0       2019-10-15 [1] CRAN (R 3.6.1)                 
#>  usethis       1.5.1       2019-07-04 [1] CRAN (R 3.6.0)                 
#>  vctrs         0.2.3       2020-02-20 [1] CRAN (R 3.6.1)                 
#>  withr         2.1.2       2018-03-15 [1] CRAN (R 3.6.0)                 
#>  xfun          0.12        2020-01-13 [1] CRAN (R 3.6.1)                 
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 3.6.1)                 
#> 
#> [1] /home/psadil/R/x86_64-pc-linux-gnu-library/3.6
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

Thanks!

may revisit priority performance reproducibility bug

psadil

All 42 comments

This is happening because drake takes shortcuts to check sub-targets. For the sake of speed, drake checks triggers for the whole dynamic target rather than each sub-target individually. That means in order to skip sub-targets, the metadata for result must already exist in the cache, which unfortunately does not happen until all the sub-targets run at least once. I will need to think more on what we can do about this. (Maybe we can store the metadata early?)

For now, you can select keep_going = TRUE so the dynamic target gets finalized and future make()s can skip sub-targets. (See below.)

Your issues are extremely helpful! They are identifying huge problems in drake I did not even know existed. Please continue to post them as they arise.

library(drake)

foo <- function(num) {
  if(num > 1) {
    print("num is too high!")
    stop("num is too high!")
    Sys.sleep(2)
  } else {
    print("num is low enough.")
  }
  num
} 

plan <- drake_plan(
  numbers = seq_len(2),
  result = target(
    foo(numbers), 
    dynamic = map(numbers)
  )
)

make(plan, keep_going = TRUE)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "num is low enough."
#> > subtarget result_b2a5c9b8
#> [1] "num is too high!"
#> x fail result_b2a5c9b8
#> ■ finalize result

make(plan, keep_going = TRUE)
#> ▶ dynamic result
#> > subtarget result_b2a5c9b8
#> [1] "num is too high!"
#> x fail result_b2a5c9b8
#> ■ finalize result

Created on 2020-03-09 by the reprex package (v0.3.0)

wlandau on 9 Mar 2020

That sounds good. Thanks for the clarification (and again the package/support!). In the meantime, that keep_going = TRUE tip will be pretty helpful.

psadil on 10 Mar 2020

Implementation strategy

It is not enough to simply store the dynamic target's metadata ahead of time. On its own, that would falsely validate sub-targets that already exist. We also need to encode a representation of the dynamic target's trigger state into the names of the sub-targets (similar to the recovery key). With all that in place, we should be able to make the correct build decisions without the computational burden of checking every sub-target's metadata list.

wlandau on 10 Mar 2020

Unfortunately, https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 will invalidate everyone's sub-targets. Still, the long-run time savings are worth it.

wlandau on 10 Mar 2020

Come to think of it, https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 has an even bigger problem: it throws out the user's choice of custom triggers. So that's out of the question now.

I do not like it, but I think we will have to go with some kind of ad hoc tracking mechanism to make this happen.

Dynamic branching, dynamic files, triggers, data recovery, high-performance computing: together, these features have seriously increased drake's conceptual complexity. It will take a lot of refactoring to streamline things again, and some careful planning and testing to fix this issue.

wlandau on 11 Mar 2020

For this issue, we might be able to leverage data recovery somehow. make(recover = TRUE) does not currently recover sub-targets in this case, but maybe it could. After that, we could think about making recover = TRUE the default.

wlandau on 11 Mar 2020

The downside is that we would be repurposing a system that was not designed for the task. Also, it requires checking sub-target metadata, which could be slow.

wlandau on 11 Mar 2020

So we need a new tracking mechanism. Again, I do not like the extra conceptual complexity, but it seems like the only way to keep drake running fast without interfering with other parts of the design. Proposal:

When a sub-target completes successfully, make a note of it in an ad hoc storr namespace (a "dynamic progress namespace"). The namespace name should begin with the prefix "dyn-" and contain the name and ~~recovery key~~ (a special key that only considers the triggers you are actually using) of the parent target. For speed, we should only store keys, not data. (We already do this for set_progress().)
If the parent target gets the chance to finalize, delete the whole namespace, as well as any other namespaces that start with "dyn--". Note the hyphens here. Hyphens are illegal in target names, and that prevents us from accidentally removing another target's dynamic progress namespace.
Suppose we call make() again and the parent did not finalize last time. Then there should be keys in the dynamic progress namespace. If the static dependencies remain unchanged (and if the condition trigger is not activated), we should be able to use those keys to avoid registering sub-targets that succeeded before.
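
To make the three steps above concrete, here is a minimal base R sketch. It is not drake's implementation: the cache is a plain environment standing in for a storr, and `progress_namespace()`, `mark_done()`, `finalize_parent()`, and `pending_subtargets()` are hypothetical helper names invented for illustration. The trigger keys are placeholder strings.

```r
# Hypothetical stand-in for a storr cache: each namespace is an environment of keys.
cache <- new.env()

# Namespace name: "dyn-" prefix, parent target name, and a key for its trigger state.
progress_namespace <- function(parent, trigger_key) {
  paste("dyn", parent, trigger_key, sep = "-")
}

# Step 1: after a sub-target succeeds, record its name only (no data).
mark_done <- function(cache, parent, trigger_key, subtarget) {
  ns <- progress_namespace(parent, trigger_key)
  if (!exists(ns, envir = cache, inherits = FALSE)) {
    assign(ns, new.env(), envir = cache)
  }
  assign(subtarget, TRUE, envir = get(ns, envir = cache))
  invisible(ns)
}

# Step 2: when the parent finalizes, drop every progress namespace of that parent.
# The hyphen after the name keeps "dyn-result-" from matching other targets.
finalize_parent <- function(cache, parent) {
  prefix <- paste0("dyn-", parent, "-")
  all_ns <- ls(cache)
  stale <- all_ns[startsWith(all_ns, prefix)]
  rm(list = stale, envir = cache)
  invisible(stale)
}

# Step 3: on the next run, only register sub-targets that are not recorded yet.
pending_subtargets <- function(cache, parent, trigger_key, subtargets) {
  ns <- progress_namespace(parent, trigger_key)
  if (!exists(ns, envir = cache, inherits = FALSE)) {
    return(subtargets)
  }
  setdiff(subtargets, ls(get(ns, envir = cache)))
}

subtargets <- c("result_0b3474bd", "result_b2a5c9b8")

# The first sub-target succeeded before the run was interrupted.
mark_done(cache, "result", "926684c7", "result_0b3474bd")
pending <- pending_subtargets(cache, "result", "926684c7", subtargets)
print(pending)

# A different trigger key (say, after a dependency changed) maps to a different
# namespace, so nothing is skipped.
pending_changed <- pending_subtargets(cache, "result", "d8df0a05", subtargets)
print(pending_changed)
```

Because the trigger key is part of the namespace name, progress recorded under an outdated trigger state is simply never consulted, which is how this design avoids falsely validating old sub-targets.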

wlandau on 11 Mar 2020

Important to note: the proposal above does not invalidate everyone's dynamic targets!

wlandau on 11 Mar 2020

Another advantage of https://github.com/ropensci/drake/issues/1209#issuecomment-597666346 over https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 is that if we went with the latter, the number of storr keys would explode. As it is, dynamic branching leaves a lot of unused keys behind, which users should regularly clean out with clean(list = cached_unplanned(plan)). (I need to make that more obvious.)
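
A rough sketch of that cleanup, assuming a throwaway project directory (the `clean_demo` folder and the toy plan are made up for illustration). The length check is a precaution I am adding on the assumption that calling clean() with no targets at all would clear far more than intended.

```r
library(drake)

# Work in a throwaway directory so the .drake/ cache does not touch a real project.
project <- file.path(tempdir(), "clean_demo")
dir.create(project, showWarnings = FALSE)
old_wd <- setwd(project)

plan <- drake_plan(
  numbers = seq_len(2),
  result = target(numbers + 1, dynamic = map(numbers))
)
make(plan)

# Changing the upstream values creates new sub-targets; the old keys stay behind.
plan <- drake_plan(
  numbers = seq_len(2) + 10,
  result = target(numbers + 1, dynamic = map(numbers))
)
make(plan)

# Keys in the cache that the current plan no longer uses.
stale <- cached_unplanned(plan)

# Only clean when there is something to remove.
if (length(stale) > 0) {
  clean(list = stale)
}

setwd(old_wd)
```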

wlandau on 12 Mar 2020

https://github.com/ropensci/drake/commit/e72a3f7256c6ad9ec0f369567d2b88970a3be972 seems to work! Nice

kendonB on 12 Mar 2020

Glad to hear it. I actually just realized we need the full recovery key after all, so please hold off until I submit and merge a PR before you use it for serious work.

wlandau on 12 Mar 2020

On second thought, let's revert https://github.com/ropensci/drake/commit/c60ce32d13caca9c63936f92bf07dc3c0854714a to keep everyone's targets valid. That bug is annoying but does no tangible harm.

wlandau on 12 Mar 2020

Rethinking the implementation here. Because of https://github.com/richfitz/storr/issues/121, the ad hoc storr namespaces do not really go away. Yes, we clear them, but the folders are still there, and they could add up.

drake_cache()$list_namespaces()
[1] "dyn-y-926684c7" "dyn-y-d8df0a05" "dyn-y-f4c3c1c6" "memoize"        "meta"          
[6] "objects"        "progress"       "recover"        "session"

What we need is a single namespace and more descriptive keys.
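
The leftover namespaces can be explored outside drake with a small storr sketch. This is only an illustration on a temporary RDS storr: the namespace and key names are copied from or modeled on the listing above, and whether the cleared namespace still shows up in `list_namespaces()` depends on the installed storr version.

```r
library(storr)

# A throwaway RDS storr, standing in for drake's cache.
store <- storr_rds(tempfile("storr-demo-"))

# Record a finished sub-target in an ad hoc namespace, then clear the namespace.
ns <- "dyn-y-926684c7"
store$set("y_0b3474bd", TRUE, namespace = ns)
store$clear(ns)

# The keys are gone after clearing.
remaining_keys <- store$list(ns)
print(remaining_keys)

# The namespace itself may still be listed, depending on the storr version.
namespaces <- store$list_namespaces()
print(namespaces)

# Remove the temporary storr from disk.
store$destroy()
```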

wlandau on 23 Mar 2020

On reflection, a single namespace would decrease performance. Let's just remove the folders in the special case of RDS storrs. The problem will resolve itself after https://github.com/richfitz/storr/pull/122 is merged.

wlandau-lilly on 23 Mar 2020

Need to revert https://github.com/ropensci/drake/commit/caf6084d08cd955ff7c1ac02f8dd80f0948b83a2 after https://github.com/richfitz/storr/pull/122 is merged and the new storr is on CRAN. https://github.com/ropensci/drake/commit/caf6084d08cd955ff7c1ac02f8dd80f0948b83a2 does the most performant thing, but it assumes knowledge of the internal file structure of RDS storrs, which is not ideal if storr's internals change. (But I do trust Rich to preserve backward compatibility.)

wlandau-lilly on 23 Mar 2020

I'm having this problem. Are you still waiting for a storr version to be on CRAN in order to merge the fix? I tried drake 7.12.0.9000, but I'm still having to rebuild subtargets.

ercbk on 29 Apr 2020

That's odd, and it should not be happening in 7.12.0.9000. I tested the patch pretty aggressively, and tests are still passing on my end. Would you post a reprex so I can take a look?

https://github.com/ropensci/drake/blob/935f95aa729aac2e2eda0ded821365a602704538/tests/testthat/test-9-dynamic.R#L2168-L2386

wlandau on 30 Apr 2020

I'm not sure how to simulate my target failure. I'm actually planning on creating an issue on the failure later on. I wanted to see if the target would finish, though. It might narrow down the possible causes.
I can point you to the repo and I can push the .drake folder. I tried the example above and that works as expected. Any suggestions on how to go about this?

ercbk on 30 Apr 2020

It is best if we narrow down the exact conditions/behaviors that lead to incorrectly rebuilding sub-targets. Knowing the plan is a good start, but we also need to know when you are make()ing it and what you are changing in between make()s. The cause might not be #1209. Also, how small and fast can you make your plan and still reproduce the issue?

wlandau on 30 Apr 2020

And for what it's worth, r_make() resolves much of the brittleness of make(). I recommend having a look at https://books.ropensci.org/drake/projects.html#safer-interactivity.
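
A minimal sketch of that workflow, assuming the default script name `_drake.R` and a throwaway project directory (the `r_make_demo` folder, the function `f()`, and the two-target plan are made up for illustration). r_make() runs the script in a fresh R session, so leftovers in the interactive session do not affect the build.

```r
library(drake)

# Throwaway project directory for the demonstration.
project <- file.path(tempdir(), "r_make_demo")
dir.create(project, showWarnings = FALSE)
old_wd <- setwd(project)

# _drake.R defines everything the plan needs and ends with drake_config().
writeLines(
  c(
    "library(drake)",
    "f <- function(x) x + 1",
    "plan <- drake_plan(x = 1, y = f(x))",
    "drake_config(plan)"
  ),
  "_drake.R"
)

# Build the plan in a clean, separate R process.
r_make()

# Results are read back from the cache in the project directory.
y_value <- readd(y)
print(y_value)

setwd(old_wd)
```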

wlandau on 30 Apr 2020

Regarding "when", are you talking about the actual times?
My sequence of changes has gone something like this:

1. I had this portion of my project in a local, unversioned directory while I got drake working. The primary targets of the drake_plan get repeated for each sample size, so eventually the targets will be repeated 4 times, for n = 100, 800, 2000, and 5000. The plan ran fine for n = 100 and n = 800, so I moved it to the versioned directory.
2. The versioned directory is a separate renv environment. I ran make and got the "progress_bar" error. I thought the error might have something to do with a couple of the future_map .progress arguments I had in the code (even though I had them set to FALSE), so I removed them. Of course, that had nothing to do with it. I read the issue, installed the progress package (and maybe some other package I forget), and it ran, but that unfortunately triggered a rebuild of targets.
3. The n = 800 targets were built, but it failed on the transition to the n = 2000 target, ncv_results_2000. Even though the build fails, the instances still run the code, so I terminate the instances and start new ones. I re-ran make, and the first 2 n = 2000 subtargets got built. The 3rd subtarget failed.
4. I've since attempted to start make with some new arguments to try to deal with the failure, but I'm pretty sure I haven't changed any of the other code in the functions or plan (update: I've checked the local and remote commits, and I haven't changed anything else). Now, when I try to restart, though, it tries to rebuild the first subtarget. I think I tried to run it with the previous make arguments and it still tried to rebuild the first target.
I also added the timeout argument to makeClusterPSOCK, added another setting in a saved PuTTY session, and the IPs have changed of course, but I wouldn't think any of that would affect what's going on with the subtargets.

Here's the new make with the added arguments:

make(
      plan,
      verbose = 1,
      session_info = FALSE,
      retries = 2,
      lock_envir = FALSE,
      history = FALSE,
      log_progress = FALSE,
      jobs_preprocess = 7
)

and a pic of that last fail
https://www.dropbox.com/s/ryx92zaq3vlqgxx/Screenshot%20%2853%29.png?dl=0

Need to build a couple charts for another project, but I'll look into r_make either later tonight or tomorrow morning.

current session info


- Session info ------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-04-30                  

- Packages ----------------------------------------------------------------------------------------------
 package     * version     date       lib source                         
 assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.2)                 
 backports     1.1.6       2020-04-05 [1] CRAN (R 3.6.3)                 
 base64url     1.4         2018-05-14 [1] CRAN (R 3.6.3)                 
 cli           2.0.1       2020-01-08 [1] CRAN (R 3.6.2)                 
 clipr         0.7.0       2019-07-23 [1] CRAN (R 3.6.2)                 
 codetools     0.2-16      2018-12-24 [1] CRAN (R 3.6.2)                 
 crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.2)                 
 data.table  * 1.12.8      2019-12-09 [1] CRAN (R 3.6.2)                 
 desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)                 
 details     * 0.2.1       2020-01-12 [1] CRAN (R 3.6.3)                 
 digest        0.6.25      2020-02-23 [1] CRAN (R 3.6.2)                 
 dplyr       * 0.8.4       2020-01-31 [1] CRAN (R 3.6.2)                 
 drake       * 7.12.0.9000 2020-04-29 [1] Github (ropensci/drake@935f95a)
 dtplyr      * 1.0.1       2020-01-23 [1] CRAN (R 3.6.2)                 
 fansi         0.4.1       2020-01-08 [1] CRAN (R 3.6.2)                 
 filelock      1.0.2       2018-10-05 [1] CRAN (R 3.6.3)                 
 furrr       * 0.1.0       2018-05-16 [1] CRAN (R 3.6.1)                 
 future      * 1.16.0      2020-01-16 [1] CRAN (R 3.6.2)                 
 globals       0.12.5      2019-12-07 [1] CRAN (R 3.6.1)                 
 glue          1.4.0       2020-04-03 [1] CRAN (R 3.6.2)                 
 hms           0.5.3       2020-01-08 [1] CRAN (R 3.6.2)                 
 httr          1.4.1       2019-08-05 [1] CRAN (R 3.6.1)                 
 igraph        1.2.5       2020-03-19 [1] CRAN (R 3.6.3)                 
 jsonlite      1.6.1       2020-02-02 [1] CRAN (R 3.6.2)                 
 knitr         1.28        2020-02-06 [1] CRAN (R 3.6.2)                 
 listenv       0.8.0       2019-12-05 [1] CRAN (R 3.6.2)                 
 magrittr      1.5         2014-11-22 [1] CRAN (R 3.6.2)                 
 packrat       0.5.0       2018-11-14 [1] CRAN (R 3.6.1)                 
 pacman        0.5.1       2019-03-11 [1] CRAN (R 3.6.2)                 
 pillar        1.4.3       2019-12-20 [1] CRAN (R 3.6.2)                 
 pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 3.6.2)                 
 png           0.1-7       2013-12-03 [1] CRAN (R 3.6.0)                 
 prettyunits   1.1.1       2020-01-24 [1] CRAN (R 3.6.2)                 
 progress      1.2.2       2019-05-16 [1] CRAN (R 3.6.1)                 
 purrr         0.3.3       2019-10-18 [1] CRAN (R 3.6.2)                 
 R6            2.4.1       2019-11-12 [1] CRAN (R 3.6.2)                 
 Rcpp          1.0.3       2019-11-08 [1] CRAN (R 3.6.2)                 
 renv          0.9.3-30    2020-02-22 [1] Github (rstudio/renv@916923a)  
 reticulate    1.14        2019-12-17 [1] CRAN (R 3.6.2)                 
 rlang         0.4.5       2020-03-01 [1] CRAN (R 3.6.3)                 
 rprojroot     1.3-2       2018-01-03 [1] CRAN (R 3.6.1)                 
 rstudioapi    0.11        2020-02-07 [1] CRAN (R 3.6.2)                 
 sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.6.2)                 
 storr         1.2.1       2018-10-18 [1] CRAN (R 3.6.3)                 
 tibble        2.1.3       2019-06-06 [1] CRAN (R 3.6.2)                 
 tidyselect    1.0.0       2020-01-27 [1] CRAN (R 3.6.2)                 
 txtq          0.2.0       2019-10-15 [1] CRAN (R 3.6.3)                 
 vctrs         0.2.4       2020-03-10 [1] CRAN (R 3.6.3)                 
 withr         2.1.2       2018-03-15 [1] CRAN (R 3.6.1)                 
 xfun          0.12        2020-01-13 [1] CRAN (R 3.6.2)                 
 xml2          1.2.2       2019-08-09 [1] CRAN (R 3.6.1)                 

[1] C:/Users/tbats/Documents/R/Projects/nested-cross-validation-comparison/renv/library/R-3.6/x86_64-w64-mingw32
[2] C:/Users/tbats/AppData/Local/Temp/RtmpGMIymj/renv-system-library

ercbk on 30 Apr 2020

Kudos on persevering to this point. Changes to make() arguments should not invalidate targets except for seed and format. Unfortunately, I cannot replicate how you interacted with your project in 1-4, which is why reprexes on downsized examples are helpful.

You can use the deps_profile() function to see what drake thinks about the state of dependencies. If it thinks at least one upstream function or target changed since last time, it will tell you. Here is a reprex to demonstrate.

library(drake)

f <- function(x) {
  x + 1
}

plan <- drake_plan(x = 1, y = f(x))

make(plan)
#> ▶ target x
#> ▶ target y

deps_profile(y, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "7974fa383d985540" "7974fa383d985540"
#> 2 depend   FALSE   "4ce8d6b05dc7e1f9" "4ce8d6b05dc7e1f9"
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "965938315"        "965938315"

# Change a function.
f <- function(x) {
  x + 2
}

# Register the change in the cache.
make(plan, skip_targets = TRUE)

# The `changed` column is now `TRUE` in the `depend` row.
deps_profile(y, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "7974fa383d985540" "7974fa383d985540"
#> 2 depend   TRUE    "4ce8d6b05dc7e1f9" "cd6bb9a6ec1eea7a"
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "965938315"        "965938315"

Created on 2020-05-01 by the reprex package (v0.3.0)

By the way, I see you are using custom PSOCK clusters for parallel computing. drake has built-in high-performance computing, which may be more convenient and could give you more parallel efficiency in some cases: https://books.ropensci.org/drake/hpc.html. Since you are running Linux, it is straightforward to use clustermq's multicore backend. And if you have a computing cluster with a resource manager, even better. Sketch:

library(drake)
# install.packages("clustermq")
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 4)

wlandau on 1 May 2020

Reprex doesn't find anything: files or functions (unless I use package::function). I've tried inside the project environment and outside of it with the working directory set to the project directory. I don't understand it. It's worked fine before. It doesn't find the plan-kj.R file with source either.


library(drake)

error_FUN <- function(y_obs, y_hat){
      y_obs <- unlist(y_obs)
      y_hat <- unlist(y_hat)
      Metrics::mae(y_obs, y_hat)
}

method <- "kj"
algorithms <- list("glmnet", "rf")
repeats <- seq(1:5)
grid_size <- 100

plan <- drake_plan(
      # model functions for each algorithm
      mod_FUN_list = create_models(algorithms),
      # data used to estimate out-of-sample error
      # noise_sd, seed settings are the defaults
      large_dat = mlbench_data(n = 10^5,
                               noise_sd = 1,
                               seed = 2019),
      # sample size = 100
      sim_dat_100 = mlbench_data(100),
      # hyperparameter grids for each algorithm
      # This probably doesn't need to be a "dynamic" target since mtry is only concerned about the number of columns in data (see script), but I'll do it anyways
      params_list_100 = create_grids(sim_dat_100,
                                     algorithms,
                                     size = grid_size),
      # create a separate ncv data object for each repeat value
      ncv_dat_100 = create_ncv_objects(sim_dat_100,
                                       repeats,
                                       method),
      # runs nested-cv and compares ncv error with out-of-sample error
      # outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters 
      ncv_results_100 = target(
            run_ncv(ncv_dat_100,
                    sim_dat_100,
                    large_dat,
                    mod_FUN_list,
                    params_list_100,
                    error_FUN,
                    method),
            dynamic = map(ncv_dat_100)
      ),
      # add index columns to identify the results according to sample size and number of repeats
      perf_results_100 = tibble(n = 100, repeats = repeats) %>%
            bind_cols(ncv_results_100),

      # repeat for the rest of the sample sizes
      # sample size = 800
      sim_dat_800 = mlbench_data(800),
      params_list_800 = create_grids(sim_dat_800,
                                     algorithms,
                                     size = grid_size),
      ncv_dat_800 = create_ncv_objects(sim_dat_800,
                                       repeats,
                                       method),
      ncv_results_800 = target(
            run_ncv(ncv_dat_800,
                    sim_dat_800,
                    large_dat,
                    mod_FUN_list,
                    params_list_800,
                    error_FUN,
                    method),
            dynamic = map(ncv_dat_800)
      ),
      perf_results_800 = tibble(n = 800, repeats = repeats) %>%
            bind_cols(ncv_results_800),

      # sample size = 2000
      sim_dat_2000 = mlbench_data(2000),
      params_list_2000 = create_grids(sim_dat_2000,
                                      algorithms,
                                      size = grid_size),
      ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
                                        repeats,
                                        method),
      ncv_results_2000 = target(
            run_ncv(ncv_dat_2000,
                    sim_dat_2000,
                    large_dat,
                    mod_FUN_list,
                    params_list_2000,
                    error_FUN,
                    method),
            dynamic = map(ncv_dat_2000)
      ),
      perf_results_2000 = tibble(n = 2000, repeats = repeats) %>%
            bind_cols(ncv_results_2000)

)


drake::deps_profile(ncv_results_2000, plan)
#> Error in deps_profile_impl(target = ncv_results_2000, config = config): no recorded metadata for target ncv_results_2000.
rlang::last_error()
#> Error: Can't show last error because no error was recorded yet
rlang::last_trace()
#> Error: Can't show last error because no error was recorded yet
deps_profile(ncv_results_2000_7d80d14d, plan)
#> Error in deps_profile_impl(target = ncv_results_2000_7d80d14d, config = config): no recorded metadata for target ncv_results_2000_7d80d14d.
rlang::last_error()
#> Error: Can't show last error because no error was recorded yet
rlang::last_trace()
#> Error: Can't show last error because no error was recorded yet
drake::loadd(ncv_results_2000_7d80d14d)
#> Error in loadd_handle_empty_targets(targets = targets, cache = cache, : object 'ncv_results_2000_7d80d14d' not found

Created on 2020-05-02 by the reprex package (v0.3.0)

Here's what happens when I execute the script myself (starting after plan). Apologies for the readability.



drake::deps_profile(ncv_results_2000, plan)
# Error: Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
#       In old_values != new_values :
#       longer object length is not a multiple of shorter object length


rlang::last_error()
# <error/tibble_error_incompatible_size>
#       Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
#       1. drake::deps_profile(ncv_results_2000, plan)
# 4. drake::deps_profile_impl(target = ncv_results_2000, config = config)
# 5. drake:::weak_tibble(...)
# 6. tibble::tibble(...)
# 7. tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
# Run `rlang::last_trace()` to see the full context.


rlang::last_trace()
# <error/tibble_error_incompatible_size>
#       Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
#       x
# 1. \-drake::deps_profile(ncv_results_2000, plan)
# 2.   +-base::eval(call)
# 3.   | \-base::eval(call)
# 4.   \-drake::deps_profile_impl(target = ncv_results_2000, config = config)
# 5.     \-drake:::weak_tibble(...)
# 6.       \-tibble::tibble(...)
# 7.         \-tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8.           \-tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])


deps_profile(ncv_results_2000_7d80d14d, plan)
# Error: Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
#       In old_values != new_values :
#       longer object length is not a multiple of shorter object length


rlang::last_error()
# <error/tibble_error_incompatible_size>
#       Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
#       1. drake::deps_profile(ncv_results_2000_7d80d14d, plan)
# 4. drake::deps_profile_impl(...)
# 5. drake:::weak_tibble(...)
# 6. tibble::tibble(...)
# 7. tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
# Run `rlang::last_trace()` to see the full context.


rlang::last_trace()
# <error/tibble_error_incompatible_size>
#       Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
#       x
# 1. \-drake::deps_profile(ncv_results_2000_7d80d14d, plan)
# 2.   +-base::eval(call)
# 3.   | \-base::eval(call)
# 4.   \-drake::deps_profile_impl(...)
# 5.     \-drake:::weak_tibble(...)
# 6.       \-tibble::tibble(...)
# 7.         \-tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8.           \-tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])


drake::loadd(ncv_results_2000_7d80d14d)
# ncv_results_2000_7d80d14d
# # A tibble: 1 x 7
# method oos_error ncv_error delta_error chosen_algorithm  mtry trees
# <chr>      <dbl>     <dbl>       <dbl> <chr>            <int> <int>
#       1 kj          1.39      1.36      0.0214 rf                   5  1325

drake::loadd(ncv_results_2000)
# Error in loadd_handle_empty_targets(targets = targets, cache = cache,  : 
# object 'ncv_results_2000' not found

ercbk on 2 May 2020

Just fyi, my .drake file is a couple layers above my make and plan scripts. Wonder if that's what's messing with the reprex function.

ercbk on 3 May 2020

Reprex doesn't find anything: files or functions (unless I use package::function). I've tried inside the project environment and outside of it with the working directory set to the project directory. I don't understand it. It's worked fine before. It doesn't find the plan-kj.R file with source either.

It may be frustrating, yes, but it would really help to identify and fix the problem if it is possible to whittle down your project into something that fits into reprex and is easier to understand and run.

Just fyi, my .drake file is a couple layers above my make and plan scripts. Wonder if that's what's messing with the reprex function.

Part of what I am requesting is a script that creates an entirely new .drake cache that recreates the problem. Once we can reproduce it from scratch using automated code, we have a much better shot of figuring out what is going on. Otherwise, it is difficult to speculate on the more human-related ad hoc steps you may have taken to set up your project.

wlandau on 4 May 2020

Yeah, I agree. Unfortunately, it's a complicated project that takes days to run on my desktop even for the small datasets. I guess I'll try to run the whole thing with very, very small datasets to get the runtime down to something manageable, and see if I can recreate the issue once I get reprex to work. In terms of reducing the complexity of the code, I'm not sure how that can be done here. I'll talk to my duck, though, and see if we can figure out something. :)
Any ideas about what's happening with deps_profile or are the error messages not much help without the reprex?

ercbk on 4 May 2020

Any ideas about what's happening with deps_profile or are the error messages not much help without the reprex?

Again, based on what we have to go on right now, I am not sure. But I do have a suspicion, and I just pushed a patch to try to deal with it: https://github.com/ropensci/drake/commit/50eb9f5d056a1ef512cc341e84a6714806d43cbb. You might have better luck if you install the update (though I cannot make guarantees).

wlandau on 4 May 2020

The good news is that I've figured out how reprex determines its working directory. You have to specify the outfile argument, or else it works out of a temp directory. I guess I've only used it for simple situations in the past, because I didn't remember that that's how it worked. The bad news is that, if I'm reading this correctly, drake thinks that every target needs to be rebuilt. Still working on a simplified version of the project. I think I interrupted make twice in a row once I saw it was trying to rebuild targets. Wonder if that made things worse.


library(drake)

source("performance-experiment/Kuhn-Johnson/plan-kj.R")

deps_profile(ncv_results_2000, plan)
#> # A tibble: 5 x 4
#>   name     changed old              new               
#>   <chr>    <lgl>   <chr>            <chr>             
#> 1 command  TRUE    4f18907a711e6c41 "d958fb47b0b8f88d"
#> 2 depend   NA      <NA>             "aef603d261217ef0"
#> 3 file_in  NA      <NA>             ""                
#> 4 file_out NA      <NA>             ""                
#> 5 seed     NA      <NA>             "540153646"

deps_profile(ncv_results_2000_7d80d14d, plan)
#> # A tibble: 5 x 4
#>   name     changed old              new               
#>   <chr>    <lgl>   <chr>            <chr>             
#> 1 command  TRUE    4f18907a711e6c41 "ef46db3751d8e999"
#> 2 depend   NA      <NA>             ""                
#> 3 file_in  NA      <NA>             ""                
#> 4 file_out NA      <NA>             ""                
#> 5 seed     FALSE   2136092035       "2136092035"

loadd(ncv_results_2000_7d80d14d)
ncv_results_2000_7d80d14d
#> # A tibble: 1 x 7
#>   method oos_error ncv_error delta_error chosen_algorithm  mtry trees
#>   <chr>      <dbl>     <dbl>       <dbl> <chr>            <int> <int>
#> 1 kj          1.39      1.36      0.0214 rf                   5  1325

outdated(plan)
#>  [1] "large_dat"         "mod_FUN_list"      "ncv_dat_100"      
#>  [4] "ncv_dat_2000"      "ncv_dat_800"       "ncv_results_100"  
#>  [7] "ncv_results_2000"  "ncv_results_800"   "params_list_100"  
#> [10] "params_list_2000"  "params_list_800"   "perf_results_100" 
#> [13] "perf_results_2000" "perf_results_800"  "sim_dat_100"      
#> [16] "sim_dat_2000"      "sim_dat_800"

deps_profile(ncv_results_800, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "55016b093d194572" "55016b093d194572"
#> 2 depend   TRUE    "26874d8b848c3d63" "4b6794fb258a5773"
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "1942602105"       "1942602105"
deps_profile(ncv_results_100, plan)
#> # A tibble: 5 x 4
#>   name     changed old              new               
#>   <chr>    <lgl>   <chr>            <chr>             
#> 1 command  TRUE    4f18907a711e6c41 "ab37a0ecea2b1f23"
#> 2 depend   NA      <NA>             "5bd1d20abf0b48f9"
#> 3 file_in  NA      <NA>             ""                
#> 4 file_out NA      <NA>             ""                
#> 5 seed     NA      <NA>             "395977714"

drake_history()
#> Error in parse(text = command): <text>:2:24: unexpected symbol
#> 1: run_ncv(ncv_dat_100, sim_dat_100, large_dat, mod_FUN_list, params_list, 
#> 2:     error_FUN, method) map
#>                           ^

Created on 2020-05-05 by the reprex package (v0.3.0)

ercbk on 6 May 2020

For everything you showed except ncv_results_800, it looks like you changed the command in the plan. For ncv_results_800, drake thinks at least one of your dependency targets/functions/globals changed at some point, which is trickier to identify. Both these things trigger updates to all the sub-targets.
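To make the first case concrete, here is a minimal sketch of how an edited command shows up in deps_profile(). The toy plan, the target name x, and the temporary cache are assumptions made for illustration; none of them come from the project discussed in this thread.

```r
library(drake)

# Hypothetical toy plan and throwaway cache, not the plan from this thread.
cache <- new_cache(tempfile())
plan <- drake_plan(x = 1 + 1)
make(plan, cache = cache)

# Edit the command of `x`, then ask drake why the target is out of date.
plan <- drake_plan(x = 1 + 2)
profile <- deps_profile(x, plan, cache = cache)
profile

# The edited target should also be listed as outdated.
outdated(plan, cache = cache)
```

If the `command` row reports a change, the plan itself was edited. If only the `depend` row changes, the culprit is an upstream target, function, or global, which is the harder case described above.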

wlandau on 6 May 2020

These are the first 4 targets in my plan, and the profiles say their dependencies changed. It doesn't make any sense. The first 3 targets' dependencies are constants that wouldn't be changed. The commands for n = 100 and n = 2000 didn't change either. There's too much change here for me not to have remembered it, and it doesn't show up in the commits. So is there anything else to look at or do before I try to recreate this in miniature?
Btw I don't see any way to reduce the complexity of the code, but the plan is to create a public repo with very slimmed down data objects that you can clone. That way it shouldn't take long to run and you'll be able to create the .drake file. This is assuming I can recreate the failure, which I'm not confident will happen since I think the size of the objects might be the cause. Will that work for you? Just to double-check: you looking at my current .drake file will not help, right?

method <- "kj"
algorithms <- list("glmnet", "rf")
repeats <- seq(1:5)
grid_size <- 100

plan <- drake_plan(
   # model functions for each algorithm
   mod_FUN_list = create_models(algorithms),
   # data used to estimate out-of-sample error
   # noise_sd, seed settings are the defaults
   large_dat = mlbench_data(n = 10^5,
                            noise_sd = 1,
                            seed = 2019),
   # sample size = 100
   sim_dat_100 = mlbench_data(100),


library(drake)

source("performance-experiment/Kuhn-Johnson/plan-kj.R")

vis_drake_graph(plan)


deps_profile(mod_FUN_list, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "896777bfc4467875" "896777bfc4467875"
#> 2 depend   TRUE    "19a0f5400146eab4" "26cafb820b726b9a"
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "787681411"        "787681411"

deps_profile(large_dat, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "f7b4ac0ab068d769" "f7b4ac0ab068d769"
#> 2 depend   TRUE    "1a4e85c7d355a240" ""                
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "1889768483"       "1889768483"

deps_profile(sim_dat_100, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "31129d94b2c9b515" "31129d94b2c9b515"
#> 2 depend   TRUE    "1a4e85c7d355a240" ""                
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "90873426"         "90873426"

deps_profile(params_list_100, plan)
#> # A tibble: 5 x 4
#>   name     changed old                new               
#>   <chr>    <lgl>   <chr>              <chr>             
#> 1 command  FALSE   "7cc8ecbd68b25f69" "7cc8ecbd68b25f69"
#> 2 depend   TRUE    "85798dc734adfacd" "1c4e453d1ffbc8d4"
#> 3 file_in  FALSE   ""                 ""                
#> 4 file_out FALSE   ""                 ""                
#> 5 seed     FALSE   "1602630628"       "1602630628"

Created on 2020-05-10 by the reprex package (v0.3.0)

ercbk on 11 May 2020

So is there anything else to look at or do before I try to recreate this in miniature?

When was the last time you built the project from scratch? And how often do you restart your R session? It is best to start make() with a fresh global environment (especially if you're not using the custom envir argument). r_make() guarantees that the global environment is completely fresh and is created in the same way every time you run the pipeline. This is critically important for a project as large and complicated as yours. So a low-overhead thing to try is to put these lines and plan-kj.R into a _drake.R file, run a fresh copy of your project with r_make(), and see if things stay up to date.

Certain hidden circularities can also crop up. For example, the following plan is self-invalidating. I tried to catch most of this with lock_envir = TRUE in drake version 7, but weird stuff could theoretically crop up with environment variables etc.

library(drake)

a <- 1

plan <- drake_plan(
  x = a,
  y = assign("a", x + 1, envir = globalenv())
)

make(plan, lock_envir = FALSE)
#> > target x
#> > target y

make(plan, lock_envir = FALSE)
#> > target x
#> > target y

Created on 2020-05-11 by the reprex package (v0.3.0)

The existing repo has heavy package requirements and the targets still have heavy runtimes. Specifically, ncv_results_2000 is very slow for an example used for debugging purposes. The targets before ncv_results_2000, however, do stay up to date when I try to run things.

Also, I notice you are using custom future multicore processing, which I skipped for the sake of convenience. You might have a look at drake's built-in high-performance computing, which has multicore and cluster-powered capabilities: https://books.ropensci.org/drake/hpc.html.

wlandau on 11 May 2020

I haven't rebuilt the project from scratch since moving the project to the version controlled directory.
The ncv_results_2000 target failed at night and I didn't rerun it until the next morning. I'm pretty sure I closed R and started with a fresh session the next morning.
Along with the lines you highlighted for _drake.R, if I want verbose = 1, lock_envir = FALSE, then I need to put those args into a drake_config(), correct? And do I run r_make in the console with _drake.R in the project root directory?
I wouldn't mind trying clustermq, but I'm not sure all customization, template file stuff and SLURM schedulers, etc., applies to my project. As is, I'm only parallelizing within targets. If I knew my targets wouldn't fail, then I'd consider starting 8 instances to parallelize building the targets. Or did I miss something? Is there something that would make the within-target loops run more efficiently?

ercbk on 19 May 2020

Along with the lines you highlighted for _drake.R, if I want verbose = 1, lock_envir = FALSE, then I need to put those args into a drake_config(), correct? And do I run r_make in the console with _drake.R in the project root directory?

Yes to both.
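For reference, a minimal sketch of what such a _drake.R file could look like. The toy plan below is hypothetical and only stands in for the real one; in the actual project the plan would presumably come from sourcing plan-kj.R instead.

```r
# _drake.R, saved in the project root.
# A sketch only: the toy plan below is an assumption, not the real plan.
library(drake)

# In the real project, this block would be replaced by something like
# source("performance-experiment/Kuhn-Johnson/plan-kj.R")
plan <- drake_plan(
  numbers = seq_len(2),
  result = target(numbers + 1, dynamic = map(numbers))
)

# _drake.R needs to end with a call to drake_config().
# Arguments you would normally give to make() go here.
drake_config(plan, verbose = 1, lock_envir = FALSE)
```

With this file in place, calling drake::r_make() from the console (working directory set to the project root) runs the pipeline in a fresh R process. Note that r_make() relies on the callr package.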

I wouldn't mind trying clustermq, but I'm not sure all customization, template file stuff and SLURM schedulers, etc., applies to my project. As is, I'm only parallelizing within targets. If I knew my targets wouldn't fail, then I'd consider starting 8 instances to parallelize building the targets. Or did I miss something? Is there something that would make the within-target loops run more efficiently?

clustermq does have a multicore backend for multicore parallelism on non-Windows machines. (For Windows machines, drake does also have its own future backend.) Other than that, there's always a tradeoff between parallelism within targets and parallelism among targets. If you know not very many targets are going to run at the same time but there is a lot to do within each target, then within-target parallelism seems reasonable. But if a lot of targets are conditionally independent and you want some targets to start while others are still running, drake's built-in parallelism can help. This was probably already intuitive to you, but it is the main thing I think about when writing a new pipeline that needs HPC.
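As a rough sketch of among-target parallelism with the clustermq multicore backend, assuming the clustermq package is installed and the machine is not running Windows. The two-target plan is a hypothetical placeholder, not part of the project in this thread.

```r
library(drake)

# Assumption: clustermq is installed and the OS is not Windows.
options(clustermq.scheduler = "multicore")

# Hypothetical plan: `a` and `b` do not depend on each other,
# so drake is free to build them on different workers.
plan <- drake_plan(
  a = Sys.getpid(),
  b = Sys.getpid()
)

# jobs = 2 requests two persistent workers.
make(plan, parallelism = "clustermq", jobs = 2)
```

Whether this beats within-target parallelism depends on the tradeoff described above: it helps most when many targets are independent of one another.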

wlandau on 19 May 2020

Reran the slimmed down version of the project in a separate repo with same environment and pretty much worked like a dream. Did have one small hiccup in the beginning though. I got a config error on my first run when there wasn't a .drake directory yet — see here and here. Restarted R, re-ran r_make, and everything ran fine. Is there an initialization step that I missed?

Had a connection error after some sub-target builds. Re-ran r_make and it picked up where it left off. No rebuilding of already-successfully-built sub-targets. Added an n = 3000 section to the plan, and again, no triggered rebuilds and it finished smoothly.

Next step, I guess, will be to delete the .drake directory in the real repo, create a _drake.R and try again.

ercbk on 22 May 2020

I could not reproduce those errors running https://github.com/ercbk/temp-nested-cv. I did get connection errors, but those disappeared when I commented out the custom future configuration you had. Looks like you're setting up drake properly.

wlandau on 22 May 2020

What custom future configuration are you talking about? The saved PuTTY session settings?

ercbk on 22 May 2020

https://github.com/ercbk/temp-nested-cv/blob/153e428f35a99ab8a70ac95fbc23d9bb4721c2ef/_drake.R#L33-L63

wlandau on 22 May 2020

Okay, I assume there's a way to do the same thing using the methods you describe in the hpc chapter of your book, but I'm not grasping it. Could you show me the code you're using to connect to the instances?

ercbk on 22 May 2020

I was just using the repo you linked: https://github.com/ercbk/temp-nested-cv. I could not connect to your instances, so I commented out https://github.com/ercbk/temp-nested-cv/blob/153e428f35a99ab8a70ac95fbc23d9bb4721c2ef/_drake.R#L33-L63 to try to run locally. The pipeline successfully started from a fresh cache (no "argument 'config' missing with no default" errors on my end).

wlandau on 22 May 2020

To parallelize among targets using the existing setup you have, you could keep https://github.com/ercbk/temp-nested-cv/blob/153e428f35a99ab8a70ac95fbc23d9bb4721c2ef/_drake.R#L33-L63, disable within-target parallelism, and use the following drake_config():

drake_config(
  plan,
  parallelism = "future",
  jobs = 2, # or however many workers you want
  verbose = 1,
  lock_envir = FALSE,
  jobs_preprocess = 7
)

wlandau on 22 May 2020

OH. When you said you got rid of the connection errors, I thought you had some better way to stabilize my connection to compute this remotely. I see.

ercbk on 22 May 2020
