I'm working on a plan in which a dynamic target sometimes doesn't finish (e.g., because of an error in a subtarget). This dynamic target has many subtargets, and drake often makes many of those subtargets successfully. When rerunning the plan, is it possible to avoid remaking the already made subtargets?
The example below was generated with reprex, but -- since I didn't figure out how to automatically stop the second subtarget -- getting the outputs requires manually running the code and then stopping it while it's processing the second subtarget. Please let me know if that's confusing.
library(drake)
foo <- function(num) {
print("I'm running...")
out <- num
if(num > 1) {
print("if first run, user should cancel!")
Sys.sleep(5)
}
return(out)
}
plan <- drake_plan(
numbers = seq_len(2),
result = target(
foo(numbers),
dynamic = map(numbers))
)
make(plan, seed = 123)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "I'm running..."
#> > subtarget result_b2a5c9b8
#> [1] "I'm running..."
#> [1] "if first run, user should cancel!"
# but now cancel (e.g., ctrl+c, stop button in Rstudio, restart computer)
# the first subtarget is in the cache
cached()
#> [1] "numbers" "result_0b3474bd"
# and can be readd
readd(cached()[2], character_only = TRUE)
#> [1] 1
# but it looks like it's remade when trying again
make(plan, seed = 123)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "I'm running..."
#> > subtarget result_b2a5c9b8
#> [1] "I'm running..."
#> [1] "if first run, user should cancel!"
#> ■ finalize result
Created on 2020-03-09 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Ubuntu 18.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2020-03-09
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> base64url 1.4 2018-05-14 [1] CRAN (R 3.6.0)
#> callr 3.4.2 2020-02-12 [1] CRAN (R 3.6.1)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.2.2 2020-02-17 [1] CRAN (R 3.6.1)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.1)
#> drake * 7.11.0.9000 2020-03-05 [1] Github (ropensci/drake@a628f6c)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.1)
#> filelock 1.0.2 2018-10-05 [1] CRAN (R 3.6.1)
#> fs 1.3.2 2020-03-05 [1] CRAN (R 3.6.1)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.1)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> igraph 1.2.4.2 2019-11-27 [1] CRAN (R 3.6.1)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.1)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.1)
#> pkgbuild 1.0.6 2019-10-09 [1] standard (@1.0.6)
#> pkgconfig 2.0.3 2019-09-22 [1] standard (@2.0.3)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.1)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.1)
#> progress 1.2.2 2019-05-16 [1] CRAN (R 3.6.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.1)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
#> Rcpp 1.0.3 2019-11-08 [1] standard (@1.0.3)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.1)
#> rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.1)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.1)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> storr 1.2.1 2018-10-18 [1] CRAN (R 3.6.0)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.1)
#> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
#> txtq 0.2.0 2019-10-15 [1] CRAN (R 3.6.1)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
#> vctrs 0.2.3 2020-02-20 [1] CRAN (R 3.6.1)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.1)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.1)
#>
#> [1] /home/psadil/R/x86_64-pc-linux-gnu-library/3.6
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
Thanks!
This is happening because drake takes shortcuts to check sub-targets. For the sake of speed, drake checks triggers for the whole dynamic target rather than each sub-target individually. That means in order to skip sub-targets, the metadata for result must already exist in the cache, which unfortunately does not happen until all the sub-targets run at least once. I will need to think more on what we can do about this. (Maybe we can store the metadata early?)
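To make that shortcut concrete, here is a toy model in R. It is only a sketch of the decision described above, not drake's actual internals; the function name and its arguments are made up for illustration.

```r
# Toy model of the shortcut (hypothetical; not drake's real code).
# subtargets: names of all sub-targets of a dynamic target
# succeeded: names of sub-targets whose values are already in the cache
# parent_finalized: TRUE if the parent's metadata was ever stored
subtargets_to_build <- function(subtargets, succeeded, parent_finalized) {
  if (!parent_finalized) {
    # Without parent metadata there is nothing to check triggers against,
    # so every sub-target is scheduled again.
    return(subtargets)
  }
  # Assumes no trigger fired for the parent: only unfinished sub-targets run.
  setdiff(subtargets, succeeded)
}

subtargets <- c("result_0b3474bd", "result_b2a5c9b8")

# Interrupted first make(): one sub-target is cached, parent never finalized.
after_interrupt <- subtargets_to_build(subtargets, "result_0b3474bd", FALSE)

# Parent finalized at least once: the cached sub-target can be skipped.
after_finalize <- subtargets_to_build(subtargets, "result_0b3474bd", TRUE)

print(after_interrupt)
print(after_finalize)
```

In the toy model, the cached sub-target is only skipped in the second case, which matches the behavior reported in the question.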
For now, you can select keep_going = TRUE so the dynamic target gets finalized and future make()s can skip sub-targets. (See below.)
Your issues are extremely helpful! They are identifying huge problems in drake I did not even know existed. Please continue to post them as they arise.
library(drake)
foo <- function(num) {
if(num > 1) {
print("num is too high!")
stop("num is too high!")
Sys.sleep(2)
} else {
print("num is low enough.")
}
num
}
plan <- drake_plan(
numbers = seq_len(2),
result = target(
foo(numbers),
dynamic = map(numbers)
)
)
make(plan, keep_going = TRUE)
#> ▶ target numbers
#> ▶ dynamic result
#> > subtarget result_0b3474bd
#> [1] "num is low enough."
#> > subtarget result_b2a5c9b8
#> [1] "num is too high!"
#> x fail result_b2a5c9b8
#> ■ finalize result
make(plan, keep_going = TRUE)
#> ▶ dynamic result
#> > subtarget result_b2a5c9b8
#> [1] "num is too high!"
#> x fail result_b2a5c9b8
#> ■ finalize result
Created on 2020-03-09 by the reprex package (v0.3.0)
That sounds good. Thanks for the clarification (and again the package/support!). In the meantime, that keep_going = TRUE tip will be pretty helpful.
It is not enough to simply store the dynamic target's metadata ahead of time. On its own, that would falsely validate sub-targets that already exist. We also need to encode a representation of the dynamic target's trigger state into the names of the sub-targets (similar to the recovery key). With all that in place, we should be able to make the correct build decisions without the computational burden of checking every sub-target's metadata list.
Unfortunately, https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 will invalidate everyone's sub-targets. Still, the long-run time savings are worth it.
Come to think of it, https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 has an even bigger problem: it throws out the user's choice of custom triggers. So that's out of the question now.
I do not like it, but I think we will have to go with some kind of ad hoc tracking mechanism to make this happen.
Dynamic branching, dynamic files, triggers, data recovery, high-performance computing: together, these features have turned out to seriously exacerbate drake's conceptual complexity. It will take a lot of refactoring to streamline things again, and some careful planning and testing to fix this issue.
For this issue, we might be able to leverage data recovery somehow. make(recover = TRUE) does not currently recover sub-targets in this case, but maybe it could. After that, we could think about making recover = TRUE the default.
The downside is that we would be repurposing a system that was not exactly designed for the task. Also, it requires checking sub-target metadata, which could be slow.
So we need a new tracking mechanism. Again, I do not like the extra conceptual complexity, but it seems like the only way to keep drake running fast without interfering with other parts of the design. Proposal:
storr namespace (a "dynamic progress namespace"). The namespace name should begin with the prefix "dyn-" and contain the name and set_progress().)
make() again and the parent did not finalize last time. Then, there should be keys in the dynamic progress namespace. If the static dependencies remain unchanged (and if the condition trigger is not activated), we should be able to use those keys to avoid registering sub-targets that succeeded before.
Important to note: the proposal above does not invalidate everyone's dynamic targets!
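A minimal sketch of the bookkeeping this proposal describes, using an in-memory storr. The helper names (dyn_namespace, mark_done, remaining) and the layout of the namespace name after the "dyn-" prefix are assumptions for illustration, not drake's implementation.

```r
library(storr)

cache <- storr_environment()  # in-memory stand-in for a drake cache

# Hypothetical helper: build the namespace name for one dynamic target.
# Only the "dyn-" prefix and the target name come from the proposal;
# `state` is a placeholder for whatever else the name encodes.
dyn_namespace <- function(parent, state) {
  paste0("dyn-", parent, "-", state)
}

# Hypothetical helper: record that a sub-target finished successfully.
mark_done <- function(cache, namespace, subtarget) {
  cache$set(key = subtarget, value = TRUE, namespace = namespace)
}

# Hypothetical helper: sub-targets that have no progress key yet.
remaining <- function(cache, namespace, subtargets) {
  setdiff(subtargets, cache$list(namespace = namespace))
}

ns <- dyn_namespace("result", "926684c7")
subtargets <- c("result_0b3474bd", "result_b2a5c9b8")

# First make(): one sub-target succeeds, then the run is interrupted.
mark_done(cache, ns, "result_0b3474bd")

# Second make(): only the unfinished sub-target needs to be registered.
todo <- remaining(cache, ns, subtargets)
print(todo)

# Once the parent finalizes, the progress keys are no longer needed.
cache$clear(namespace = ns)
```

The point of the separate namespace is that the check is a plain key listing, so no sub-target metadata has to be read.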
Another advantage of https://github.com/ropensci/drake/issues/1209#issuecomment-597666346 over https://github.com/ropensci/drake/issues/1209#issuecomment-597266664 is that if we went with the latter, the number of storr keys would explode. As it is, dynamic branching leaves a lot of unused keys behind, which users should regularly clean out with clean(list = cached_unplanned(plan)). (I need to make that more obvious.)
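For reference, a small sketch of that cleanup on a throwaway cache. The two plans and the tempfile() cache are made up for the demo; a real project would call these functions on its own plan and cache.

```r
library(drake)

cache <- new_cache(path = tempfile())  # throwaway cache for the demo

full_plan <- drake_plan(a = 1, b = 2)
make(full_plan, cache = cache)

# Later, target `b` is dropped from the plan, but its value stays cached.
small_plan <- drake_plan(a = 1)
leftover <- cached_unplanned(small_plan, cache = cache)
print(leftover)

# Guard against an empty selection: calling clean() with no targets
# selected is meant to clear out all targets, which is not wanted here.
if (length(leftover) > 0) {
  clean(list = leftover, cache = cache)
}

still_cached <- cached(cache = cache)
print(still_cached)
```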
Glad to hear it. I actually just realized we need the full recovery key after all, so please hold off until I submit and merge a PR before you use it for serious work.
On second thought, let's revert https://github.com/ropensci/drake/commit/c60ce32d13caca9c63936f92bf07dc3c0854714a to keep everyone's targets valid. That bug is annoying but does no tangible harm.
Rethinking the implementation here. Because of https://github.com/richfitz/storr/issues/121, the ad hoc storr namespaces do not really go away. Yes, we clear them, but the folders are still there, and they could add up.
drake_cache()$list_namespaces()
[1] "dyn-y-926684c7" "dyn-y-d8df0a05" "dyn-y-f4c3c1c6" "memoize" "meta"
[6] "objects" "progress" "recover" "session"
What we need is a single namespace and more descriptive keys.
On reflection, a single namespace would decrease performance. Let's just remove the folders in the special case of RDS storrs. The problem will resolve itself after https://github.com/richfitz/storr/pull/122 is merged.
Need to revert https://github.com/ropensci/drake/commit/caf6084d08cd955ff7c1ac02f8dd80f0948b83a2 after https://github.com/richfitz/storr/pull/122 is merged and the new storr is on CRAN. https://github.com/ropensci/drake/commit/caf6084d08cd955ff7c1ac02f8dd80f0948b83a2 does the most performant thing, but it assumes knowledge of the internal file structure of RDS storrs, which is not ideal if storr's internals change. (But I do trust Rich to preserve back compatibility.)
I'm having this problem. Are you still waiting for a storr version to be on CRAN in order to merge the fix? I tried drake 7.12.0.9000, but I'm still having to re-build subtargets.
That's odd, and it should not be happening in 7.12.0.9000. I tested the patch pretty aggressively, and tests are still passing on my end. Would you post a reprex so I can take a look?
I'm not sure how to simulate my target failure. I'm actually planning on creating an issue on the failure later on. I wanted to see if the target would finish though. It might narrow down the cause possibilities.
I can point you to the repo and I can push the .drake folder. I tried the example above and that works as expected. Any suggestions on how to go about this?
It is best if we narrow down the exact conditions/behaviors that lead to incorrectly rebuilding sub-targets. Knowing the plan is a good start, but we also need to know when you are make()ing it and what you are changing in between make()s. The cause might not be #1209. Also, how small and fast can you make your plan and still reproduce the issue?
And for what it's worth, r_make() resolves much of the brittleness of make(). I recommend having a look at https://books.ropensci.org/drake/projects.html#safer-interactivity.
Regarding "when", are you talking about the actual times?
My sequence of changes has gone something like this:
drake_plan get repeated for each sample size. So eventually, the targets will be repeated 4 times for n = 100, 800, 2000, and 5000. The plan ran fine for n = 100, n = 800, so I moved it to the versioned directory.
renv environment. I ran make and got the "progress_bar" error. I thought the error might have something to do with a couple of the future_map .progress arguments I had in the code (even though I had them set to FALSE), so I removed them. Of course, that had nothing to do with it. I read the issue; installed the progress package (and maybe some other package I forget); and it ran, but that triggered a rebuild of targets unfortunately.
make, and the first 2 n = 2000 subtargets get built. The 3rd subtarget fails.
make with some new arguments to try and deal with the fail, but I'm pretty sure I haven't changed any of the other code in the functions or plan (update: I've checked commits local and remote, and I haven't changed anything else). Now, when I try and restart though, it tries to re-build the first subtarget. I think I tried to run it with the previous make arguments and it still tried to re-build the first target.
makeClusterPSOCK, added another setting in a saved PuTTY session, and the IPs have changed of course, but I wouldn't think any of that would affect what's going on with the subtargets.
Here's the new make with the added arguments:
make(
plan,
verbose = 1,
session_info = FALSE,
retries = 2,
lock_envir = FALSE,
history = FALSE,
log_progress = FALSE,
jobs_preprocess = 7
)
and a pic of that last fail
https://www.dropbox.com/s/ryx92zaq3vlqgxx/Screenshot%20%2853%29.png?dl=0
Need to build a couple charts for another project, but I'll look into r_make either later tonight or tomorrow morning.
current session info
- Session info ------------------------------------------------------------------------------------------
setting value
version R version 3.6.2 (2019-12-12)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/New_York
date 2020-04-30
- Packages ----------------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
backports 1.1.6 2020-04-05 [1] CRAN (R 3.6.3)
base64url 1.4 2018-05-14 [1] CRAN (R 3.6.3)
cli 2.0.1 2020-01-08 [1] CRAN (R 3.6.2)
clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.2)
codetools 0.2-16 2018-12-24 [1] CRAN (R 3.6.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
data.table * 1.12.8 2019-12-09 [1] CRAN (R 3.6.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
details * 0.2.1 2020-01-12 [1] CRAN (R 3.6.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
dplyr * 0.8.4 2020-01-31 [1] CRAN (R 3.6.2)
drake * 7.12.0.9000 2020-04-29 [1] Github (ropensci/drake@935f95a)
dtplyr * 1.0.1 2020-01-23 [1] CRAN (R 3.6.2)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
filelock 1.0.2 2018-10-05 [1] CRAN (R 3.6.3)
furrr * 0.1.0 2018-05-16 [1] CRAN (R 3.6.1)
future * 1.16.0 2020-01-16 [1] CRAN (R 3.6.2)
globals 0.12.5 2019-12-07 [1] CRAN (R 3.6.1)
glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.2)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
igraph 1.2.5 2020-03-19 [1] CRAN (R 3.6.3)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.2)
knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.2)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.2)
packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.1)
pacman 0.5.1 2019-03-11 [1] CRAN (R 3.6.2)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
png 0.1-7 2013-12-03 [1] CRAN (R 3.6.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
progress 1.2.2 2019-05-16 [1] CRAN (R 3.6.1)
purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.2)
renv 0.9.3-30 2020-02-22 [1] Github (rstudio/renv@916923a)
reticulate 1.14 2019-12-17 [1] CRAN (R 3.6.2)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.1)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
storr 1.2.1 2018-10-18 [1] CRAN (R 3.6.3)
tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.2)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.2)
txtq 0.2.0 2019-10-15 [1] CRAN (R 3.6.3)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.1)
xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.1)
[1] C:/Users/tbats/Documents/R/Projects/nested-cross-validation-comparison/renv/library/R-3.6/x86_64-w64-mingw32
[2] C:/Users/tbats/AppData/Local/Temp/RtmpGMIymj/renv-system-library
Kudos on persevering to this point. Changes to make() arguments should not invalidate targets except for seed and format. Unfortunately, I cannot replicate how you interacted with your project in 1-4, which is why reprexes on downsized examples are helpful.
You can use the deps_profile() function to see what drake thinks about the state of dependencies. If it thinks at least one upstream function or target changed since last time, it will tell you. Here is a reprex to demonstrate.
library(drake)
f <- function(x) {
x + 1
}
plan <- drake_plan(x = 1, y = f(x))
make(plan)
#> ▶ target x
#> ▶ target y
deps_profile(y, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "7974fa383d985540" "7974fa383d985540"
#> 2 depend FALSE "4ce8d6b05dc7e1f9" "4ce8d6b05dc7e1f9"
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "965938315" "965938315"
# Change a function.
f <- function(x) {
x + 2
}
# Register the change in the cache.
make(plan, skip_targets = TRUE)
# The `changed` column is now `TRUE` in the `depend` row.
deps_profile(y, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "7974fa383d985540" "7974fa383d985540"
#> 2 depend TRUE "4ce8d6b05dc7e1f9" "cd6bb9a6ec1eea7a"
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "965938315" "965938315"
Created on 2020-05-01 by the reprex package (v0.3.0)
By the way, I see you are using custom PSOCK clusters for parallel computing. drake has built-in high-performance computing, which may be more convenient and could give you more parallel efficiency in some cases: https://books.ropensci.org/drake/hpc.html. Since you are running Linux, it is straightforward to use clustermq's multicore backend. And if you have a computing cluster with a resource manager, even better. Sketch:
library(drake)
# install.packages("clustermq")
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 4)
Reprex doesn't find anything: files or functions (unless I package::function). I've tried inside the project environment and outside of it with the working directory set to the project directory. Don't understand it. It's worked fine before. Doesn't find the plan-kj.R file with source either.
library(drake)
error_FUN <- function(y_obs, y_hat){
y_obs <- unlist(y_obs)
y_hat <- unlist(y_hat)
Metrics::mae(y_obs, y_hat)
}
method <- "kj"
algorithms <- list("glmnet", "rf")
repeats <- seq(1:5)
grid_size <- 100
plan <- drake_plan(
# model functions for each algorithm
mod_FUN_list = create_models(algorithms),
# data used to estimate out-of-sample error
# noise_sd, seed settings are the defaults
large_dat = mlbench_data(n = 10^5,
noise_sd = 1,
seed = 2019),
# sample size = 100
sim_dat_100 = mlbench_data(100),
# hyperparameter grids for each algorithm
# This probably doesn't need to be a "dynamic" target since mtry is only concerned about the number of columns in data (see script), but I'll do it anyways
params_list_100 = create_grids(sim_dat_100,
algorithms,
size = grid_size),
# create a separate ncv data object for each repeat value
ncv_dat_100 = create_ncv_objects(sim_dat_100,
repeats,
method),
# runs nested-cv and compares ncv error with out-of-sample error
# outputs: ncv error, oos error, delta error, chosen algorithm, chosen hyperparameters
ncv_results_100 = target(
run_ncv(ncv_dat_100,
sim_dat_100,
large_dat,
mod_FUN_list,
params_list_100,
error_FUN,
method),
dynamic = map(ncv_dat_100)
),
# add index columns to identify the results according to sample size and number of repeats
perf_results_100 = tibble(n = 100, repeats = repeats) %>%
bind_cols(ncv_results_100),
# repeat for the rest of the sample sizes
# sample size = 800
sim_dat_800 = mlbench_data(800),
params_list_800 = create_grids(sim_dat_800,
algorithms,
size = grid_size),
ncv_dat_800 = create_ncv_objects(sim_dat_800,
repeats,
method),
ncv_results_800 = target(
run_ncv(ncv_dat_800,
sim_dat_800,
large_dat,
mod_FUN_list,
params_list_800,
error_FUN,
method),
dynamic = map(ncv_dat_800)
),
perf_results_800 = tibble(n = 800, repeats = repeats) %>%
bind_cols(ncv_results_800),
# sample size = 2000
sim_dat_2000 = mlbench_data(2000),
params_list_2000 = create_grids(sim_dat_2000,
algorithms,
size = grid_size),
ncv_dat_2000 = create_ncv_objects(sim_dat_2000,
repeats,
method),
ncv_results_2000 = target(
run_ncv(ncv_dat_2000,
sim_dat_2000,
large_dat,
mod_FUN_list,
params_list_2000,
error_FUN,
method),
dynamic = map(ncv_dat_2000)
),
perf_results_2000 = tibble(n = 2000, repeats = repeats) %>%
bind_cols(ncv_results_2000)
)
drake::deps_profile(ncv_results_2000, plan)
#> Error in deps_profile_impl(target = ncv_results_2000, config = config): no recorded metadata for target ncv_results_2000.
rlang::last_error()
#> Error: Can't show last error because no error was recorded yet
rlang::last_trace()
#> Error: Can't show last error because no error was recorded yet
deps_profile(ncv_results_2000_7d80d14d, plan)
#> Error in deps_profile_impl(target = ncv_results_2000_7d80d14d, config = config): no recorded metadata for target ncv_results_2000_7d80d14d.
rlang::last_error()
#> Error: Can't show last error because no error was recorded yet
rlang::last_trace()
#> Error: Can't show last error because no error was recorded yet
drake::loadd(ncv_results_2000_7d80d14d)
#> Error in loadd_handle_empty_targets(targets = targets, cache = cache, : object 'ncv_results_2000_7d80d14d' not found
Created on 2020-05-02 by the reprex package (v0.3.0)
Here's what happens when I execute the script myself (starting after plan). Apologies for the readability.
drake::deps_profile(ncv_results_2000, plan)
# Error: Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# In old_values != new_values :
# longer object length is not a multiple of shorter object length
rlang::last_error()
# <error/tibble_error_incompatible_size>
# Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
# 1. drake::deps_profile(ncv_results_2000, plan)
# 4. drake::deps_profile_impl(target = ncv_results_2000, config = config)
# 5. drake:::weak_tibble(...)
# 6. tibble::tibble(...)
# 7. tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
# Run `rlang::last_trace()` to see the full context.
rlang::last_trace()
# <error/tibble_error_incompatible_size>
# Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
# x
# 1. \-drake::deps_profile(ncv_results_2000, plan)
# 2. +-base::eval(call)
# 3. | \-base::eval(call)
# 4. \-drake::deps_profile_impl(target = ncv_results_2000, config = config)
# 5. \-drake:::weak_tibble(...)
# 6. \-tibble::tibble(...)
# 7. \-tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. \-tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
deps_profile(ncv_results_2000_7d80d14d, plan)
# Error: Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# In old_values != new_values :
# longer object length is not a multiple of shorter object length
rlang::last_error()
# <error/tibble_error_incompatible_size>
# Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
# 1. drake::deps_profile(ncv_results_2000_7d80d14d, plan)
# 4. drake::deps_profile_impl(...)
# 5. drake:::weak_tibble(...)
# 6. tibble::tibble(...)
# 7. tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
# Run `rlang::last_trace()` to see the full context.
rlang::last_trace()
# <error/tibble_error_incompatible_size>
# Tibble columns must have compatible sizes.
# * Size 5: Existing data.
# * Size 2: Column `old`.
# i Only values of size one are recycled.
# Backtrace:
# x
# 1. \-drake::deps_profile(ncv_results_2000_7d80d14d, plan)
# 2. +-base::eval(call)
# 3. | \-base::eval(call)
# 4. \-drake::deps_profile_impl(...)
# 5. \-drake:::weak_tibble(...)
# 6. \-tibble::tibble(...)
# 7. \-tibble:::tibble_quos(xs[!is_null], .rows, .name_repair)
# 8. \-tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
drake::loadd(ncv_results_2000_7d80d14d)
# ncv_results_2000_7d80d14d
# # A tibble: 1 x 7
# method oos_error ncv_error delta_error chosen_algorithm mtry trees
# <chr> <dbl> <dbl> <dbl> <chr> <int> <int>
# 1 kj 1.39 1.36 0.0214 rf 5 1325
drake::loadd(ncv_results_2000)
# Error in loadd_handle_empty_targets(targets = targets, cache = cache, :
# object 'ncv_results_2000' not found
Just fyi, my .drake file is a couple layers above my make and plan scripts. Wonder if that's what's messing with the reprex function.
> Reprex doesn't find anything: files or functions (unless I package::function). I've tried inside the project environment and outside of it with the working directory set to the project directory. Don't understand it. It's worked fine before. Doesn't find the plan-kj.R file with source either.
It may be frustrating, yes, but it would really help to identify and fix the problem if it is possible to whittle down your project into something that fits into reprex and is easier to understand and run.
> Just fyi, my .drake file is a couple layers above my make and plan scripts. Wonder if that's what's messing with the reprex function.
Part of what I am requesting is a script that creates an entirely new .drake cache that recreates the problem. Once we can reproduce it from scratch using automated code, we have a much better shot of figuring out what is going on. Otherwise, it is difficult to speculate on the more human-related ad hoc steps you may have taken to set up your project.
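One possible shape for such a script, as a rough sketch rather than a definitive recipe. It builds a fresh cache in a temporary folder, forces the third sub-target to fail on the first make(), and logs every sub-target run to a file. The names step, flag_file, and log_file are invented for the demo, and which sub-targets rerun on the second make() depends on the drake version installed.

```r
library(drake)

cache <- new_cache(path = tempfile())  # brand new cache, built from scratch
flag_file <- tempfile()                # while this file exists, step 3 fails
log_file <- tempfile()                 # one line per sub-target run
file.create(flag_file)

# Hypothetical stand-in for an expensive step that sometimes fails.
step <- function(num, flag_file, log_file) {
  cat(num, file = log_file, sep = "\n", append = TRUE)
  if (num == 3 && file.exists(flag_file)) {
    stop("simulated failure")
  }
  num
}

plan <- drake_plan(
  numbers = seq_len(3),
  result = target(
    step(numbers, flag_file, log_file),
    dynamic = map(numbers)
  )
)

# First run: the third sub-target errors out, so the parent never finalizes.
try(make(plan, cache = cache), silent = TRUE)

# Remove the cause of the failure, change nothing else, and run again.
file.remove(flag_file)
make(plan, cache = cache)

# How many times did each sub-target actually run across both make()s?
runs <- readLines(log_file)
print(table(runs))
```

If sub-targets 1 and 2 show up twice in the table, they were rebuilt on the second run, and the script itself is the reprex to post.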
Yeah, I agree. Unfortunately, it's a complicated project that takes days to run on my desktop even for the small datasets. I guess I'll try to run the whole thing with very, very small datasets to get the runtime down to something manageable, and see if I can recreate the issue once I get reprex to work. In terms of reducing the complexity of the code, I'm not sure how that can be done here. I'll talk to my duck, though, and see if we can figure out something. :)
Any ideas about what's happening with deps_profile or are the error messages not much help without the reprex?
> Any ideas about what's happening with deps_profile or are the error messages not much help without the reprex?
Again, based on what we have to go on right now, I am not sure. But I do have a suspicion, and I just pushed a patch to try to deal with it: https://github.com/ropensci/drake/commit/50eb9f5d056a1ef512cc341e84a6714806d43cbb. You might have better luck if you install the update (though I cannot make guarantees).
The good news is that I've figured out how reprex determines its working directory. You have to specify the outfile argument, or else it works out of a temp directory. Guess I've only used it for simple situations in the past, because I didn't remember that that's how it worked. The bad news is that, if I'm reading this correctly, drake thinks that every target needs to be rebuilt. Still working on a simplified version of the project. I think I interrupted make twice in a row once I saw it was trying to rebuild targets. Wonder if that made things worse.
library(drake)
source("performance-experiment/Kuhn-Johnson/plan-kj.R")
deps_profile(ncv_results_2000, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command TRUE 4f18907a711e6c41 "d958fb47b0b8f88d"
#> 2 depend NA <NA> "aef603d261217ef0"
#> 3 file_in NA <NA> ""
#> 4 file_out NA <NA> ""
#> 5 seed NA <NA> "540153646"
deps_profile(ncv_results_2000_7d80d14d, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command TRUE 4f18907a711e6c41 "ef46db3751d8e999"
#> 2 depend NA <NA> ""
#> 3 file_in NA <NA> ""
#> 4 file_out NA <NA> ""
#> 5 seed FALSE 2136092035 "2136092035"
loadd(ncv_results_2000_7d80d14d)
ncv_results_2000_7d80d14d
#> # A tibble: 1 x 7
#> method oos_error ncv_error delta_error chosen_algorithm mtry trees
#> <chr> <dbl> <dbl> <dbl> <chr> <int> <int>
#> 1 kj 1.39 1.36 0.0214 rf 5 1325
outdated(plan)
#> [1] "large_dat" "mod_FUN_list" "ncv_dat_100"
#> [4] "ncv_dat_2000" "ncv_dat_800" "ncv_results_100"
#> [7] "ncv_results_2000" "ncv_results_800" "params_list_100"
#> [10] "params_list_2000" "params_list_800" "perf_results_100"
#> [13] "perf_results_2000" "perf_results_800" "sim_dat_100"
#> [16] "sim_dat_2000" "sim_dat_800"
deps_profile(ncv_results_800, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "55016b093d194572" "55016b093d194572"
#> 2 depend TRUE "26874d8b848c3d63" "4b6794fb258a5773"
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "1942602105" "1942602105"
deps_profile(ncv_results_100, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command TRUE 4f18907a711e6c41 "ab37a0ecea2b1f23"
#> 2 depend NA <NA> "5bd1d20abf0b48f9"
#> 3 file_in NA <NA> ""
#> 4 file_out NA <NA> ""
#> 5 seed NA <NA> "395977714"
drake_history()
#> Error in parse(text = command): <text>:2:24: unexpected symbol
#> 1: run_ncv(ncv_dat_100, sim_dat_100, large_dat, mod_FUN_list, params_list,
#> 2: error_FUN, method) map
#> ^
Created on 2020-05-05 by the reprex package (v0.3.0)
For everything you showed except ncv_results_800, it looks like you changed the command in the plan. For ncv_results_800, drake thinks at least one of your dependency targets/functions/globals changed at some point, which is trickier to identify. Both these things trigger updates to all the sub-targets.
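As a reading aid, here is a small base R sketch that pulls the changed rows out of a deps_profile()-style table. The two data frames are typed in by hand from the ncv_results_800 and ncv_results_2000 outputs above, and summarize_profile() is a hypothetical helper, not part of drake.

```r
# Values copied by hand from the deps_profile() outputs in this thread.
profile_800 <- data.frame(
  name = c("command", "depend", "file_in", "file_out", "seed"),
  changed = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  old = c("55016b093d194572", "26874d8b848c3d63", "", "", "1942602105"),
  new = c("55016b093d194572", "4b6794fb258a5773", "", "", "1942602105"),
  stringsAsFactors = FALSE
)
profile_2000 <- data.frame(
  name = c("command", "depend", "file_in", "file_out", "seed"),
  changed = c(TRUE, NA, NA, NA, NA),
  old = c("4f18907a711e6c41", NA, NA, NA, NA),
  new = c("d958fb47b0b8f88d", "aef603d261217ef0", "", "", "540153646"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: separate rows flagged as changed from rows where
# the comparison is NA because no old value is shown.
summarize_profile <- function(profile) {
  list(
    changed = profile$name[!is.na(profile$changed) & profile$changed],
    unknown = profile$name[is.na(profile$changed)]
  )
}

summary_800 <- summarize_profile(profile_800)
summary_2000 <- summarize_profile(profile_2000)
print(summary_800)
print(summary_2000)
```

For ncv_results_800 only the depend row is flagged, while for ncv_results_2000 the command row is flagged and the remaining rows cannot be compared.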
These are the first 4 targets in my plan, and the profiles say their dependencies changed. It doesn't make any sense. The first 3 targets' dependencies are constants that wouldn't be changed. The commands for n = 100 and n = 2000 didn't change either. There's too much change here for me not to have remembered it, and it doesn't show up in the commits. So is there anything else to look at or do before I try to recreate this in miniature?
Btw I don't see any way to reduce the complexity of the code, but the plan is to create a public repo with very slimmed down data objects that you can clone. That way it shouldn't take long to run and you'll be able to create the .drake file. This is assuming I can recreate the fail, which I'm not confident will happen since I think the size of the objects might be the cause. Will that work for you? Just to double-check: you looking at my current .drake file will not help, right?
method <- "kj"
algorithms <- list("glmnet", "rf")
repeats <- seq(1:5)
grid_size <- 100
plan <- drake_plan(
# model functions for each algorithm
mod_FUN_list = create_models(algorithms),
# data used to estimate out-of-sample error
# noise_sd, seed settings are the defaults
large_dat = mlbench_data(n = 10^5,
noise_sd = 1,
seed = 2019),
# sample size = 100
sim_dat_100 = mlbench_data(100),
library(drake)
source("performance-experiment/Kuhn-Johnson/plan-kj.R")
vis_drake_graph(plan)

deps_profile(mod_FUN_list, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "896777bfc4467875" "896777bfc4467875"
#> 2 depend TRUE "19a0f5400146eab4" "26cafb820b726b9a"
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "787681411" "787681411"
deps_profile(large_dat, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "f7b4ac0ab068d769" "f7b4ac0ab068d769"
#> 2 depend TRUE "1a4e85c7d355a240" ""
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "1889768483" "1889768483"
deps_profile(sim_dat_100, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "31129d94b2c9b515" "31129d94b2c9b515"
#> 2 depend TRUE "1a4e85c7d355a240" ""
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "90873426" "90873426"
deps_profile(params_list_100, plan)
#> # A tibble: 5 x 4
#> name changed old new
#> <chr> <lgl> <chr> <chr>
#> 1 command FALSE "7cc8ecbd68b25f69" "7cc8ecbd68b25f69"
#> 2 depend TRUE "85798dc734adfacd" "1c4e453d1ffbc8d4"
#> 3 file_in FALSE "" ""
#> 4 file_out FALSE "" ""
#> 5 seed FALSE "1602630628" "1602630628"
Created on 2020-05-10 by the reprex package (v0.3.0)
So is there anything else to look at or do before I try to recreate this in miniature?
When was the last time you built the project from scratch? And how often do you restart your R session? It is best to start make() with a fresh global environment (especially if you're not using the custom envir argument). r_make() guarantees that the global environment is completely fresh and is created in the same way every time you run the pipeline. This is critically important for a project as large and complicated as yours. So a low-overhead thing to try is to put these lines and plan-kj.R into a _drake.R file, run a fresh copy of your project with r_make(), and see if things stay up to date.
Certain hidden circularities can also crop up. For example, the following plan is self-invalidating. I tried to catch most of this with environment locking (lock_envir = TRUE) in drake version 7, which is why the example below has to set lock_envir = FALSE to run at all, but weird stuff could theoretically crop up with environment variables etc.
library(drake)
a <- 1
plan <- drake_plan(
  x = a,
  y = assign("a", x + 1, envir = globalenv())
)
make(plan, lock_envir = FALSE)
#> > target x
#> > target y
make(plan, lock_envir = FALSE)
#> > target x
#> > target y
Created on 2020-05-11 by the reprex package (v0.3.0)
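For comparison, here is a hedged sketch of the same plan run with environment locking switched on. drake documents lock_envir = TRUE as making make() quit in error when a command tries to modify variables in the calling environment, so the call is wrapped in tryCatch() to capture the message. The exact wording of the error is not shown because it may differ across versions.

```r
library(drake)

a <- 1
plan <- drake_plan(
  x = a,
  y = assign("a", x + 1, envir = globalenv())
)

# With locking on, target y is expected to fail instead of silently
# changing `a` and invalidating x on the next make().
msg <- tryCatch(
  {
    make(plan, lock_envir = TRUE)
    NA_character_  # reached only if make() finishes without an error
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

An error here is the useful outcome: it flags the hidden circularity at build time rather than leaving targets that never stay up to date.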
The existing repo has heavy package requirements and the targets still have heavy runtimes. Specifically, ncv_results_2000 is very slow for an example used for debugging purposes. The targets before ncv_results_2000, however, do stay up to date when I try to run things.
Also, I notice you are using custom future multicore processing, which I skipped for the sake of convenience. You might have a look at drake's built-in high-performance computing, which has multicore and cluster-powered capabilities: https://books.ropensci.org/drake/hpc.html.
The ncv_results_2000 target failed at night and I didn't rerun it until the next morning. I'm pretty sure I closed R and started with a fresh session the next morning.
Along with the lines you highlighted for _drake.R, if I want verbose = 1, lock_envir = FALSE, then I need to put those args into a drake_config(), correct? And do I run r_make in the console with _drake.R in the project root directory?
I wouldn't mind trying clustermq, but I'm not sure all customization, template file stuff and SLURM schedulers, etc., applies to my project. As is, I'm only parallelizing within targets. If I knew my targets wouldn't fail, then I'd consider starting 8 instances to parallelize building the targets. Or did I miss something? Is there something that would make the within-target loops run more efficiently?
Along with the lines you highlighted for _drake.R, if I want verbose = 1, lock_envir = FALSE, then I need to put those args into a drake_config(), correct? And do I run r_make in the console with _drake.R in the project root directory?
Yes to both.
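A minimal sketch of what such a _drake.R file could look like. It assumes that plan-kj.R, at the path used earlier in this thread, defines an object named plan; any other setup the real project needs (packages, functions, the future configuration) is left out.

```r
# _drake.R, saved in the project root directory.
library(drake)

# Assumption: this script defines `plan` (path taken from the thread above).
source("performance-experiment/Kuhn-Johnson/plan-kj.R")

# r_make() expects the script to end with a call to drake_config().
drake_config(
  plan,
  verbose = 1,
  lock_envir = FALSE
)

# Then, from the console with the working directory at the project root:
# drake::r_make()
```

r_make() runs this script in a fresh R process each time, so nothing left over in the interactive session can affect which targets count as up to date.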
I wouldn't mind trying clustermq, but I'm not sure all customization, template file stuff and SLURM schedulers, etc., applies to my project. As is, I'm only parallelizing within targets. If I knew my targets wouldn't fail, then I'd consider starting 8 instances to parallelize building the targets. Or did I miss something? Is there something that would make the within-target loops run more efficiently?
clustermq does have a multicore backend for multicore parallelism on non-Windows machines. (For Windows machines, drake does also have its own future backend.) Other than that, there's always a tradeoff between parallelism within targets and parallelism among targets. If you know not very many targets are going to run at the same time but there is a lot to do within each target, then within-target parallelism seems reasonable. But if a lot of targets are conditionally independent and you want some targets to start while others are still running, drake's built-in parallelism can help. This was probably already intuitive to you, but it is the main thing I think about when writing a new pipeline that needs HPC.
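A minimal sketch of among-target parallelism on a single non-Windows machine, assuming the clustermq package is installed. The plan and its targets (a, b, total) are hypothetical placeholders: a and b do not depend on each other, so two workers can build them at the same time.

```r
library(drake)

# Use clustermq's multicore backend instead of a cluster scheduler.
options(clustermq.scheduler = "multicore")

# Hypothetical toy plan: a and b are independent, total needs both.
plan <- drake_plan(
  a = {
    Sys.sleep(1)
    1
  },
  b = {
    Sys.sleep(1)
    2
  },
  total = a + b
)

make(plan, parallelism = "clustermq", jobs = 2)
result <- readd(total)
print(result)
```

In a project driven by r_make(), the same parallelism and jobs arguments would go into the drake_config() call in _drake.R instead of make().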
Reran the slimmed-down version of the project in a separate repo with the same environment, and it pretty much worked like a dream. Did have one small hiccup in the beginning though. I got a config error on my first run when there wasn't a .drake directory yet (see here and here). Restarted R, re-ran r_make, and everything ran fine. Is there an initialization step that I missed?
Had a connection error after some sub-target builds. Re-ran r_make and it picked up where it left off. No rebuilding of already-successfully-built sub-targets. Added an n = 3000 section to the plan, and again, no triggered rebuilds and it finished smoothly.
Next step, I guess, will be to delete the .drake directory in the real repo, create a _drake.R and try again.
I could not reproduce those errors running https://github.com/ercbk/temp-nested-cv. I did get connection errors, but those disappeared when I commented out the custom future configuration you had. Looks like you're setting up drake properly.
What custom future configuration are you talking about? The saved PuTTY session settings?
Okay, I assume there's a way to do the same thing using the methods you describe in the HPC chapter of your book, but I'm not grasping it. Could you show me the code you're using to connect to the instances?
I was just using the repo you linked: https://github.com/ercbk/temp-nested-cv. I could not connect to your instances, so I commented out https://github.com/ercbk/temp-nested-cv/blob/153e428f35a99ab8a70ac95fbc23d9bb4721c2ef/_drake.R#L33-L63 to try to run locally. The pipeline successfully started from a fresh cache (no "argument 'config' missing with no default" errors on my end).
To parallelize among targets using the existing setup you have, you could keep https://github.com/ercbk/temp-nested-cv/blob/153e428f35a99ab8a70ac95fbc23d9bb4721c2ef/_drake.R#L33-L63, disable within-target parallelism, and use the following drake_config():
drake_config(
  plan,
  parallelism = "future",
  jobs = 2, # or however many workers you want
  verbose = 1,
  lock_envir = FALSE,
  jobs_preprocess = 7
)
OH. When you said you got rid of the connection errors, I thought you had some better way to stabilize my connection to compute this remotely. I see.