Drake: Does drake run autoclean in between dynamic subtargets?

Created on 22 Jun 2020 · 19 comments · Source: ropensci/drake

Prework

[x] Read and abide by drake's code of conduct.
[x] Search for duplicates among the existing issues, both open and closed.
[x] If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Question

I had an HPC job that failed due to memory after 15ish subtargets that all had identical memory requirements. Does autoclean run between subtargets? If not, could we add an autoclean_subtarget option that does run between subtargets?

question

Source

kendonB

All 19 comments

It should. manage_deps.autoclean() calls discard_targets():

https://github.com/ropensci/drake/blob/7bb9b5163c0fd624183c4e7b1115057c235e82ff/R/manage_memory.R#L55

which calls discard_dynamic():

https://github.com/ropensci/drake/blob/7bb9b5163c0fd624183c4e7b1115057c235e82ff/R/manage_memory.R#L100-L116

But I do see that this may not account for dynamic sub-targets after all.

https://github.com/ropensci/drake/blob/7bb9b5163c0fd624183c4e7b1115057c235e82ff/R/manage_memory.R#L51
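To make the dispatch concrete: the manage_memory() code quoted in this thread sets class(target) <- memory_strategy before calling manage_deps(), so the memory strategy decides which manage_deps method runs. Below is a toy sketch of that S3 pattern only. The method bodies are hypothetical stand-ins that return a description instead of touching memory; they are not drake's code.

```r
# Generic: dispatches on the class attached to `target`.
manage_deps <- function(target, ...) UseMethod("manage_deps")

# Hypothetical stand-ins for the real methods; they only report what would happen.
manage_deps.speed <- function(target, ...) "keep loaded targets in memory"
manage_deps.autoclean <- function(target, ...) "discard targets no longer needed"

target <- "y_0b3474bd"
class(target) <- "autoclean"  # mirrors class(target) <- memory_strategy
manage_deps(target)           # runs manage_deps.autoclean()
```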

wlandau on 22 Jun 2020

As a refresher, I ran a little experiment: at the bottom of manage_memory(), I added some print statements to show which targets and sub-targets are still in memory after memory management.

manage_memory <- function(target, config, downstream = NULL, jobs = 1) {
  stopifnot(length(target) == 1L)
  memory_strategy <- config$spec[[target]]$memory_strategy
  if (is.null(memory_strategy) || is.na(memory_strategy)) {
    memory_strategy <- config$settings$memory_strategy
  }
  class(target) <- memory_strategy
  if (!is_subtarget(target, config)) {
    clear_envir_subtargets(target = target, config = config)
  }
  manage_deps(
    target = target,
    config = config,
    downstream = downstream,
    jobs = jobs
  )
  sync_envir_dynamic(target, config)
  if (config$settings$garbage_collection) {
    gc()
  }

  print(paste(c("target:", target), collapse = " "))
  print(paste(c("loaded targets:", names(config$envir_targets)), collapse = " "))
  print(paste(c("loaded subtargets:", names(config$envir_subtargets)), collapse = " "))
  print("")

  invisible()
}

It looks like autoclean is unloading sub-targets correctly.

library(drake)
plan <- drake_plan(
  x = 1:2,
  y = target(x, dynamic = map(x)),
  z = target(y, dynamic = map(y)),
  w = target(z, dynamic = map(z))
)

make(plan)
#> ▶ target x
#> [1] "target: x"
#> [1] "loaded targets:"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic y
#> > subtarget y_0b3474bd
#> [1] "target: y_0b3474bd"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets: x"
#> [1] ""
#> > subtarget y_b2a5c9b8
#> [1] "target: y_b2a5c9b8"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets: x"
#> [1] ""
#> ■ finalize y
#> [1] "target: y"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic z
#> > subtarget z_0b3474bd
#> [1] "target: z_0b3474bd"
#> [1] "loaded targets: x y y_0b3474bd"
#> [1] "loaded subtargets: y"
#> [1] ""
#> > subtarget z_b2a5c9b8
#> [1] "target: z_b2a5c9b8"
#> [1] "loaded targets: x y y_b2a5c9b8 y_0b3474bd"
#> [1] "loaded subtargets: y"
#> [1] ""
#> ■ finalize z
#> [1] "target: z"
#> [1] "loaded targets: x y y_b2a5c9b8 y_0b3474bd"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic w
#> > subtarget w_0b3474bd
#> [1] "target: w_0b3474bd"
#> [1] "loaded targets: z_0b3474bd x y z y_b2a5c9b8 y_0b3474bd"
#> [1] "loaded subtargets: z"
#> [1] ""
#> > subtarget w_b2a5c9b8
#> [1] "target: w_b2a5c9b8"
#> [1] "loaded targets: z_0b3474bd x y z z_b2a5c9b8 y_b2a5c9b8 y_0b3474bd"
#> [1] "loaded subtargets: z"
#> [1] ""
#> ■ finalize w
#> [1] "target: w"
#> [1] "loaded targets: z_0b3474bd x y z z_b2a5c9b8 y_b2a5c9b8 y_0b3474bd"
#> [1] "loaded subtargets:"
#> [1] ""

clean()

make(plan, memory_strategy = "autoclean")
#> ▶ target x
#> [1] "target: x"
#> [1] "loaded targets:"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic y
#> > subtarget y_0b3474bd
#> [1] "target: y_0b3474bd"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets: x"
#> [1] ""
#> > subtarget y_b2a5c9b8
#> [1] "target: y_b2a5c9b8"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets: x"
#> [1] ""
#> ■ finalize y
#> [1] "target: y"
#> [1] "loaded targets: x"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic z
#> > subtarget z_0b3474bd
#> [1] "target: z_0b3474bd"
#> [1] "loaded targets: y_0b3474bd"
#> [1] "loaded subtargets: y"
#> [1] ""
#> > subtarget z_b2a5c9b8
#> [1] "target: z_b2a5c9b8"
#> [1] "loaded targets: y_b2a5c9b8"
#> [1] "loaded subtargets: y"
#> [1] ""
#> ■ finalize z
#> [1] "target: z"
#> [1] "loaded targets: y"
#> [1] "loaded subtargets:"
#> [1] ""
#> ▶ dynamic w
#> > subtarget w_0b3474bd
#> [1] "target: w_0b3474bd"
#> [1] "loaded targets: z_0b3474bd"
#> [1] "loaded subtargets: z"
#> [1] ""
#> > subtarget w_b2a5c9b8
#> [1] "target: w_b2a5c9b8"
#> [1] "loaded targets: z_b2a5c9b8"
#> [1] "loaded subtargets: z"
#> [1] ""
#> ■ finalize w
#> [1] "target: w"
#> [1] "loaded targets: z"
#> [1] "loaded subtargets:"
#> [1] ""

Created on 2020-06-22 by the reprex package (v0.3.0)

wlandau on 22 Jun 2020

I ran the code below and never saw the debugger open (sorry, we posted at the same time). What did I do wrong?

library(broom)
library(drake)
library(gapminder)
library(tidyverse)

# Split the Gapminder data by continent.
gapminder_continents <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$continent)
}

# Fit a model to a continent.
fit_model <- function(continent_data) {
  data <- continent_data[[1]]
  data %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(continent = data$continent[1]) %>%
    dplyr::select(continent, term, statistic, p.value)
}

plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)

clean(model)
debug(drake:::discard_dynamic)
make(plan, memory_strategy = "autoclean")
#> ▶ target continents
#> ▶ dynamic model
#> > subtarget model_c56e5407
#> > subtarget model_706a1529
#> > subtarget model_da843806
#> > subtarget model_862f8003
#> > subtarget model_ebb41f51
#> ■ finalize model

Created on 2020-06-23 by the reprex package (v0.3.0)

kendonB on 22 Jun 2020

#> > subtarget z_0b3474bd
#> [1] "target: z_0b3474bd"
#> [1] "loaded targets: x y y_0b3474bd"
#> [1] "loaded subtargets: y"
#> [1] ""

I'm not sure I understand this output. Shouldn't only y_0b3474bd be loaded, and not x, y, or the second y?

kendonB on 22 Jun 2020

Objects in the sub-target environment get the names of their parents. So "loaded subtargets: x" really just refers to the slice of x that y_0b3474bd needs. But x, the static target, also needs to be in memory under "loaded targets" because x is an irreducible static target.
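A toy model of that naming scheme, assuming two plain environments standing in for config$envir_targets and config$envir_subtargets (build_y_subtarget() is a hypothetical helper, not a drake internal). The slice of x is stored under the parent's name, so the command of y can simply say x, while the full static x stays loaded separately.

```r
# Toy stand-ins for config$envir_targets and config$envir_subtargets.
envir_targets <- new.env()
envir_subtargets <- new.env()

# The whole static target x stays loaded ("loaded targets: x").
assign("x", 1:2, envir = envir_targets)

build_y_subtarget <- function(index) {
  # Load only the slice this sub-target needs, under the parent's name "x".
  assign("x", envir_targets$x[index], envir = envir_subtargets)
  # The command of y is just `x`, evaluated against the sliced environment.
  value <- eval(quote(x), envir = envir_subtargets)
  # Unload the slice once the sub-target is built.
  rm(list = ls(envir_subtargets), envir = envir_subtargets)
  value
}

build_y_subtarget(2L)
```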

wlandau on 22 Jun 2020

For the purposes of dynamic branching internals, the "y" under "loaded targets" is a vector of hashes, which needs to be loaded to select sub-targets. When the user references whole aggregates of dynamic targets, all the values are aggregated and put in a different environment called config$envir_dynamic. But that only happens if absolutely necessary.
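A minimal sketch of why the hash vector is cheap to keep loaded, assuming a plain list as a stand-in for drake's cache (storage, load_subtarget() and aggregate_y() are hypothetical names; the hashes are the ones from the output above). Selecting one sub-target reads one value; only an aggregate touches all of them.

```r
# Toy storage of built sub-target values, keyed by sub-target name.
storage <- list(y_0b3474bd = 1L, y_b2a5c9b8 = 2L)

# In memory, the dynamic target y is only a vector of sub-target hashes.
y <- c("0b3474bd", "b2a5c9b8")

# Select one sub-target by position: reads a single value.
load_subtarget <- function(i) storage[[paste0("y_", y[i])]]

# Aggregate every sub-target: only needed when a command uses y as a whole.
aggregate_y <- function() unlist(lapply(seq_along(y), load_subtarget))

load_subtarget(2)
aggregate_y()
```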

wlandau on 22 Jun 2020

In https://github.com/ropensci/drake/issues/1284#issuecomment-647780508, every model sub-target needs continents, so there is nothing superfluous to discard. If you want to launch into the debugger, you could add dynamic targets downstream of model.
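A toy version of the decision autoclean has to make, assuming a simple target-to-dependencies map (deps and to_discard() are hypothetical, not drake's discard_targets()). It shows why nothing is discardable while model builds, and how a dynamic target downstream of model changes that.

```r
# Hypothetical dependency map: target -> objects its command reads.
deps <- list(model = "continents")

# Discard whatever is loaded but not needed by any remaining target.
to_discard <- function(loaded, downstream, deps) {
  needed <- unique(unlist(deps[downstream], use.names = FALSE))
  setdiff(loaded, needed)
}

# While model's sub-targets build, continents is still needed.
to_discard("continents", downstream = "model", deps = deps)

# With a dynamic target downstream of model, continents becomes discardable.
deps$summary <- "model"
to_discard(c("continents", "model"), downstream = "summary", deps = deps)
```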

wlandau on 22 Jun 2020

I should also point out that I was seeing memory accumulate within a single dynamic target with many subtargets.

kendonB on 22 Jun 2020

The output in https://github.com/ropensci/drake/issues/1284#issuecomment-647784061 is using memory_strategy = "speed", which does not attempt to unload anything.

General reports of memory issues are difficult for me to reproduce, so a reprex would really help.

Also, I wonder if #1257 is related. cc @Plebejer

wlandau on 22 Jun 2020


Sorry, I was reading your code too fast! I will try to reproduce it.

kendonB on 22 Jun 2020

This looks pretty good:

library(drake)
plan <- drake_plan(
  x = 1:6,
  y = target({
    print(pryr::mem_used())
    x * 1:1e+07
  }, dynamic = map(x))
)
clean(y)
make(plan, memory_strategy = "autoclean")
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> ℹ Consider drake::r_make() to improve robustness.
#> ▶ target x
#> ▶ dynamic y
#> > subtarget y_0b3474bd
#> 54.1 MB
#> > subtarget y_b2a5c9b8
#> 54.3 MB
#> > subtarget y_71f311ad
#> 54.3 MB
#> > subtarget y_98cf3c11
#> 54.3 MB
#> > subtarget y_0a86c9cb
#> 54.3 MB
#> > subtarget y_cb15b01f
#> 54.3 MB
#> ■ finalize y

Created on 2020-06-23 by the reprex package (v0.3.0)
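The same check can be sketched without pryr or drake, assuming base R's gc() as the memory probe (column 2 of its matrix is the "(Mb)" figure for memory in use). Each loop iteration plays the role of one sub-target: measure, build a roughly 40 MB integer vector, keep only a small summary, and drop the value, which is roughly what autoclean plus garbage_collection = TRUE aim for. mem_used_mb() and build_all() are hypothetical helpers.

```r
# Memory currently used by R, in MB (sum of the Ncells and Vcells "(Mb)" column).
mem_used_mb <- function() sum(gc()[, 2])

build_all <- function(x) {
  usage <- numeric(length(x))
  lengths_out <- integer(length(x))
  for (i in seq_along(x)) {
    usage[i] <- mem_used_mb()        # measure before building, like the reprex
    value <- x[i] * 1:1e7            # ~40 MB integer vector per "sub-target"
    lengths_out[i] <- length(value)  # keep only a small summary
    rm(value)                        # drop the large value so gc() can reclaim it
  }
  list(usage = usage, lengths = lengths_out)
}

res <- build_all(1:6)
res$usage
```

If the readings climbed by about 40 MB per iteration instead of staying flat, something would be holding on to the built values.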

kendonB on 22 Jun 2020

I'm going to try running pryr::mem_used() at the start of building each subtarget to see whether memory accumulates in the actual example. There are lots of differences in the actual call:

make(myplan, 
     verbose = 1,
     jobs_preprocess = 4,
     packages = c("tidyverse",
                  "drake",
                  "sf",
                  "weatherdata",
                  "mapview",
                  "lubridate",
                  "terra",
                  "progress"),
     log_make = "make_all.log",
     jobs = 100,
     # jobs = 400,
     caching = "worker",
     lazy_load = "promise",
     targets = "vcsn_daily_data",
     memory_strategy = "autoclean",
     parallelism = "clustermq",
     template = list(
       memory = round(20*1024), # MBs
       walltime = 120, # minutes
       log_file = "make.log",
       partition = "large"),
     garbage_collection = TRUE,
     keep_going = TRUE)

kendonB on 23 Jun 2020

So I did see some accumulation:

1.61 GB
Warning: target vcsn_daily_data_b419b3b1 warnings:
  The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
ℹ target vcsn_daily_data_b419b3b1 messages:
  Registered S3 method overwritten by 'pryr':
  method      from
  print.bytes Rcpp
2020-06-23 11:08:43.906330 | eval'd: drake::cmq_buildtargetmetadepsspecconfig_tmpconfig
2020-06-23 11:08:44.086579 | > DO_CALL (0.001s wait)
2.78 GB
✖ fail vcsn_daily_data_e59aa2a3
2020-06-23 11:15:52.284250 | eval'd: drake::cmq_buildtargetmetadepsspecconfig_tmpconfig
2020-06-23 11:15:52.335304 | > DO_CALL (0.002s wait)
2.78 GB
slurmstepd: error: *** JOB 13179027 ON wbn222 CANCELLED AT 2020-06-22T23:16:47 ***

kendonB on 23 Jun 2020

I'm also going to try a minimal example.

kendonB on 23 Jun 2020

The minimal example looks good. I'm going to try moving it closer to the real one.

kendonB on 23 Jun 2020

Was the minimal one also running with the same HPC settings? And is there a possibility that your data structures are carrying stowaway environments along with them, like with ggplot2 objects?
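For reference, this is what a stowaway environment looks like in base R: a formula (like the one stored inside an lm object) remembers the environment it was created in, and serialize() carries everything in that environment along with it. make_formula() is a hypothetical example, not code from this project.

```r
make_formula <- function() {
  big <- numeric(1e6)  # ~8 MB that the caller never asked for
  y ~ x                # the formula records this function's environment
}

fml <- make_formula()
exists("big", envir = environment(fml), inherits = FALSE)  # TRUE
length(serialize(fml, NULL))  # more than 8e6 bytes: big travels with the formula

# Pointing the formula at the global environment drops the stowaway.
environment(fml) <- globalenv()
length(serialize(fml, NULL))  # now small
```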

wlandau on 23 Jun 2020

Same HPC settings and no ggplot or lm objects stowing away environments.

kendonB on 23 Jun 2020

In https://github.com/ropensci/drake/issues/1284#issuecomment-647821699, it looks like the worker was using 1.61 GB, then 2.78 GB, and then 2.78 GB again for three targets. As long as memory is leveling off, autoclean and garbage collection are doing their jobs. I usually expect some accumulation because the dependencies and the return values have to exist in memory simultaneously some of the time. What really concerns me is memory that increases steadily without bound, as in #1257.
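The distinction between leveling off and unbounded growth can be written down as a rough heuristic. grows_without_bound() and its tolerance are hypothetical choices, and three readings are far too few to be conclusive; the first trace uses the numbers quoted above.

```r
# Hypothetical heuristic for a per-target memory trace (in GB):
# "steady growth" means every consecutive increase exceeds `tol`.
grows_without_bound <- function(usage, tol = 0.05) {
  steps <- diff(usage)
  length(steps) > 0 && all(steps > tol)
}

grows_without_bound(c(1.61, 2.78, 2.78))    # FALSE: flat after the second target
grows_without_bound(c(1.6, 2.8, 4.0, 5.2))  # TRUE: climbs at every step
```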

By the way, which custom formats are you using for the large targets? fst's multithreading takes a lot of memory. So if you are using mostly fst-based formats, maybe try fst::threads_fst(1) in prework or in the target commands themselves? I'm not sure whether the same considerations are relevant to qs.

wlandau on 23 Jun 2020

I just ran a smaller job that got further than 3 targets, and I saw memory level off there as well. I was using fst_tbl, so memory might spike at that stage in a somewhat random way. I'll see how it goes with a bit more memory headroom, but I'll close this for now. Thanks heaps for the help troubleshooting!

kendonB on 23 Jun 2020

