Drake: Corrupted storr when killing making process

Created on 1 Nov 2017 · 17Comments · Source: ropensci/drake

I'm aware that this may be beyond the scope of drake, but occasionally I get this weird non-reproduce-able bug. In the process of working with my data, sometimes I mistakenly start making a target that I know will fail or be incorrect. This freezes up my R session while I wait for it to resolve. Instead of waiting, I kill the RStudio process, or the Makefile process from htop.

This usually works fine -- in fact, using Makefile parallelism is great because you can see which targets are associated with which processes directly from htop and when you kill a process you don't have to relaunch your RStudio session and lose everything in your global environment.

However, once in a blue moon, killing a process seems to have consequences. You get these targets that seem to be stuck in a half-saved state, and that cause drake functions to fail with the error:

Error in readRDS(self$name_hash(hash)) : error reading from connection

There seems to be no way to recover from this, other than deleting the whole .drake directory and starting from scratch. Even if you know which target was corrupted, clean(target) fails with the same error. I'm having some success with get_cache()$del(target), which makes some things work again but still seems to leave the cache in a semi-corrupted state.

I guess I'm just posting to see if anyone else has had similar experiences / how they fixed them without clearing the whole cache.

Source

dapperjapper

👍1

All 17 comments

@dapperjapper I am really glad you brought this up. I think the top priority is to improve storr, maybe adding something like this:

library(storr)
x <- storr_rds("my_cache")
x$set(key = "a", value = "b")
# break 'a'
x$repair() # remove dangling files for half-saved targets.

Would you also post this on the storr issue tracker? The solution will ultimately come from there. Still, I will keep this drake thread open for a while.

After a failed clean(target), do you get a traceback() that dives into the call stack of storr functions?

I am also wondering how to improve job monitoring. For example, what would it take to label and monitor jobs by target name for all kinds of parallelism, not just Makefile parallelism? For "future_lapply" parallelism, there is probably some way that involves the configuration *tmpl file, but I would also like something similar for other forms of parallelism. Drake has in_progress() and progress(), but that does not provide a way to kill jobs. processx has easier job monitoring, but the scope looks limited to processes launched with processx itself.

wlandau-lilly on 1 Nov 2017

I will post on storr issue tracker! Thanks for quick response!

I don't know a lot about processes/*nix systems in general, so I don't think I can be terribly helpful there. One thing I have been wondering about is make()ing in the background without blocking the R session. It seems like knitting Rmds is able to run without blocking, and I'm wondering if drake could ever get a spot like that. Does the knitting process have it's own separate R session that is maintained by RStudio? Can drake get one?

Here is a traceback:

> clean(premiers_data_raw_sept)

Error in readRDS(self$name_hash(hash)) : error reading from connection

> traceback()

13: readRDS(self$name_hash(hash))
12: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
10: cache$get(target)
9: (function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   })(target = dots[[1L]][[1L]], cache = <environment>)
8: mapply(FUN = function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
7: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
6: is_imported(target = target, cache = cache)
5: (function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   })(target = dots[[1L]][[1L]], cache = <environment>)
4: mapply(FUN = function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
3: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
2: uncache(targets, cache = cache)
1: clean(premiers_data_raw_sept)

dapperjapper on 1 Nov 2017

Semi-related Q: why are you storing object metadata like $type and $imported inside the objects namespace alongside the object itself, instead of peeling them off to a separate storr namespace?

dapperjapper on 1 Nov 2017

For $type and $imported, that was a poor design choice. Basically, $type and $imported detect imported functions. If a function is imported from the user's environment, I want it to depend on any nested imports in the body. For the basic example, storr_rds(".drake")$get("reg1") also has a $depends element that depends on nested objects. Essentially, I put in the default namespace everything that I want to be reproducibly tracked. In hindsight, $type and $imported should not be there. If the value doesn't change, it should not matter if you change a target into an import in your workflow plan.

Fixing that will be ambitious, especially because of back compatibility concerns. A separate preprocessing step will be needed in order to migrate data to the appropriate storr namespaces before a make(). But I think it can be done.

Here, I see that future can casually run stuff in the background. I will probably try this on my own projects, and maybe include it in the documentation, but probably not work it into drake's code base. It certainly is not compatible with drake's future_lapply() backend. processx and svSocket may be worth considering too.

wlandau-lilly on 1 Nov 2017

If we are going to refactor the most sensitive internals, we are most likely not going to have back compatibility. So I want to think long and hard about it, plan and discuss with you guys, and get your support. I probably will not work on it until well after the next CRAN release at least. When I do work on it, I want to share formal design specs and talk about them with the community.

I learned a TON over the course of drake's development, and there is a lot I would do differently, especially more elegantly, if I went back and did it again from scratch. But for sensitive changes to hashing and caching, we need patience and careful planning.

wlandau-lilly on 1 Nov 2017

To be clear, I have not made the decision to refactor anything that deep. I think there needs to be a stronger reason than just the elegance of the code. Back compatibility is more important than the slight oversensitivity due to tracking $type and $imported.

wlandau-lilly on 1 Nov 2017

Definitely -- I agree that it is a big decision. The reason I ask is that the traceback seems to include is_imported(), which seems to be loading the corrupted file before clean()ing it. It seems like overkill to have to load the whole built object just to determine if it is imported or not, and additionally it breaks clean() if the given object is corrupted.

Is this also the reason why plot_graph() is rather slow for me, when working on a project with large targets? Is it loading every target into memory one-by-one to check the metadata?

dapperjapper on 1 Nov 2017

We could put some exception handling around is_imported() in clean(). Better yet, for speed, we could even add a redundant imported namespace and check that instead. To avoid the problem of half-saved objects, we should cache to that namespace as early as possible in a make().

In plot_graph(), there are two bottlenecks:

Building the igraph object that shows dependency relationships. Skip this with a pre-built config.
Importing everything to check which targets are outdated. Skip this with from_scratch = TRUE in development drake.

wlandau-lilly on 1 Nov 2017

But anyway, the immediate way forward on this is definitely a separate imported namespace. I think it is enough to

Cache TRUE or FALSE to the imported namespace just above this line in build_in_hook()
Move that call to store_target() to the very end of build_in_hook(), just before returning the value. This ensures all the small bits of the cache are saved before any really big values.