Drake: Corrupted storr when killing making process

Created on 1 Nov 2017  Â·  17Comments  Â·  Source: ropensci/drake

I'm aware that this may be beyond the scope of drake, but occasionally I get this weird non-reproduce-able bug. In the process of working with my data, sometimes I mistakenly start making a target that I know will fail or be incorrect. This freezes up my R session while I wait for it to resolve. Instead of waiting, I kill the RStudio process, or the Makefile process from htop.

This usually works fine -- in fact, using Makefile parallelism is great because you can see which targets are associated with which processes directly from htop and when you kill a process you don't have to relaunch your RStudio session and lose everything in your global environment.

However, once in a blue moon, killing a process seems to have consequences. You get these targets that seem to be stuck in a half-saved state, and that cause drake functions to fail with the error:

Error in readRDS(self$name_hash(hash)) : error reading from connection

There seems to be no way to recover from this, other than deleting the whole .drake directory and starting from scratch. Even if you know which target was corrupted, clean(target) fails with the same error. I'm having some success with get_cache()$del(target), which makes some things work again but still seems to leave the cache in a semi-corrupted state.

I guess I'm just posting to see if anyone else has had similar experiences / how they fixed them without clearing the whole cache.

All 17 comments

@dapperjapper I am really glad you brought this up. I think the top priority is to improve storr, maybe adding something like this:

library(storr)
x <- storr_rds("my_cache")
x$set(key = "a", value = "b")
# break 'a'
x$repair() # remove dangling files for half-saved targets.

Would you also post this on the storr issue tracker? The solution will ultimately come from there. Still, I will keep this drake thread open for a while.

After a failed clean(target), do you get a traceback() that dives into the call stack of storr functions?

I am also wondering how to improve job monitoring. For example, what would it take to label and monitor jobs by target name for all kinds of parallelism, not just Makefile parallelism? For "future_lapply" parallelism, there is probably some way that involves the configuration *tmpl file, but I would also like something similar for other forms of parallelism. Drake has in_progress() and progress(), but that does not provide a way to kill jobs. processx has easier job monitoring, but the scope looks limited to processes launched with processx itself.

I will post on storr issue tracker! Thanks for quick response!

I don't know a lot about processes/*nix systems in general, so I don't think I can be terribly helpful there. One thing I have been wondering about is make()ing in the background without blocking the R session. It seems like knitting Rmds is able to run without blocking, and I'm wondering if drake could ever get a spot like that. Does the knitting process have it's own separate R session that is maintained by RStudio? Can drake get one?

Here is a traceback:

> clean(premiers_data_raw_sept)

Error in readRDS(self$name_hash(hash)) : error reading from connection

> traceback()

13: readRDS(self$name_hash(hash))
12: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
10: cache$get(target)
9: (function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   })(target = dots[[1L]][[1L]], cache = <environment>)
8: mapply(FUN = function (target, cache) 
   {
       if (!(target %in% cache$list())) 
           return(FALSE)
       cache$get(target)$imported
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
7: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
6: is_imported(target = target, cache = cache)
5: (function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   })(target = dots[[1L]][[1L]], cache = <environment>)
4: mapply(FUN = function (target, cache) 
   {
       if (is.null(cache)) {
           return()
       }
       if (is_file(target) & !is_imported(target = target, cache = cache)) {
           unquote(target) %>% unlink(recursive = TRUE, force = TRUE)
       }
       default <- cache$default_namespace
       for (space in c(default, "depends", "filemtime", "functions")) if (target %in% 
           cache$list(namespace = space)) {
           cache$del(target, namespace = space)
       }
       invisible()
   }, target = "premiers_data_raw_sept", MoreArgs = list(cache = <environment>), 
       SIMPLIFY = TRUE, USE.NAMES = TRUE)
3: do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), 
       SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
2: uncache(targets, cache = cache)
1: clean(premiers_data_raw_sept)

Semi-related Q: why are you storing object metadata like $type and $imported inside the objects namespace alongside the object itself, instead of peeling them off to a separate storr namespace?

For $type and $imported, that was a poor design choice. Basically, $type and $imported detect imported functions. If a function is imported from the user's environment, I want it to depend on any nested imports in the body. For the basic example, storr_rds(".drake")$get("reg1") also has a $depends element that depends on nested objects. Essentially, I put in the default namespace everything that I want to be reproducibly tracked. In hindsight, $type and $imported should not be there. If the value doesn't change, it should not matter if you change a target into an import in your workflow plan.

Fixing that will be ambitious, especially because of back compatibility concerns. A separate preprocessing step will be needed in order to migrate data to the appropriate storr namespaces before a make(). But I think it can be done.

Here, I see that future can casually run stuff in the background. I will probably try this on my own projects, and maybe include it in the documentation, but probably not work it into drake's code base. It certainly is not compatible with drake's future_lapply() backend. processx and svSocket may be worth considering too.

If we are going to refactor the most sensitive internals, we are most likely not going to have back compatibility. So I want to think long and hard about it, plan and discuss with you guys, and get your support. I probably will not work on it until well after the next CRAN release at least. When I do work on it, I want to share formal design specs and talk about them with the community.

I learned a TON over the course of drake's development, and there is a lot I would do differently, especially more elegantly, if I went back and did it again from scratch. But for sensitive changes to hashing and caching, we need patience and careful planning.

To be clear, I have not made the decision to refactor anything that deep. I think there needs to be a stronger reason than just the elegance of the code. Back compatibility is more important than the slight oversensitivity due to tracking $type and $imported.

Definitely -- I agree that it is a big decision. The reason I ask is that the traceback seems to include is_imported(), which seems to be loading the corrupted file before clean()ing it. It seems like overkill to have to load the whole built object just to determine if it is imported or not, and additionally it breaks clean() if the given object is corrupted.

Is this also the reason why plot_graph() is rather slow for me, when working on a project with large targets? Is it loading every target into memory one-by-one to check the metadata?

We could put some exception handling around is_imported() in clean(). Better yet, for speed, we could even add a redundant imported namespace and check that instead. To avoid the problem of half-saved objects, we should cache to that namespace as early as possible in a make().

In plot_graph(), there are two bottlenecks:

  1. Building the igraph object that shows dependency relationships. Skip this with a pre-built config.
  2. Importing everything to check which targets are outdated. Skip this with from_scratch = TRUE in development drake.

But anyway, the immediate way forward on this is definitely a separate imported namespace. I think it is enough to

  1. Cache TRUE or FALSE to the imported namespace just above this line in build_in_hook()
  2. Move that call to store_target() to the very end of build_in_hook(), just before returning the value. This ensures all the small bits of the cache are saved before any really big values.

@dapperjapper I think I fixed it in 661e59f627d3a4ea453025234b1da35e4b372f3a. Would you try again?

By the way, I think clean() might run faster now.

Is it possible to cache a redundant version of $depends in a separate
namespace, so that assembling the igraph works faster? I can also look at
this tomorrow..
On Wed, Nov 1, 2017 at 11:13 PM Will Landau notifications@github.com
wrote:

By the way, I think clean() runs much faster now.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/wlandau-lilly/drake/issues/126#issuecomment-341305903,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPCB18FRv67a-iGs5xCVyTG0CEvWF1Rks5syTNkgaJpZM4QOnHU
.

Actually, building the igraph does not touch the cache at all. See calls to lightly_parallelize() for my attempt to make it faster, along with the thread for #121, particularly here and the follow-up.

Hmmm. I actually think I won't truly fix this one until I solve #129 or richfitz/storr#55 is fixed I thought I had fixed clean() but, but the trouble is that old projects do not have an imported namespace. We will have to have to migrate the cache of an old project to a new format before doing anything else. However, in new projects, clean() should work just fine.

FYI: 661e59f627d3a4ea453025234b1da35e4b372f3a created a back-compatibility bug in clean() that I fixed with 99b7db2b760f5df635f73c022e849369647926e8.

I also had the OP's error and it magically went away - likely after I inadvertently upgraded for a different new feature.

Very helpful to know. I think I will implement something like @fmichonneau's suggestion for storr over all the namespaces and then close.

Fixed along with the solution to #129. See drake::rescue_cache().

Was this page helpful?
0 / 5 - 0 ratings

Related issues

pat-s picture pat-s  Â·  5Comments

wlandau picture wlandau  Â·  4Comments

wlandau picture wlandau  Â·  4Comments

bart1 picture bart1  Â·  7Comments

htlin picture htlin  Â·  4Comments