Drake: Add support for disk.frame

Created on 5 Sep 2019  路  12Comments  路  Source: ropensci/drake

drake has recently added support for fst. It would be great to add disk.frame as a storage medium as well. E.g. make it possible to do the below.

library(drake)
n <- 1e8 # Each target is 1.6 GB in memory.
plan <- drake_plan(
  data_fst = target(
    data.frame(x = runif(n), y = runif(n)),
    format = "disk.frame"
  ),
  data_old = data.frame(x = runif(n), y = runif(n))
)
performance new feature

Most helpful comment

Yeah, let's definitely explore this. disk.frame looks really promising at a glance, I will need to try it out and learn more about it. As we go forward, we might also be able to use this approach to reduce the memory consumption of the split transformation, i.e. drake_plan(x = target(..., transform = split(...))).

All 12 comments

Yeah, let's definitely explore this. disk.frame looks really promising at a glance, I will need to try it out and learn more about it. As we go forward, we might also be able to use this approach to reduce the memory consumption of the split transformation, i.e. drake_plan(x = target(..., transform = split(...))).

After reading the quick start guide and the article on ingesting data, and I think a disk.frame format is definitely possible. Another sketch:

library(nycflights13)
library(dplyr)
library(disk.frame)
library(data.table)
library(drake)

plan <- drake_plan(
  flights.df = target(
    as.disk.frame(
      flights, 
      outdir = file.path(tempdir(), "tmp_flights.df"),
      overwrite = TRUE
    ),
    format = "disk.frame"
  )
)

But there are some big caveats:

  • Let's say the target's return value is already a disk.frame. No matter where it initially lives in storage, drake needs to physically move it to a special subdirectory of the cache. (The cache is usually a hidden .drake/ folder in the working directory where drake::make() is called.) If the disk.frame were to remain external, it would be much safer and more reproducible to track it as an external directory using file_in() and/or file_out(), which is less convenient than target(format = "disk.frame"). If the disk.frame is already on the same drive as the cache, I suspect a simple file.rename() will be quick. Otherwise, the transfer will take a long time and may even bottleneck the pipeline.
  • If the target is not already a disk.frame, it must be coercible to one. I think we can handle data.frames and data.tables, but users who begin with CSV or zip files will probably need to ingest them manually.

Is all this reasonable?

I probably need time to digest this. Apologies if I appear slow when I mull over this. Really appreciate your responsiveness here.

Sure, we can take our time. Let me know if you need clarification.

A benefit of disk.frame over the current split transform function is that you can shardby a variable if you want to keep a dimension of the data all together at the analysis stage. Or is there a way for drake's split to currently do that?

If shareby is similar to group_by(), then split() does not do this. However, I have plans for a dynamic group_map() when it comes time to extend https://github.com/ropensci/drake/issues/685.

Would it work that when you run create a disk.frame inside a plan using as.disk.frame, the out directory is automatically in the .drake folder? So we basically disable outdir for the user when using drake? This will circumvent the copy to .drake issue with disk.frame. I think tighter integration like this will be awesome. Ultimately, for many use-cases, I don't think the user really cares where the data lives, as long as it's there.

Seems like the most expedient approach we have. A couple thoughts:

  1. We just need to bear in mind that the initial out directory is still not going to be the final location, even if it is inside .drake/. The disk.frame directory with the chunks ultimately needs to have a special name (just a hash) so drake can find it again. That said, moving the files from the initial location to the final location should be practically instantaneous because the whole .drake/ folder is on the same drive. All we need to do is rename all the chunks. Still sound reasonable?
  2. In the docs, maybe we should advise users to consider running frequent garbage collection on the cache. New disk.frames could be created rather often, and drake keeps them all for the sake of history and data recovery. drake_gc() removes all but the most current targets.

By the way, do you have ideas on how we disable outdir automatically? In as.disk.frame(), the default outdir is tempfile(fileext = ".df"). It may be simpler to require something like as.disk.frame(outdir = drake_cache_tempfile()) or as.disk.frame(outdir = drake_cache_staging_area()).

Possible option is to have a disk.frame.temp.file() function and a global options disk.frame.use.drake = T then the function returns the drake_cache_tempfile()? This would annoying. Would be cool for plan to capture as.disk.frame() and just use meta-programming to overwrite the directory? That may be too intrusive. Firstly, I need to learn more about drake

Automatic metaprogramming is achievable if the affected commands in the plan contain as.disk.frame(), but things get tricky if people start using package or functions that return disk.frames and there is nothing in the plan to modify. And then there are cases when people want to define their own file_out() directories and control where disk.frames are saved. So I am not so sure. I hesitate to overfit the use cases we can imagine right now.

Personally, I am more concerned about clarity than automation. As long as people know how to use the disk.frame format correctly, a little extra work does not seem like a big deal. Correct me if I am wrong, but disk.frame seems like a tool where people are willing to think a little about file paths anyway.

To be clear, the only consequence of misspecifying outdir is the extra runtime it might take to copy files from one drive to another. For some users, tempdir() will be on the same drive as .drake/, so there is nothing special to do. Otherwise,drake should still be able to get the job done.

Implemented in #1022.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

psadil picture psadil  路  42Comments

wlandau picture wlandau  路  33Comments

wlandau-lilly picture wlandau-lilly  路  29Comments

wlandau picture wlandau  路  27Comments

wlandau picture wlandau  路  29Comments