Drake: What do you mean by 'local files' in the section on dynamic files? Is this also possible for a mounted samba drive?

Created on 23 Mar 2020 · 17Comments · Source: ropensci/drake

Prework

[x] Read and abide by drake's code of conduct.
[x] Search for duplicates among the existing issues, both open and closed.
[x] If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!) ^{I expect this to lead to some discussion}

Question

What do you mean with "Return the paths to local files from the target." in https://github.com/ropensci/drake/pull/1178 and chapter 4.7 of the manual?

Would a mounted samba drive work for this as well?

What I would like to do is:

check whether a remote samba folder is mounted (locally?) or not
use this as the trigger condition for listing files in that folder with a certain pattern
have those files be dynamic inputs to my read function
^{see below for a pseudocode example}

I'm not sure step 3 is required. My main issue now is that connecting to this samba drive is very slow, and even listing the files in it takes ages. I'd like to speed this up somehow by only reading files if they've actually been changed. My read_files_function() is actually isoreader::iso_read_scan(), which has file caching already. So it's not re-reading all the files, only checking if they've been changed or not and then reading them in anew if they have. But since the overhead of connecting to the drive through VPN is the main limiting factor right now, is there any way in which drake could help me pass only those targets to this function that have been newly added or changed?

I hope I was able to explain this clearly enough. Making an actual reprex for this would be quite difficult…

{r} plan <- drake_plan( connected = target(command = dir.exists("/run/user/1000/gvfs/smb-share:server=<address>,share=<share>/"), condition = trigger(condition = TRUE)), # so that it always checks this first files = target(command = list.files(path = "/run/user/1000/gvfs/smb-share:server=<address>,share=<share>/sub/directory/with/these_files/", pattern = "\\.scn$", full.names = TRUE), format = "file", condition = trigger(condition = connected)), raw = read_files_function(files) ... )

question

Source

japhir

All 17 comments

Woah after re-reading the manual the whole section on dynamic targets seems pretty new to me.

Does this mean I could:

specify the individual files as "file" targets
make the target raw dynamic as well, mapped by file

or will that still cause it to check the slow VPN'd samba drive every time I make?

My attempt to specify the files as "file" formats has crashed everything and I've, for now, reverted my code back to only updating when I set condition = trigger(condition = TRUE).

japhir on 23 Mar 2020

I think it is worth experimenting with dynamic files + dynamic branching. For dynamic files that have not changed you will not need to re-run read_files_function(). All those dynamic files still need to be checked to make sure they are up to date, which could take time, but probably less time than actually reading the files.

wlandau on 23 Mar 2020

Here is a sketch.

drake_plan(
  all_file_names = target(
    list_all_smb_files(),
    condition = TRUE
  ),
  tracked_files = target(
    all_file_names,
    format = "file"
    dynamic = map(all_file_names)
  ),
  raw = target(read_files_function(tracked_files), dynamic = map(tracked_files))
)

wlandau on 23 Mar 2020

❤1

Does that help?

wlandau on 23 Mar 2020

Thanks again @wlandau for the fast and in-depth reply! :+1:

I wonder if it will be more optimal, since the other function was already vectorized and using a cache, but now it will (hopefully?) skip calling this function altogether if there is no file change. We'll see, it's currently running!

japhir on 23 Mar 2020

Glad it helps.

I wonder if it will be more optimal, since the other function was already vectorized and using a cache, but now it will (hopefully?) skip calling this function altogether if there is no file change. We'll see, it's currently running!

I am curious to know how long read_files_function(vector_of_files) takes vs for (file in vector_of_files) read_files_function(file) (proffer could help with that).

Also, drake might incur some overhead in raw if you have a lot of dynamic targets because it saves the return values to files (special formats might help). The speed gains will depend on how often only some of the files change. If the files on SMB all change together or not at all, or if the overhead due to drake is too much, you may prefer something resembling your original plan.

drake_plan(
  is_connected = target(is_connected_to_smb(), condition = TRUE),
  tracked_files = target(
    list_all_smb_files(),
    format = "file",
    condition = is_connected
  ),
  raw = target(
    read_files_function(tracked_files),
    condition = is_connected # Use the condition here too.
  )
)

Or even the equivalent plan with the static file interface. But that would only work if you are connected to SMB. Below, the !! operator from tidy evaluation is important.

drake_plan(
  raw = read_files_function(file_in(!!list_all_smb_files()))
)

wlandau on 23 Mar 2020

Sorry, I meant to use file_in() for that last plan. Now edited.

wlandau on 24 Mar 2020

Thanks for the ideas! I'll have a look at which works better with proffer. Normally vectorisation is always faster, but in the case of reading files and performing things with side-effects I'm not so sure.

The files on this drive are almost never updated, only new ones are added continuously. So probably the second approach should be better?

The dynamic approach ran for quite a while but wasn't finished yet when I cut it off. This morning I started it up again, accidentally without the VPN/samba drive connected, resulting in a single file failing. Now, when I re-run themake(plan), I get this error:

Error in if (any(out)) { : missing value where TRUE/FALSE needed

Is there any way to diagnose this further?

japhir on 24 Mar 2020

I am curious to know how long read_files_function(vector_of_files) takes vs for (file in vector_of_files) read_files_function(file) (proffer could help with that).

Here's a quick comparison using microbenchmark. Couldn't get proffer to run on my machine.

Hmm not sure, looks like the difference is rather small and maybe the for loop is actually faster?

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
library(ggplot2)

setwd("~/SurfDrive/PhD/programming/pressure_baseline/")

# make sure to mount the samba drive

# let's check it out for 6 files
files <- readd(files_2019) %>% tail()

vec_read <- function(x = files) {
  isoreader::iso_read_scan(x, cache = FALSE, read_cache = FALSE, quiet = TRUE, parallel = FALSE)
}

for_read <- function(x = files) {
  for (file in x) {
    isoreader::iso_read_scan(file, cache = FALSE, read_cache = FALSE, quiet = TRUE, parallel = FALSE)
  }
}

mbm <- microbenchmark(
  vec_read = vec_read(files),
  for_read = for_read(files),
  times = 10
)

mbm
#> Unit: seconds
#>      expr      min       lq     mean   median       uq      max neval
#>  vec_read 5.774824 6.855198 7.634048 7.383977 8.872879 9.425549    10
#>  for_read 5.996087 6.216740 7.392939 7.280581 7.809035 9.971430    10

autoplot(mbm)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

^{Created on 2020-03-24 by the reprex package (v0.3.0)}

# previous run:
#> Unit: seconds
#>                                                                                                                                    expr
#>                                 isoreader::iso_read_scan(files, cache = FALSE, read_cache = FALSE,      quiet = TRUE, parallel = FALSE)
#>  for (file in files) {     isoreader::iso_read_scan(file, cache = FALSE, read_cache = FALSE,          quiet = TRUE, parallel = FALSE) }
#>       min       lq     mean   median       uq      max neval
#>  7.799658 10.36986 11.87893 12.23589 13.12881 17.19234    10
#>  9.509147 10.78577 11.84759 11.13904 12.09322 15.41604    10

# illustration of proffer not working
library(proffer)
pv <- pprof({isoreader::iso_read_scan(files)})
#> Info: preparing to read 6 data files (all will be cached)...
#> Info: reading file '200313_BG5V.scn' from cache...
#> Info: reading file '200316_BG10V.scn' from cache...
#> Info: reading file '200316_BG15V.scn' from cache...
#> Info: reading file '200316_BG20V.scn' from cache...
#> Info: reading file '200316_BG25V.scn' from cache...
#> Info: reading file '200316_BG5V.scn' from cache...
#> Info: finished reading 6 files in 0.75 secs
#> Info: encountered 6 problems in total
#> # A tibble: 6 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200313_BG5V… warning get_creati… file creation date cannot be accessed on thi…
#> 2 200316_BG10… warning get_creati… file creation date cannot be accessed on thi…
#> 3 200316_BG15… warning get_creati… file creation date cannot be accessed on thi…
#> 4 200316_BG20… warning get_creati… file creation date cannot be accessed on thi…
#> 5 200316_BG25… warning get_creati… file creation date cannot be accessed on thi…
#> 6 200316_BG5V… warning get_creati… file creation date cannot be accessed on thi…
#> Error in rethrow_call(c_processx_exec, command, c(command, args), stdin, : cannot start processx process '' (system error 2, No such file or directory) @unix/processx.c:590 (processx_exec)
pf <- pprof({for (file in files) { isoreader::iso_read_scan(file) }})
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200313_BG5V.scn' from cache...
#> Info: finished reading 1 files in 0.11 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200313_BG5V… warning get_creati… file creation date cannot be accessed on thi…
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200316_BG10V.scn' with '.scn' reader
#> Info: finished reading 1 files in 1.41 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200316_BG10… warning get_creati… file creation date cannot be accessed on thi…
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200316_BG15V.scn' with '.scn' reader
#> Info: finished reading 1 files in 1.76 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200316_BG15… warning get_creati… file creation date cannot be accessed on thi…
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200316_BG20V.scn' with '.scn' reader
#> Info: finished reading 1 files in 1.75 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200316_BG20… warning get_creati… file creation date cannot be accessed on thi…
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200316_BG25V.scn' with '.scn' reader
#> Info: finished reading 1 files in 1.17 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200316_BG25… warning get_creati… file creation date cannot be accessed on thi…
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file '200316_BG5V.scn' with '.scn' reader
#> Info: finished reading 1 files in 1.24 secs
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id      type    func        details                                      
#>   <chr>        <chr>   <chr>       <chr>                                        
#> 1 200316_BG5V… warning get_creati… file creation date cannot be accessed on thi…
#> Error in rethrow_call(c_processx_exec, command, c(command, args), stdin, : cannot start processx process '' (system error 2, No such file or directory) @unix/processx.c:590 (processx_exec)

^{Created on 2020-03-24 by the reprex package (v0.3.0)}

japhir on 24 Mar 2020

Yeah, I think we know the benefit of vectorization is negligible here.

The files on this drive are almost never updated, only new ones are added continuously. So probably the second approach should be better?

That's actually a situation where dynamic files + dynamic branching should help. If you are accumulating files and only the (few) new ones need attention, it can be helpful to skip the rest.

By the way, is accessing SMB the most expensive part of the computation? Do you absolutely need to store those files on SMB? Depending, we should probably think about a totally different approach.

Now, when I re-run themake(plan), I get this error:

Sorry, I am not sure what to do with that error message without a reprex.

wlandau on 24 Mar 2020

Cool!

By the way, is accessing SMB the most expensive part of the computation? Do you absolutely need to store those files on SMB? Depending, we should probably think about a totally different approach.

I can't be sure, but it sure is very slow. I could perhaps rsync the files over to local every now and then, but I'm pretty low on local storage and there are quite a few files. Do you think copy operations would be faster than these read operations?

If I'd rsync them over, the last-modified date wouldn't update right? So using dynamic targets that link to a locally mounted (external) HDD should still not update the old files?

japhir on 24 Mar 2020

I can't be sure, but it sure is very slow. I could perhaps rsync the files over to local every now and then, but I'm pretty low on local storage and there are quite a few files. Do you think copy operations would be faster than these read operations?

If you can make space, it would certainly speed things up. But if you are low on storage, you should also try to limit the size of drake's cache. If the raw target is returning all the data, drake will store it. So I recommend only storing summarized versions. You may want to periodically empty .drake/scratch and .drake/drake/tmp, as well as call drake_gc().

If I'd rsync them over, the last-modified date wouldn't update right? So using dynamic targets that link to a locally mounted (external) HDD should still not update the old files?

drake cares more about the file hash than the time stamp, so you would be fine even if the time stamp changed. But if the file path changes, drake does rerun things.

wlandau on 24 Mar 2020

I figured I could check out the speed difference for local SSD vs local HDD vs remote files. To make it a fair comparison, I include copying the data over. Looks like it's much faster to batch-copy everything over and then to read it in with either the vectorized or the looped version.

library(drake)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
library(ggplot2)

setwd("~/SurfDrive/PhD/programming/pressure_baseline/")


# let's check it out for 6 files

# make sure to mount the samba drive
files <- readd(files_2019) %>% tail()

vec_read <- function(x = files) {
  isoreader::iso_read_scan(x, cache = FALSE, read_cache = FALSE, quiet = TRUE, parallel = FALSE)
}

for_read <- function(x = files) {
  for (file in x) {
    isoreader::iso_read_scan(file, cache = FALSE, read_cache = FALSE, quiet = TRUE, parallel = FALSE)
  }
}

local_copy <- function(x = files, fun, to_folder = "dat") {
  # copy the files
  to_loc <- paste0(to_folder, "/", basename(files))
  file.copy(x, to_loc, overwrite = TRUE)
  fun(to_loc)
}

mbm <- microbenchmark(
  vec_read_remote = vec_read(files),
  for_read_remote = for_read(files),
  vec_read_local = local_copy(files, vec_read, "~/SurfDrive/PhD/programming/pressure_baseline/dat"),
  for_read_local = local_copy(files, for_read, "~/SurfDrive/PhD/programming/pressure_baseline/dat"),
  vec_read_hdd = local_copy(files, vec_read, "/run/media/japhir/a584a078-43dd-414a-8356-561a0b5e96c1/Downloads/temp"),
  for_read_hdd = local_copy(files, for_read, "/run/media/japhir/a584a078-43dd-414a-8356-561a0b5e96c1/Downloads/temp"),
  times = 10
)

mbm
#> Unit: seconds
#>             expr      min       lq     mean   median       uq       max neval
#>  vec_read_remote 4.860143 5.500096 6.732677 6.648148 7.456187  9.175671    10
#>  for_read_remote 4.940595 5.390456 6.590071 6.157208 6.722379 11.912869    10
#>   vec_read_local 3.451090 3.726264 4.785966 4.503839 5.471826  7.801718    10
#>   for_read_local 3.577412 3.751829 5.554859 4.083884 5.031610 17.468640    10
#>     vec_read_hdd 3.545056 3.760137 4.419138 4.384251 4.882394  6.045486    10
#>     for_read_hdd 3.661752 3.919572 4.400746 4.227759 4.771214  5.593434    10

autoplot(mbm)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

^{Created on 2020-03-24 by the reprex package (v0.3.0)}

japhir on 24 Mar 2020

.drake/scratch appears empty in all my different pipelines.
.drake/drake is a very small (20K) folder
.drake/data appears to have tiny rds files for all the steps. If I clean this out, I assume it will have to run the whole plan from scratch, right?

japhir on 24 Mar 2020

https://github.com/ropensci/drake/issues/1230#issuecomment-603228489 is helpful. How many files do you expect to read this way?

Yup, I meant to say drake_gc(). I will edit the comment.

.drake/data appears to have tiny rds files for all the steps. If I clean this out, I assume it will have to run the whole plan from scratch, right?

Correct, please leave .drake/data/ alone unless you plan to rerun everything from scratch (in which case clean(destroy = TRUE) is safer).

More recommendations:

If you are worried about space, consider downsizing the raw target(s) before returning values. Maybe read_files_function() could have a group_by() %>% summarize(...) statement to make the data smaller before you save it.
The specialized formats at https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets could save you both time and space. If you are returning data frames, I recommend format = "fst_tbl". For strange and complicated objects, format = "qs" should also work.

wlandau-lilly on 24 Mar 2020

Thanks again for such extensive feedback!

To give you an indication of what amount of files we're talking about:

almost 4k scans (approximately 63 kB per file)
- in one of the subdirectories 1820 files
- 1334 in another
- 804 more in yet another
almost 25k raw measurement files of about 700 kB per file
- 14791 + 4884 + 5201 are processed in separate pipelines

Note that these files contain very little data: with simple compression their file sizes drop down by 10x.

The guy who wrote isoreader wrote a special iso_file format, so I could try to use the format = "qs" for those outputs and fst_tbl for the resulting tibbles. I would like to be able to access the raw data if possible though, so early summaries are not so desirable. I can think about how many processing steps I would like to save after to prevent data duplication for every step of the pipeline.

So the plan for now would be to rsync or otherwise copy the files over and at that point optimisation won't matter a huge deal anymore, but probably dynamic targets according to https://github.com/ropensci/drake/issues/1230#issuecomment-602773782 would be best since they can skip all the files that haven't been touched since, right?

japhir on 24 Mar 2020

So the plan for now would be to rsync or otherwise copy the files over and at that point optimisation won't matter a huge deal anymore, but probably dynamic targets according to #1230 (comment) would be best since they can skip all the files that haven't been touched since, right?

Now that you have told me more about the files, I actually recommend aggregating as many of those tiny files as possible into a smaller number of larger files before running an analysis pipeline. I think this is the best thing you can do to speed up computation. If you create strategic zip archives ahead of time, each one small enough for your local storage to handle, you might be able to avoid a lot of the headache.

process_zip_archive <- function(zipped) {
  tmp_zip <- tempfile()
  file.copy(zipped, tmp_zip) # Copy the zip file locally before unzipping it.
  tmp_dir <- fs::dir_create(tempfile())
  unqip(zipped, exdir = tmp_dir) # Extract the IRMS files.
  out <- lapply(list.files(tmp), read_files_function)
  summarize_data(out) # Custom function to reduce the size of the cached target data.
}

Otherwise, dynamic branching + dynamic files will probably help by skipping read_files_function() for files already up to date.

wlandau on 25 Mar 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Fully embrace vctrs for dynamic branching

wlandau · 4Comments

Combining `workflowr` and `drake`: A perfect match

pat-s · 5Comments

Comparison with Netflix's Metaflow

wlandau · 8Comments

planning the 5.1.0 CRAN release

wlandau · 8Comments

Reproducibility with random numbers

bart1 · 7Comments