Drake: Helper function for creating file targets with multiple files

Created on 16 Feb 2018 · 14 Comments · Source: ropensci/drake

I have a command that creates multiple files every time it runs. The command writes a spatial object in the shapefile format which results in the creation of four files:

st_write(spatial_data,"spatial_data.shp", driver = "ESRI Shapefile")

## Creates:
##   - "spatial_data.shp"
##   - "spatial_data.shx"
##   - "spatial_data.prj"
##   - "spatial_data.dbf"

All shapefiles need these four file types in order to work properly (actually, they need 3 of the 4 but that's irrelevant to this example).

This creates issues for any plan that includes this st_write() command. For instance, if I have the following plan:

plan <- drake_plan( 
    'spatial_data.shp' = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  )

The plan tracks only one of the four necessary file targets (spatial_data.shp). If I delete any of the untracked files and re-run the plan, drake tells me that all targets are already up to date.
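To make the gap concrete, here is a minimal base R sketch that lists the companion files implied by a single .shp path and reports which of them are missing on disk. The helper names shapefile_parts() and missing_parts() are hypothetical (they are not part of drake or sf), and the extension list is simply the four files named above.

```r
# Hypothetical helpers, not part of drake or sf.
# Given the path to a .shp file, list the four files named in this post.
shapefile_parts <- function(shp_path,
                            extensions = c("shp", "shx", "prj", "dbf")) {
  base <- tools::file_path_sans_ext(shp_path)
  paste(base, extensions, sep = ".")
}

# Return the companion files that do not currently exist on disk.
missing_parts <- function(shp_path) {
  parts <- shapefile_parts(shp_path)
  parts[!file.exists(parts)]
}

shapefile_parts("spatial_data.shp")
```

A check like missing_parts("spatial_data.shp") could be run by hand after make() to spot a deleted sidecar file; it does not make drake itself aware of those files.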

Could there be a function that allows users to create a list of file targets that come from a single command?

spatial_data_files <- drake::file_target_list('nc.shp', 'nc.shx','nc.dbf','nc.prj')  # proposed function

plan <- drake_plan( 
    spatial_data_files = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  )

Thanks!

tiernanmartin

All 14 comments

@tiernanmartin It's a good point. I was actually trying to solve this sort of problem once and for all in #232, but as I explain here, it is extremely difficult to make drake break the one-file-per-target rule. For now, you can use wildcard templating to make the non-.shp files depend on spatial_data.shp.

EDIT: 2018-02-24

I modified the next bit to be FAQ-friendly. The "first solution" that @tiernanmartin refers to next is actually the solution for drake <= 5.0.0.

Solution for `drake` > 5.0.0

library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
#> # A tibble: 4 x 2
#>   target                 command                                          
#>   <chr>                  <chr>                                            
#> 1 "\"spatial_data.shp\"" "st_write(spatial_data, file_out(\"spatial_data.…
#> 2 "\"spatial_data.shx\"" "c(file_out(\"spatial_data.shx\"), file_in(\"spa…
#> 3 "\"spatial_data.prj\"" "c(file_out(\"spatial_data.prj\"), file_in(\"spa…
#> 4 "\"spatial_data.dbf\"" "c(file_out(\"spatial_data.dbf\"), file_in(\"spa…

Solution for `drake` <= 5.0.0

library(drake)
library(magrittr)
plan <- drake_plan(list = c(
  spatial_data.shp = "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI Shapefile\")",
  spatial_data = "c(\"spatial_data.EXTN\", 'spatial_data.shp')"
)) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
plan$target <- drake_quotes(plan$target, single = TRUE)
plan

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"

wlandau on 16 Feb 2018

Thanks for the explanation and for pointing me toward the wildcard feature. I'm looking forward to #232 getting merged into the master branch!

In the meantime, I'll use the first approach you recommended. Quick question: I notice that the file targets in the first solution you demonstrated lack file extensions (e.g., 'spatial_data_prj' instead of 'spatial_data.prj'):

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"

Is there a way to convert those _'s into .'s in the evaluate_plan() call, or do I need to figure out a post-processing step?

tiernanmartin on 17 Feb 2018

Sorry, I forgot about the way drake automatically uses underscore-delimited suffixes. For now, you'll probably have to do plan$target <- gsub("data_", "data\\.", plan$target). Unless I'm overruled in #232, you won't have to worry about setting the target column yourself when it comes to file outputs.
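As a quick illustration of that post-processing step, the base R sketch below applies the suggested gsub() call to a small vector of single-quoted targets like the ones in the plan. The second, stricter pattern is an assumption added here for comparison, not something proposed in the thread: it only rewrites a trailing underscore-delimited suffix, so underscores elsewhere in a file name are left alone.

```r
# Targets as they appear in the drake <= 5.0.0 plan (single-quoted file targets).
targets <- c("'spatial_data.shp'", "'spatial_data_shx'", "'spatial_data_prj'")

# The substitution suggested above: swap the underscore after "data" for a dot.
fixed <- gsub("data_", "data\\.", targets)

# Stricter variant (an assumption, not from the thread): only convert the
# final "_suffix" before the closing quote into ".suffix".
fixed_strict <- sub("_([[:alnum:]]+)'$", ".\\1'", targets)

fixed
fixed_strict
```

Both calls leave 'spatial_data.shp' untouched and turn the other two targets into 'spatial_data.shx' and 'spatial_data.prj'.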

wlandau on 17 Feb 2018

Is there a reason why a target cannot be a directory?

In this example, I realized that the command st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile") will create a directory that contains the four files:

spatial_data/ 
    ├── spatial_data.dbf
    ├── spatial_data.prj
    ├── spatial_data.shp
    └── spatial_data.shx

Having drake track the spatial_data directory rather than the individual files it contains would be nice because it eliminates the need for a many-to-one relationship between the plan's targets and command.

But when I tried implementing it, I got the following error:

## Error: The specified pathname is not a file: spatial_data

Reprex

library(drake)
library(sf) 

spatial_data <- st_read(system.file("shape/nc.shp", package = "sf")) 

plan <- drake_plan(
  spatial_data = st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile"),
  strings_in_dots = "literals",
  file_targets = TRUE
)

plan
## # A tibble: 1 x 2
##   target         command                                                  
##   <chr>          <chr>                                                    
## 1 'spatial_data' "st_write(spatial_data, \"spatial_data\", driver = \"ESR~

make(plan)
## cache C:/Users/UrbanDesigner/AppData/Local/Temp/Rtmp40h7BI/.drake
## connect 2 imports: plan, spatial_data
## connect 1 target: 'spatial_data'
## check 2 items: spatial_data, st_write
## check 1 item: 'spatial_data'
## target 'spatial_data'
## Writing layer `spatial_data' to data source `spatial_data' using driver `ESRI Shapefile'
## features:       100
## fields:         14
## geometry type:  Multi Polygon
## Error: The specified pathname is not a file: spatial_data

tiernanmartin on 19 Feb 2018

I believe the "specified pathname" error does not actually come from drake. Last time I checked, directories were not safe as file targets, and the standard file hashing tools refuse to hash them.

$ md5sum file.csv
6463474bfe6973a81dc7cbc4a71e8dd1  file.csv
$ md5sum ~/projects
md5sum: projects: Is a directory
$ man md5sum # has no recursive option

Drake uses the digest package to hash files, and digest avoids hashing directories too.

> library(digest)
> digest("~/projects/", file = TRUE)
Error: The specified pathname is not a file: /home/landau/projects/
> file.exists("~/projects")
[1] TRUE

Directory targets would be nice to have, but I do not think it is drake's responsibility to figure out how to hash them quickly and efficiently. It's a thorny problem, maybe for a separate package, maybe called dirgest.
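For a sense of what such a tool would have to do, here is a naive base R sketch that fingerprints a directory by hashing every file in it and then hashing the resulting manifest. hash_directory() is a hypothetical name, this is not how drake or digest work, and it reads every file in full, which is exactly the speed problem mentioned above.

```r
# Hypothetical helper, not part of drake or digest: fingerprint a directory
# by hashing each file and then hashing the combined manifest.
hash_directory <- function(dir) {
  stopifnot(dir.exists(dir))
  # Relative paths of all files (including hidden ones), sorted with the
  # radix method so the order does not depend on the locale.
  files <- sort(list.files(dir, recursive = TRUE, all.files = TRUE),
                method = "radix")
  hashes <- unname(tools::md5sum(file.path(dir, files)))
  # Record file names next to their hashes so that renames change the result.
  manifest <- tempfile()
  on.exit(unlink(manifest))
  writeLines(paste(files, hashes), manifest)
  unname(tools::md5sum(manifest))
}
```

Calling hash_directory("spatial_data") before and after a write would show whether anything inside the directory changed, at the cost of re-reading all of its contents each time.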

wlandau on 19 Feb 2018

Oops: forgot this issue was about more than just directory hashes. Reopening.

wlandau on 19 Feb 2018

I just updated https://github.com/ropensci/drake/issues/257#issuecomment-366379134 to be more FAQ-friendly, and this thread is now part of our automatically-generated FAQ. I think we can close. We should discuss potential further development on #12.

wlandau on 24 Feb 2018

FYI: the best practices guide now has detailed guidance on output file targets, including the main drawback and main alternative to the workaround we talked about earlier in the thread.

wlandau on 2 May 2018

@wlandau awesome work implementing this feature 🎉

You asked for a shapefile workflow so I did my best to put something together:

Drake Shapefile Example

# SETUP -------------------------------------------------------------------


library(tibble)
library(purrr)
library(sf)
library(drake) # devtools::install_github("ropensci/drake")



# PLAN --------------------------------------------------------------------


make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}


st_write_multiple <- function(..., file_outputs){ 
  pwalk(list(...), st_write)
}

u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = st_write_multiple(
    list(u_auckland),
    dsn = file_out("u-auckland.shp"),
    driver = "ESRI Shapefile",
    delete_dsn = TRUE,
    file_outputs = file_out(c("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf"))
  ),
  strings_in_dots = "literals"
)

u_auckland_plan

make(u_auckland_plan)

# TEST --------------------------------------------------------------------


file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

It is more complicated than I hoped it would be. The drake side works perfectly, but I needed to write a wrapper function around st_write() to allow the extra file outputs to be tracked.

Perhaps someone else who is experimenting with using drake to manage a spatial data workflow can offer a simpler example? cc: @noamross @krlmlr @pat-s

tiernanmartin on 16 Jul 2018

@tiernanmartin Thanks for the quick start on this example! I am optimistic. drake commands can be arbitrary multi-line code chunks, so I do not think we need the wrapper around st_write(). What about this plan?

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

I am not sure my changes are totally correct because make(u_auckland_plan) gives warnings:

Warning: target u_aukland_shapefile warnings:
  GDAL Error 1: u-auckland.shp does not appear to be a file or directory.

On the other hand, the four files do appear, including u-auckland.shp.

wlandau on 16 Jul 2018

Ah, I didn't realize that commands were so flexible.

Your code looks good to me! The GDAL error is annoying but not a deal breaker. It shows up because I set delete_dsn = TRUE, which makes GDAL expect an existing file to delete before it creates the replacement. The default setting of delete_dsn = FALSE will not allow the files to be overwritten.

tiernanmartin on 16 Jul 2018

I just noticed there is a typo (u_aukland_shapefile should be u_auckland_shapefile). Here's the complete version with your suggested revisions:

library(tibble)
library(sf)
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")

make_place <- function(Name, Latitude, Longitude){
  tibble(Name = Name,
                    Latitude = Latitude,
                    Longitude = Longitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326)
}


u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_auckland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

make(u_auckland_plan)

file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

tiernanmartin on 16 Jul 2018

Thanks, @tiernanmartin. This is nice inspiration for a chapter in the docs.

wlandau on 19 Jul 2018

FYI: effective #795, you can write st_write(spatial_data, file_out("spatial_data"), driver = "ESRI Shapefile") and drake will track the entire directory of output files.

wlandau on 22 Mar 2019
