Drake: Helper function for creating file targets with multiple files

Created on 16 Feb 2018 · 14 Comments · Source: ropensci/drake

I have a command that creates multiple files every time it runs. The command writes a spatial object in the shapefile format which results in the creation of four files:

st_write(spatial_data,"spatial_data.shp", driver = "ESRI Shapefile")

## Creates:
##   - "spatial_data.shp"
##   - "spatial_data.shx"
##   - "spatial_data.prj"
##   - "spatial_data.dbf"

All shapefiles need these four file types in order to work properly (actually, they need 3 of the 4 but that's irrelevant to this example).

This creates issues for any plan that includes this st_write() command. For instance, if I have the following plan:

plan <- drake_plan( 
    'spatial_data.shp' = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  ) 

The plan tracks only one of the four necessary file targets (spatial_data.shp). If I were to delete any of the untracked files and re-run the plan, it would tell me that all targets are already up to date.
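
To make the gap concrete, here is a minimal base R sketch that lists the companion files a shapefile write is expected to produce and reports which of them are missing. The helpers `shapefile_parts()` and `missing_parts()` are hypothetical names, not part of drake or sf, and the extension list simply mirrors the four files named above.

```r
# Hypothetical helpers for illustration (not part of drake or sf).

# Return the paths of all files that accompany a given .shp path.
shapefile_parts <- function(shp, extensions = c("shp", "shx", "prj", "dbf")) {
  stem <- tools::file_path_sans_ext(shp)
  paste(stem, extensions, sep = ".")
}

# Return the companion files that do not currently exist on disk.
missing_parts <- function(shp) {
  parts <- shapefile_parts(shp)
  parts[!file.exists(parts)]
}

shapefile_parts("spatial_data.shp")
```

Calling `missing_parts("spatial_data.shp")` after deleting one of the untracked files would show the file that the plan above never notices.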

Could there be a function that allows users to create a list of file targets that come from a single command?

spatial_data_files <- drake::file_target_list('nc.shp', 'nc.shx','nc.dbf','nc.prj')  # proposed function

plan <- drake_plan( 
    spatial_data_files = st_write(spatial_data,"spatial_data.shp"),
    file_targets = TRUE,
    strings_in_dots = "literals"
  ) 

Thanks!

api documentation

All 14 comments

@tiernanmartin It's a good point. I was actually trying to solve this sort of problem once and for all in #232, but as I explain here, it is extremely difficult to make drake break the one-file-per-target rule. For now, you can use wildcard templating to make the non-.shp files depend on spatial_data.shp.

EDIT: 2018-02-24

I modified the next bit to be FAQ-friendly. The "first solution" that @tiernanmartin refers to next is actually the solution for drake <= 5.0.0.

Solution for drake > 5.0.0

library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
#> # A tibble: 4 x 2
#>   target                 command                                          
#>   <chr>                  <chr>                                            
#> 1 "\"spatial_data.shp\"" "st_write(spatial_data, file_out(\"spatial_data.…
#> 2 "\"spatial_data.shx\"" "c(file_out(\"spatial_data.shx\"), file_in(\"spa…
#> 3 "\"spatial_data.prj\"" "c(file_out(\"spatial_data.prj\"), file_in(\"spa…
#> 4 "\"spatial_data.dbf\"" "c(file_out(\"spatial_data.dbf\"), file_in(\"spa…
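
The idea behind the wildcard step can be sketched in base R: one command template is copied once per value, with the placeholder swapped out each time. This only illustrates the templating concept; it is not drake's implementation, and `expand_wildcard()` is a hypothetical name.

```r
# Hypothetical illustration of wildcard templating (not drake's own code).
expand_wildcard <- function(template, wildcard, values) {
  vapply(
    values,
    # Replace the literal placeholder with one value at a time.
    function(value) gsub(wildcard, value, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

template <- 'c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))'
commands <- expand_wildcard(template, wildcard = "EXTN", values = c("shx", "prj", "dbf"))
commands
```

Each resulting command names one companion file as an output and spatial_data.shp as an input, which is what ties the extra files to the st_write() step.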

Solution for drake <= 5.0.0

library(drake)
library(magrittr)
plan <- drake_plan(list = c(
  spatial_data.shp = "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI Shapefile\")",
  spatial_data = "c(\"spatial_data.EXTN\", 'spatial_data.shp')"
)) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
plan$target <- drake_quotes(plan$target, single = TRUE)
plan

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"                 

Thanks for the explanation and for pointing me toward the wildcard feature. I'm looking forward to #232 getting merged into the master branch!

In the meantime, I'll use the first approach you recommended. Quick question: I notice that the file targets in the first solution you demonstrated lack file extensions (e.g., 'spatial_data_prj' instead of 'spatial_data.prj'):

## # A tibble: 4 x 2
##   target             command                                                       
##   <chr>              <chr>                                                         
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"                 
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"                 
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"                 

Is there a way to convert those _'s into .'s in the evaluate_plan() call, or do I need to figure out a post-processing step?

Sorry, I forgot about the way drake automatically uses underscore-delimited suffixes. For now, you'll probably have to do plan$target <- gsub("data_", "data\\.", plan$target). Unless I'm overruled in #232, you won't have to worry about setting the target column yourself when it comes to file outputs.
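
A slightly more targeted variant, as a sketch: anchoring the pattern to the known extensions avoids touching any other "data_" that might appear in a target name. The `targets` vector below is only a stand-in for `plan$target`, assuming single-quoted targets with the shx, prj, and dbf suffixes.

```r
# Stand-in for plan$target (assumed values, single-quoted as in the plan).
targets <- c(
  "'spatial_data.shp'",
  "'spatial_data_shx'",
  "'spatial_data_prj'",
  "'spatial_data_dbf'"
)

# Turn a trailing "_ext'" into ".ext'" for the known extensions only.
fixed <- sub("_(shx|prj|dbf)'$", ".\\1'", targets)
fixed
```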

Is there a reason why a target cannot be a directory?

In this example, I realized that the command st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile") will create a directory that contains the four files:

spatial_data/ 
    ├── spatial_data.dbf
    ├── spatial_data.prj
    ├── spatial_data.shp
    └── spatial_data.shx

Having drake track the spatial_data directory rather than the individual files it contains would be nice because it eliminates the need for a many-to-one relationship between the plan's targets and command.

But when I tried implementing it, I got the following error:

## Error: The specified pathname is not a file: spatial_data

Reprex

library(drake)
library(sf) 

spatial_data <- st_read(system.file("shape/nc.shp", package = "sf")) 

plan <- drake_plan(
  spatial_data = st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile"),
  strings_in_dots = "literals",
  file_targets = TRUE
)

plan
## # A tibble: 1 x 2
##   target         command                                                  
##   <chr>          <chr>                                                    
## 1 'spatial_data' "st_write(spatial_data, \"spatial_data\", driver = \"ESR~

make(plan)
## cache C:/Users/UrbanDesigner/AppData/Local/Temp/Rtmp40h7BI/.drake
## connect 2 imports: plan, spatial_data
## connect 1 target: 'spatial_data'
## check 2 items: spatial_data, st_write
## check 1 item: 'spatial_data'
## target 'spatial_data'
## Writing layer `spatial_data' to data source `spatial_data' using driver `ESRI Shapefile'
## features:       100
## fields:         14
## geometry type:  Multi Polygon
## Error: The specified pathname is not a file: spatial_data

I believe the "specified pathname" error does not actually come from drake. Last time I checked, directories are not safe as file targets, and the standard file hashing tools seem to avoid hashing them.

$ md5sum file.csv
6463474bfe6973a81dc7cbc4a71e8dd1  file.csv
$ md5sum ~/projects
md5sum: projects: Is a directory
$ man md5sum # has no recursive option

Drake uses the digest package to hash files, and digest avoids hashing directories too.

> library(digest)
> digest("~/projects/", file = TRUE)
Error: The specified pathname is not a file: /home/landau/projects/
> file.exists("~/projects")
[1] TRUE

Directory targets would be nice to have, but I do not think it is drake's responsibility to figure out how to hash them quickly and efficiently. It's a thorny problem, maybe for a separate package, maybe called dirgest.
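
For illustration only, one naive way such a directory hash could work is to hash every file, then hash a manifest of relative paths and per-file hashes. This is a sketch using tools::md5sum() from base R, not something drake or digest provides. The name `hash_directory()` is hypothetical, and the sketch makes no attempt to be fast on large directory trees, which is the hard part.

```r
# Hypothetical sketch of a directory hash (not provided by drake or digest).
hash_directory <- function(dir) {
  # Relative paths of all files under dir, in a locale-independent order.
  files <- sort(
    list.files(dir, recursive = TRUE, all.files = TRUE),
    method = "radix"
  )
  file_hashes <- unname(tools::md5sum(file.path(dir, files)))
  # Hash a manifest of "path hash" lines so that renames change the result too.
  manifest <- tempfile()
  on.exit(unlink(manifest))
  writeLines(paste(files, file_hashes), manifest)
  unname(tools::md5sum(manifest))
}
```

Because every file is read on every call, the cost grows with the total size of the directory, so a real tool would likely need caching based on modification times or sizes.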

Oops: forgot this issue was about more than just directory hashes. Reopening.

I just updated https://github.com/ropensci/drake/issues/257#issuecomment-366379134 to be more FAQ-friendly, and this thread is now part of our automatically-generated FAQ. I think we can close. We should discuss potential further development on #12.

FYI: the best practices guide now has detailed guidance on output file targets, including the main drawback and main alternative to the workaround we talked about earlier in the thread.

@wlandau awesome work implementing this feature 🎉

You asked for a shapefile workflow so I did my best to put something together:


Drake Shapefile Example

# SETUP -------------------------------------------------------------------


library(tibble)
library(purrr)
library(sf)
library(drake) # devtools::install_github("ropensci/drake")



# PLAN --------------------------------------------------------------------


make_place <- function(Name, Latitude, Longitude){
  tibble(
    Name = Name,
    Latitude = Latitude,
    Longitude = Longitude
  ) %>% 
    st_as_sf(coords = c("Longitude", "Latitude")) %>% 
    st_set_crs(4326)
}


st_write_multiple <- function(..., file_outputs){ 
  pwalk(list(...), st_write)
}

u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = st_write_multiple(
    list(u_auckland),
    dsn = file_out("u-auckland.shp"),
    driver = "ESRI Shapefile",
    delete_dsn = TRUE,
    file_outputs = file_out(
      c("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    )
  ),
  strings_in_dots = "literals"
)

u_auckland_plan

make(u_auckland_plan)

# TEST --------------------------------------------------------------------


file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

It is more complicated than I hoped it would be. The drake side works perfectly, but I needed to write a wrapper function around st_write() to allow the extra file outputs to be tracked.

Perhaps someone else who is experimenting with using drake to manage a spatial data workflow can offer a simpler example? cc: @noamross @krlmlr @pat-s

@tiernanmartin Thanks for the quick start on this example! I am optimistic. drake commands can be arbitrary multi-line code chunks, so I do not think we need the wrapper around st_write(). What about this plan?

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

I am not sure my changes are totally correct because make(u_auckland_plan) gives warnings:

Warning: target u_aukland_shapefile warnings:
  GDAL Error 1: u-auckland.shp does not appear to be a file or directory.

On the other hand, the four files do appear, including u-auckland.shp.

Ah, I didn't realize that commands were so flexible.

Your code looks good to me! The GDAL error is annoying but not a deal breaker. It shows up because I set delete_dsn = TRUE, which makes GDAL expect an existing file to delete before it creates the replacement. The default setting of delete_dsn = FALSE will not allow the files to be overwritten.

I just noticed there is a typo - here's the complete version with your suggested revisions:

library(tibble)
library(sf)
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")

make_place <- function(Name, Latitude, Longitude){
  tibble(
    Name = Name,
    Latitude = Latitude,
    Longitude = Longitude
  ) %>% 
    st_as_sf(coords = c("Longitude", "Latitude")) %>% 
    st_set_crs(4326)
}


u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_auckland_shapefile = {
    file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
    st_write(
      obj = u_auckland,
      dsn = file_out("u-auckland.shp"),
      driver = "ESRI Shapefile",
      delete_dsn = TRUE
    )
  }
)

make(u_auckland_plan)

file.remove("u-auckland.shp")

make(u_auckland_plan)


file.remove("u-auckland.prj")

make(u_auckland_plan)

file.remove("u-auckland.dbf")

make(u_auckland_plan)

Thanks, @tiernanmartin. This is nice inspiration for a chapter in the docs.

FYI: effective #795, you can write st_write(spatial_data, file_out("spatial_data"), driver = "ESRI Shapefile") and drake will track the entire directory of output files.
