I have a command that creates multiple files every time it runs. The command writes a spatial object in the shapefile format which results in the creation of four files:
st_write(spatial_data,"spatial_data.shp", driver = "ESRI Shapefile")
## Creates:
## - "spatial_data.shp"
## - "spatial_data.shx"
## - "spatial_data.prj"
## - "spatial_data.dbf"
All shapefiles need these four file types in order to work properly (actually, they need 3 of the 4 but that's irrelevant to this example).
This creates issues for any plan that includes this st_write() command. For instance, if I have the following plan:
plan <- drake_plan(
'spatial_data.shp' = st_write(spatial_data,"spatial_data.shp"),
file_targets = TRUE,
strings_in_dots = "literals"
)
The plan tracks only one of the four necessary file targets (spatial_data.shp). If I were to delete any of the untracked files and re-run the plan, it would tell me that all targets are already up to date.
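To make the gap concrete, here is a minimal base-R sketch of the check that the plan does not perform. The helpers `shapefile_parts()` and `missing_shapefile_parts()` are hypothetical names invented for this illustration; they are not part of drake or sf, and the placeholder files are empty stand-ins rather than real shapefile components.

```r
# Hypothetical helper (not part of drake or sf): list the four component
# files that belong to a shapefile, given the path to its .shp file.
shapefile_parts <- function(shp_path) {
  base <- tools::file_path_sans_ext(shp_path)
  paste0(base, c(".shp", ".shx", ".prj", ".dbf"))
}

# Hypothetical helper: report which component files are missing on disk.
missing_shapefile_parts <- function(shp_path) {
  parts <- shapefile_parts(shp_path)
  parts[!file.exists(parts)]
}

# Simulate the situation with empty placeholder files in a temp directory.
tmp <- file.path(tempdir(), "shapefile_demo")
dir.create(tmp, showWarnings = FALSE)
shp <- file.path(tmp, "spatial_data.shp")
file.create(shapefile_parts(shp))

# Delete one of the files that the plan does not track.
file.remove(file.path(tmp, "spatial_data.prj"))

# The .shp file is still present, but the shapefile is incomplete.
basename(missing_shapefile_parts(shp))  # expected: "spatial_data.prj"
```

A plan that only watches spatial_data.shp has no way to notice the result of the last line.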
Could there be a function that allows users to create a list of file targets that come from a single command?
spatial_data_files <- drake::file_target_list('nc.shp', 'nc.shx','nc.dbf','nc.prj') # proposed function
plan <- drake_plan(
spatial_data_files = st_write(spatial_data,"spatial_data.shp"),
file_targets = TRUE,
strings_in_dots = "literals"
)
Thanks!
@tiernanmartin It's a good point. I was actually trying to solve this sort of problem once and for all in #232, but as I explain here, it is extremely difficult to make drake break the one-file-per-target rule. For now, you can use wildcard templating to make the non-.shp files depend on spatial_data.shp.
I modified the next bit to be FAQ-friendly. The "first solution" that @tiernanmartin refers to next is actually the solution for drake <= 5.0.0.
drake > 5.0.0
library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
#> # A tibble: 4 x 2
#> target command
#> <chr> <chr>
#> 1 "\"spatial_data.shp\"" "st_write(spatial_data, file_out(\"spatial_data.…
#> 2 "\"spatial_data.shx\"" "c(file_out(\"spatial_data.shx\"), file_in(\"spa…
#> 3 "\"spatial_data.prj\"" "c(file_out(\"spatial_data.prj\"), file_in(\"spa…
#> 4 "\"spatial_data.dbf\"" "c(file_out(\"spatial_data.dbf\"), file_in(\"spa…
drake <= 5.0.0
library(drake)
library(magrittr)
plan <- drake_plan(list = c(
  spatial_data.shp = "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI Shapefile\")",
  spatial_data = "c(\"spatial_data.EXTN\", 'spatial_data.shp')"
)) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
plan$target <- drake_quotes(plan$target, single = TRUE)
plan
## # A tibble: 4 x 2
## target command
## <chr> <chr>
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"
Thanks for the explanation and for pointing me toward the wildcard feature. I'm looking forward to #232 getting merged into the master branch!
In the meantime, I'll use the first approach you recommended. Quick question: I notice that the file targets in the first solution you demonstrated lack file extensions (e.g., 'spatial_data_prj' instead of 'spatial_data.prj'):
## # A tibble: 4 x 2
## target command
## <chr> <chr>
## 1 'spatial_data.shp' "st_write(spatial_data, 'spatial_data.shp', driver = \"ESRI S…
## 2 'spatial_data_shx' "c(\"spatial_data.shx\", 'spatial_data.shp')"
## 3 'spatial_data_prj' "c(\"spatial_data.prj\", 'spatial_data.shp')"
## 4 'spatial_data_dbf' "c(\"spatial_data.dbf\", 'spatial_data.shp')"
Is there a way to convert those _'s into .'s in the evaluate_plan() call, or do I need to figure out a post-processing step?
Sorry, I forgot about the way drake automatically uses underscore-delimited suffixes. For now, you'll probably have to do plan$target <- gsub("data_", "data\\.", plan$target). Unless I'm overruled in #232, you won't have to worry about setting the target column yourself when it comes to file outputs.
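As a quick illustration of that post-processing step, here is a small base-R sketch on a plain character vector standing in for plan$target. It uses fixed = TRUE instead of the escaped replacement above, which is my own variation so that no regex escaping is involved; the vector itself is made up to mirror the target names shown earlier.

```r
# Stand-in for plan$target: single-quoted names with the underscore-delimited
# suffixes that evaluate_plan() appends (illustrative values only).
targets <- c(
  "'spatial_data.shp'",
  "'spatial_data_shx'",
  "'spatial_data_prj'",
  "'spatial_data_dbf'"
)

# Replace the literal "data_" with "data." so each name becomes a file name.
# fixed = TRUE treats both pattern and replacement as plain strings.
fixed_targets <- gsub("data_", "data.", targets, fixed = TRUE)
fixed_targets
```

The .shp target is left untouched because it contains no "data_" substring. Note that this pattern is tied to the name spatial_data; a differently named target would need a different pattern.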
Is there a reason why a target cannot be a directory?
In this example, I realized that the command st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile") will create a directory that contains the four files:
spatial_data/
├── spatial_data.dbf
├── spatial_data.prj
├── spatial_data.shp
└── spatial_data.shx
Having drake track the spatial_data directory rather than the individual files it contains would be nice because it eliminates the need for a many-to-one relationship between the plan's targets and its command.
But when I tried implementing it, I saw the following error:
## Error: The specified pathname is not a file: spatial_data
Reprex
library(drake)
library(sf)
spatial_data <- st_read(system.file("shape/nc.shp", package = "sf"))
plan <- drake_plan(
spatial_data = st_write(spatial_data, "spatial_data", driver = "ESRI Shapefile"),
strings_in_dots = "literals",
file_targets = TRUE
)
plan
## # A tibble: 1 x 2
## target command
## <chr> <chr>
## 1 'spatial_data' "st_write(spatial_data, \"spatial_data\", driver = \"ESR~
make(plan)
## cache C:/Users/UrbanDesigner/AppData/Local/Temp/Rtmp40h7BI/.drake
## connect 2 imports: plan, spatial_data
## connect 1 target: 'spatial_data'
## check 2 items: spatial_data, st_write
## check 1 item: 'spatial_data'
## target 'spatial_data'
## Writing layer `spatial_data' to data source `spatial_data' using driver `ESRI Shapefile'
## features: 100
## fields: 14
## geometry type: Multi Polygon
## Error: The specified pathname is not a file: spatial_data
I believe the "specified pathname" error does not actually come from drake. Last time I checked, directories are not actually safe as file targets. The standard file hashing tools seem to avoid hashing directories at all.
$ md5sum file.csv
6463474bfe6973a81dc7cbc4a71e8dd1 file.csv
$ md5sum ~/projects
md5sum: projects: Is a directory
$ man md5sum # has no recursive option
Drake uses the digest package to hash files, and digest avoids hashing directories too.
library(digest)
> digest("~/projects/", file = TRUE)
Error: The specified pathname is not a file: /home/landau/projects/
> file.exists("~/projects")
[1] TRUE
Directory targets would be nice to have, but I do not think it is drake's responsibility to figure out how to hash them quickly and efficiently. It's a thorny problem, maybe for a separate package, maybe called dirgest.
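For illustration only, here is a naive base-R sketch of what such a directory hash could look like. `hash_directory()` is a hypothetical function, not something drake or digest provides. It hashes every file with tools::md5sum(), writes the relative paths and hashes to a temporary manifest, and hashes the manifest. It reads every file in full, which is exactly the speed concern mentioned above.

```r
# Hypothetical helper (not part of drake or digest): compute one MD5 hash
# that summarizes the relative paths and contents of all files in `dir`.
hash_directory <- function(dir) {
  # Relative paths, sorted in a locale-independent order for reproducibility.
  files <- sort(list.files(dir, recursive = TRUE), method = "radix")
  hashes <- tools::md5sum(file.path(dir, files))
  # Write "path hash" lines to a manifest and hash the manifest itself.
  manifest <- tempfile()
  on.exit(unlink(manifest))
  writeLines(paste(files, unname(hashes)), manifest)
  unname(tools::md5sum(manifest))
}

# Demonstration on a throwaway directory.
demo_dir <- file.path(tempdir(), "dirgest_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("a", file.path(demo_dir, "one.txt"))
writeLines("b", file.path(demo_dir, "two.txt"))
before <- hash_directory(demo_dir)

# Changing any file's contents changes the directory hash.
writeLines("changed", file.path(demo_dir, "two.txt"))
after <- hash_directory(demo_dir)
before != after
```

Because the manifest includes relative paths, renaming a file also changes the hash. Empty subdirectories are ignored in this sketch, since list.files() returns only files here.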
Oops: forgot this issue was about more than just directory hashes. Reopening.
I just updated https://github.com/ropensci/drake/issues/257#issuecomment-366379134 to be more FAQ-friendly, and this thread is now part of our automatically-generated FAQ. I think we can close. We should discuss potential further development on #12.
FYI: the best practices guide now has detailed guidance on output file targets, including the main drawback and main alternative to the workaround we talked about earlier in the thread.
@wlandau awesome work implementing this feature 🎉
You asked for a shapefile workflow so I did my best to put something together:
# SETUP -------------------------------------------------------------------
library(tibble)
library(purrr)
library(sf)
library(drake) # devtools::install_github("ropensci/drake")
# PLAN --------------------------------------------------------------------
make_place <- function(Name, Latitude, Longitude){
tibble(Name = Name,
Latitude = Latitude,
Longitude = Longitude) %>%
st_as_sf(coords = c("Longitude", "Latitude")) %>%
st_set_crs(4326)
}
st_write_multiple <- function(..., file_outputs){
pwalk(list(...), st_write)
}
u_auckland_plan <- drake_plan(
  u_auckland = make_place(
    Name = "University of Auckland",
    Latitude = -36.8521369,
    Longitude = 174.7688785
  ),
  u_aukland_shapefile = st_write_multiple(
    list(u_auckland),
    dsn = file_out("u-auckland.shp"),
    driver = "ESRI Shapefile",
    delete_dsn = TRUE,
    file_outputs = file_out(c("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf"))
  ),
  strings_in_dots = "literals"
)
u_auckland_plan
make(u_auckland_plan)
# TEST --------------------------------------------------------------------
file.remove("u-auckland.shp")
make(u_auckland_plan)
file.remove("u-auckland.prj")
make(u_auckland_plan)
file.remove("u-auckland.dbf")
make(u_auckland_plan)
It is more complicated than I hoped it would be. The drake side works perfectly, but I needed to write a wrapper function around st_write() to allow the extra file outputs to be tracked.
Perhaps someone else who is experimenting with using drake to manage a spatial data workflow can offer a simpler example? cc: @noamross @krlmlr @pat-s
@tiernanmartin Thanks for the quick start on this example! I am optimistic. drake commands can be arbitrary multi-line code chunks, so I do not think we need the wrapper around st_write(). What about this plan?
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
u_auckland_plan <- drake_plan(
u_auckland = make_place(
Name = "University of Auckland",
Latitude = -36.8521369,
Longitude = 174.7688785
),
u_aukland_shapefile = {
file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
st_write(
obj = u_auckland,
dsn = file_out("u-auckland.shp"),
driver = "ESRI Shapefile",
delete_dsn = TRUE
)
}
)
I am not sure my changes are totally correct because make(u_auckland_plan) gives warnings:
Warning: target u_aukland_shapefile warnings:
GDAL Error 1: u-auckland.shp does not appear to be a file or directory.
On the other hand, the four files do appear, including u-auckland.shp.
Ah, I didn't realize that commands were so flexible.
Your code looks good to me! The GDAL error is annoying but not a deal breaker. The reason it shows up is because I set delete_dsn = TRUE, causing GDAL to expect to have to delete a file before it creates the replacement. The default setting of delete_dsn = FALSE will not allow the files to be overwritten.
I just noticed there is a typo - here's the complete version with your suggested revisions:
library(tibble)
library(sf)
library(magrittr) # provides %>%, which make_place() uses
library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")
make_place <- function(Name, Latitude, Longitude){
tibble(Name = Name,
Latitude = Latitude,
Longitude = Longitude) %>%
st_as_sf(coords = c("Longitude", "Latitude")) %>%
st_set_crs(4326)
}
u_auckland_plan <- drake_plan(
u_auckland = make_place(
Name = "University of Auckland",
Latitude = -36.8521369,
Longitude = 174.7688785
),
u_auckland_shapefile = {
file_out("u-auckland.prj", "u-auckland.shx", "u-auckland.dbf")
st_write(
obj = u_auckland,
dsn = file_out("u-auckland.shp"),
driver = "ESRI Shapefile",
delete_dsn = TRUE
)
}
)
make(u_auckland_plan)
file.remove("u-auckland.shp")
make(u_auckland_plan)
file.remove("u-auckland.prj")
make(u_auckland_plan)
file.remove("u-auckland.dbf")
make(u_auckland_plan)
Thanks, @tiernanmartin. This is nice inspiration for a chapter in the docs.
FYI: effective #795, you can write st_write(spatial_data, file_out("spatial_data"), driver = "ESRI Shapefile") and drake will track the entire directory of output files.