Drake: Filtering groups of targets

Created on 9 Oct 2019 · 2 Comments · Source: ropensci/drake

Prework

This hasn't been suggested from what I've seen.

Description

I have a workflow where multiple objects have some common initial processing steps, but very different subsequent analysis steps. Ideally I'd like to do something like the following:

plan <- drake_plan(

   data = load_data(),

   preprocessed = target(
      common_step(data, param1, param2),
      transform=cross(param1=c(1,2), param2=c(1,2,3,4,5))
   ),

   first_analysis = target(
       analyze_somehow(preprocessed, param2),
       transform=map(preprocessed, param2, .filter=(param1 == 1))
       # Or maybe make a new transform verb for this?
   ),

   aux_data = load_from_elsewhere(),

   second_analysis = target(
       analyze_differently(preprocessed, aux_data, param2),
       transform=map(preprocessed, param2, .filter=(param1 == 2))
   )
)

As far as I can tell, to get the same results with current functionality I either have to write separate targets for the common step (what I've gone with in the actual workflow that inspired this suggestion), or make the divergent logic in the subsequent analyses look programmatically similar enough to be able to pass the functions to map, which may be harder for more complicated workflows.

api question

All 2 comments

map(.filter) is an interesting idea I have not considered before. I am currently ambivalent, but I will keep thinking about it. Right now, you can use trace = TRUE and dplyr to achieve the filtering you describe. Any targets downstream of first_* and second_* will require similar filtering.

library(drake)

plan <- drake_plan(
  data = load_data(),
  aux = load_from_elsewhere(),
  prep = target(
    common_step(data, param1, param2),
    transform = cross(
      param1 = c(1, 2),
      param2 = c("a", "b")
    )
  ),
  first = target(
    analyze_somehow(prep, param2),
    transform = map(prep, param2, .id = c(param1, param2))
  ),
  second = target(
    analyze_differently(prep, aux, param2),
    transform = map(prep, param2, .id = c(param1, param2))
  ),
  trace = TRUE
)

plan
#> # A tibble: 14 x 7
#>    target    command                  param1 param2 prep   first   second  
#>    <chr>     <expr>                   <chr>  <chr>  <chr>  <chr>   <chr>   
#>  1 data      load_data()            … <NA>   <NA>   <NA>   <NA>    <NA>    
#>  2 aux       load_from_elsewhere()  … <NA>   <NA>   <NA>   <NA>    <NA>    
#>  3 prep_1_a  common_step(data, 1, "a… 1      "\"a\… prep_… <NA>    <NA>    
#>  4 prep_2_a  common_step(data, 2, "a… 2      "\"a\… prep_… <NA>    <NA>    
#>  5 prep_1_b  common_step(data, 1, "b… 1      "\"b\… prep_… <NA>    <NA>    
#>  6 prep_2_b  common_step(data, 2, "b… 2      "\"b\… prep_… <NA>    <NA>    
#>  7 first_1_a analyze_somehow(prep_1_… 1      "\"a\… prep_… first_… <NA>    
#>  8 first_2_a analyze_somehow(prep_2_… 2      "\"a\… prep_… first_… <NA>    
#>  9 first_1_b analyze_somehow(prep_1_… 1      "\"b\… prep_… first_… <NA>    
#> 10 first_2_b analyze_somehow(prep_2_… 2      "\"b\… prep_… first_… <NA>    
#> 11 second_1… analyze_differently(pre… 1      "\"a\… prep_… <NA>    second_…
#> 12 second_2… analyze_differently(pre… 2      "\"a\… prep_… <NA>    second_…
#> 13 second_1… analyze_differently(pre… 1      "\"b\… prep_… <NA>    second_…
#> 14 second_2… analyze_differently(pre… 2      "\"b\… prep_… <NA>    second_…
config <- drake_config(plan)
vis_drake_graph(config)


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

plan <- plan %>%
  filter(is.na(first) | param1 == "1") %>%
  filter(is.na(second) | param1 == "2")

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-10-09 by the reprex package (v0.3.0)
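
If several groups need this treatment, the repeated `is.na(group) | condition` pattern above could be wrapped in a small helper. The following is only a sketch: `filter_group()` is a hypothetical function, not part of drake, and the data frame is a hand-written stand-in for the trace columns of a plan built with trace = TRUE (where traced values are stored as character strings).

```r
# Hypothetical helper (not part of drake): keep every row that does not belong
# to `group`, plus the rows of `group` for which `keep` is TRUE.
filter_group <- function(plan, group, keep) {
  in_group <- !is.na(plan[[group]])
  plan[!in_group | (keep & !is.na(keep)), , drop = FALSE]
}

# Stand-in for the trace columns of a traced plan (an assumption for
# illustration; a real plan also has a command column).
plan <- data.frame(
  target = c("data", "prep_1_a", "prep_2_a", "first_1_a", "first_2_a",
             "second_1_a", "second_2_a"),
  param1 = c(NA, "1", "2", "1", "2", "1", "2"),
  first  = c(NA, NA, NA, "first_1_a", "first_2_a", NA, NA),
  second = c(NA, NA, NA, NA, NA, "second_1_a", "second_2_a"),
  stringsAsFactors = FALSE
)

# Same logic as the two dplyr::filter() calls above.
plan <- filter_group(plan, "first", plan$param1 == "1")
plan <- filter_group(plan, "second", plan$param1 == "2")
plan$target
```

The helper only touches rows whose trace column for the group is non-missing, so upstream targets such as data and prep_* are left alone. Targets downstream of a filtered group would still need their own call.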

Decision about filtering

I reflected on this more, and while it probably would not be a terribly difficult proposition to implement filtering, I feel like it is a slippery slope. The DSL is already complicated, and I prefer not to get carried away.

Problem with original workaround

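That said, there is a problem with the workaround in https://github.com/ropensci/drake/issues/1026#issuecomment-540186418: combine() steps will not respond to post-hoc filtering. The command of a combine() target is generated before the filter runs, so it still refers to targets that the filter later removes.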

library(drake)
library(tidyverse)
plan <- drake_plan(
  data = 1,
  prep = target(
    structure(data + param1, class = param2),
    transform = cross(
      param1 = c(1, 2),
      param2 = c("a", "b")
    )
  ),
  analysis = target(
    mean(prep),
    transform = map(prep, param2, .id = c(param1, param2))
  ),
  results = target(
    analysis,
    transform = combine(analysis)
  ),
  trace = TRUE
) %>%
  filter(is.na(analysis) | param1 == "1")

# We have the correct connections.
config <- drake_config(plan)
vis_drake_graph(config)


# But not the correct command for the results target.
tail(plan$command, 1)[[1]]
#> list(analysis_1_a, analysis_2_a, analysis_1_b, analysis_2_b)

make(plan)
#> target data
#> target prep_1_a
#> target prep_1_b
#> target prep_2_a
#> target prep_2_b
#> target analysis_1_a
#> target analysis_1_b
#> target results
#> fail results
#> Error: Target `results` failed. Call `diagnose(results)` for details. Error message:
#>   object 'analysis_2_a' not found

Created on 2019-10-18 by the reprex package (v0.3.0)
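
A failure like this can be caught before calling make() by scanning each command for symbols that look like targets but are no longer in the plan. The sketch below uses base R's all.vars() for that. `dangling_targets()` is a hypothetical helper, not part of drake, and the plan here is a hand-built stand-in with the same target and command columns as the filtered plan above. The `prefix` argument is an assumption to avoid flagging ordinary global objects that are not targets.

```r
# Hypothetical helper (not part of drake): for each command, list the symbols
# that start with `prefix` but are not targets in the plan.
dangling_targets <- function(plan, prefix) {
  missing <- lapply(plan$command, function(command) {
    symbols <- all.vars(command)
    symbols <- symbols[startsWith(symbols, prefix)]
    setdiff(symbols, plan$target)
  })
  names(missing) <- plan$target
  missing[lengths(missing) > 0]
}

# Stand-in for the filtered plan: a target column and a list column of
# unevaluated commands (assumed layout, mirroring the plan printed above).
plan <- data.frame(
  target = c("analysis_1_a", "analysis_1_b", "results"),
  stringsAsFactors = FALSE
)
plan$command <- list(
  quote(mean(prep_1_a)),
  quote(mean(prep_1_b)),
  quote(list(analysis_1_a, analysis_2_a, analysis_1_b, analysis_2_b))
)

# Reports that `results` still refers to the two filtered-out targets.
dangling_targets(plan, "analysis_")
```

This only detects the problem. It does not rewrite the combine() command, which is why a different plan structure is suggested next.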

New workaround

Use two transformations: one for the targets that feed the downstream analysis (prep_in below), and another for the targets you want to leave out of it (prep_out below).

library(drake)
library(tidyverse)
plan <- drake_plan(
  data = 1,
  prep_in = target(
    structure(data + 1, class = param2),
    transform = map(param2 = c("a", "b"))
  ),
  prep_out = target(
    structure(data + 2, class = param2),
    transform = map(param2 = c("a", "b"))
  ),
  analysis = target(
    mean(prep_in),
    transform = map(prep_in, param2, .id = param2)
  ),
  results = target(
    analysis,
    transform = combine(analysis)
  )
)

# We have the correct connections.
config <- drake_config(plan)
vis_drake_graph(config)


# And now the correct command for the results target.
tail(plan$command, 1)[[1]]
#> list(analysis_a, analysis_b)

make(plan)
#> target data
#> target prep_in_a
#> target prep_in_b
#> target prep_out_a
#> target prep_out_b
#> target analysis_a
#> target analysis_b
#> target results

Created on 2019-10-18 by the reprex package (v0.3.0)
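
Applied to the workflow in the original question, the same idea might look like the sketch below: the common step is declared once per value of param1, and each analysis maps only over its own group. The target names prep_first and prep_second are made up for illustration, and load_data(), load_from_elsewhere(), common_step(), analyze_somehow(), and analyze_differently() are the placeholder functions from the question. drake_plan() does not evaluate the commands, so they do not need to exist to build the plan.

```r
library(drake)

plan <- drake_plan(
  data = load_data(),
  aux = load_from_elsewhere(),
  # Common step with param1 fixed at 1, feeding only the first analysis.
  prep_first = target(
    common_step(data, 1, param2),
    transform = map(param2 = c("a", "b"))
  ),
  # Common step with param1 fixed at 2, feeding only the second analysis.
  prep_second = target(
    common_step(data, 2, param2),
    transform = map(param2 = c("a", "b"))
  ),
  first = target(
    analyze_somehow(prep_first, param2),
    transform = map(prep_first, param2, .id = param2)
  ),
  second = target(
    analyze_differently(prep_second, aux, param2),
    transform = map(prep_second, param2, .id = param2)
  )
)

plan$target
```

The cost is that common_step() appears twice in the plan, which is the duplication the original question hoped to avoid. The benefit is that no post-hoc filtering is needed, so any later combine() step sees only targets that exist.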
