Drake: Allow custom names in transformations

Created on 20 Apr 2020  Â·  11Comments  Â·  Source: ropensci/drake

Prework

Proposal

I'd like to be able to explicitly specify names of targets created with map() or cross().
I don't want to have to rely on variable names (see #1220) or remember numbers (.id = FALSE), but simply supply a character vector with my own meaningful names.

Something like this:

plan <- drake_plan(
    my_fun(path = file_in(my_path)),
    transform = map(
        my_path = c(
            "/nothing/interesting/METHOD1/stuff/DATA1",
            "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
        ),
        .id = FALSE,
        .custom_names = c(
            "METHOD1_DATA1",
            "METHOD2_DATA2"
        )
    )
)

Having short meaningful names would help in downstream prototyping or exploratory analysis. Currently I have to do short_name <- loadd("a_la_la_la_la_long_long_li_long_long_name") for each target.

new feature

Most helpful comment

I let it sit for a while and changed my mind because implementation turned out to be trivially easy. Implemented in 312e12404d127d66f72caa440b80027333d513ed.

library(drake)
drake_plan(
  x = target(
    f(x),
    transform = map(x = !!seq_len(2), .names = c("a", "b"))
  ),
  y = target(
    f(w, x),
    transform = cross(
      w = !!seq_len(2),
      x,
      .names = c("aa", "ab", "ba", "bb")
    )
  ),
  z = target(
    g(y),
    transform = map(y)
  ),
  final = target(
    h(z),
    transform = combine(z, .by = x, .names = c("final1", "final2"))
  )
)
#> # A tibble: 12 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 final1 h(z_aa, z_ab)
#>  2 final2 h(z_ba, z_bb)
#>  3 a      f(1L)        
#>  4 b      f(2L)        
#>  5 aa     f(1L, a)     
#>  6 ab     f(2L, a)     
#>  7 ba     f(1L, b)     
#>  8 bb     f(2L, b)     
#>  9 z_aa   g(aa)        
#> 10 z_ab   g(ab)        
#> 11 z_ba   g(ba)        
#> 12 z_bb   g(bb)

Created on 2020-04-23 by the reprex package (v0.3.0)

I had considered this feature before, but I resisted on general principles. It breaks the abstraction of the interface, which is usually unwise from a software development perspective. I have made the mistake of being too eager to add flexibility over the last 3-ish years of developing drake. In other words, I was not lazy enough as a programmer, and I had little experience developing a tool this complicated.

If I could go back and develop the whole package from scratch, I would recommend clever workarounds for as long as possible before immediately implementing new features. For example, triggers was not a good idea in hindsight. The implementation was extremely complicated to create and even more burdensome to maintain, and I now strongly suspect the most important functionality can be completely covered with global objects, cancel(), and cancel_if(), which are much simpler.

For custom names, it took a go at implementation to see this, but it turned out to have less technical debt than I originally expected.

All 11 comments

Sorry, I do not have plans to implement 100% custom names. You can kind of work around this by defining a custom grouping variable and feeding it to .id.

library(drake)
custom_ids <- c("method1_data1", "method2_data2")
plan <- drake_plan(
  x = target( # Start with a short name.
    my_fun(path = file_in(my_path)),
    transform = map(
      my_path = c(
        "/nothing/interesting/METHOD1/stuff/DATA1",
        "/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"
      ),
      custom_ids = !!custom_ids, # The !! is important.
      .id = custom_ids,
    )
  )
)

plan
#> # A tibble: 2 x 2
#>   target         command                                                        
#>   <chr>          <expr>                                                         
#> 1 x_method1_dat… my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA…
#> 2 x_method2_dat… my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/st…

drake_plan_source(plan)
#> drake_plan(
#>   x_method1_data1 = my_fun(path = file_in("/nothing/interesting/METHOD1/stuff/DATA1")),
#>   x_method2_data2 = my_fun(path = file_in("/nothing/interesting/METHOD2_BLA_BLA/stuff/DATA2"))
#> )

Created on 2020-04-21 by the reprex package (v0.3.0)

Hm, I was just about to write a similar feature request. The arguments I map() and cross() over aren't really suited to generate target names from (long variable names, vectors, ...). I wanted to suggest that one just supplies named vectors to the parameters that then are used in the target name for those parameters, i.e., something like

plan <- drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = cross(
      param1 = c(
        first = "FirstLongVariableName",
        second = "SecondLongVariableName"
      ),
      param2 = c(
        b1 = "LongVariableName_1",
        b2 = c("LongVariableName_1",  "LongVariableName_2")
      )
    )
  )
)

with resulting target names

model_first_b2
model_second_b1
model_second_b2

If that won't be possible, would you mind explaining how to work around this for cross()? Your suggestion above only works for map(), right? Thank you!!

You can supply a .data argument to map() with the expanded grid.

library(drake)
library(tidyverse)

grid <- expand_grid(
  param1 = c("FirstLongVariableName", "SecondLongVariableName"),
  param2 = c("LongVariableName_1", "LongVariableName_2")
) %>%
  mutate(
    id1 = recode(
      param1,
      FirstLongVariableName = "first",
      SecondLongVariableName = "second"
    )
  ) %>%
  mutate(
    id2 = recode(
      param2,
      LongVariableName_1 = "b1",
      LongVariableName_2 = "b2"
    )
  )

grid
#> # A tibble: 4 x 4
#>   param1                 param2             id1    id2  
#>   <chr>                  <chr>              <chr>  <chr>
#> 1 FirstLongVariableName  LongVariableName_1 first  b1   
#> 2 FirstLongVariableName  LongVariableName_2 first  b2   
#> 3 SecondLongVariableName LongVariableName_1 second b1   
#> 4 SecondLongVariableName LongVariableName_2 second b2

drake_plan(
  model = target(
    my_fun(param1, param2),
    transform = map(.data = !!grid, .id = c(id1, id2))
  )
)
#> # A tibble: 4 x 2
#>   target          command                                               
#>   <chr>           <expr>                                                
#> 1 model_first_b1  my_fun("FirstLongVariableName", "LongVariableName_1") 
#> 2 model_first_b2  my_fun("FirstLongVariableName", "LongVariableName_2") 
#> 3 model_second_b1 my_fun("SecondLongVariableName", "LongVariableName_1")
#> 4 model_second_b2 my_fun("SecondLongVariableName", "LongVariableName_2")

Created on 2020-04-21 by the reprex package (v0.3.0)

Thank you very much for the example, @wlandau! This works, but I had to use case_when() to account for cases when the input is a vector and not a single character/number (like b2 in my example). I still think that just using named vectors would be an elegant solution and probably easier to read/understand, but I understand that you won't implement it. Thanks for your help!

Yes, thanks for these hacks @wlandau.
Just out of curiosity: why do you think custom names are a bad idea? Are they unnecessary, too time-consuming to implement, or maybe would somehow interfere with reproducibility?

I let it sit for a while and changed my mind because implementation turned out to be trivially easy. Implemented in 312e12404d127d66f72caa440b80027333d513ed.

library(drake)
drake_plan(
  x = target(
    f(x),
    transform = map(x = !!seq_len(2), .names = c("a", "b"))
  ),
  y = target(
    f(w, x),
    transform = cross(
      w = !!seq_len(2),
      x,
      .names = c("aa", "ab", "ba", "bb")
    )
  ),
  z = target(
    g(y),
    transform = map(y)
  ),
  final = target(
    h(z),
    transform = combine(z, .by = x, .names = c("final1", "final2"))
  )
)
#> # A tibble: 12 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 final1 h(z_aa, z_ab)
#>  2 final2 h(z_ba, z_bb)
#>  3 a      f(1L)        
#>  4 b      f(2L)        
#>  5 aa     f(1L, a)     
#>  6 ab     f(2L, a)     
#>  7 ba     f(1L, b)     
#>  8 bb     f(2L, b)     
#>  9 z_aa   g(aa)        
#> 10 z_ab   g(ab)        
#> 11 z_ba   g(ba)        
#> 12 z_bb   g(bb)

Created on 2020-04-23 by the reprex package (v0.3.0)

I had considered this feature before, but I resisted on general principles. It breaks the abstraction of the interface, which is usually unwise from a software development perspective. I have made the mistake of being too eager to add flexibility over the last 3-ish years of developing drake. In other words, I was not lazy enough as a programmer, and I had little experience developing a tool this complicated.

If I could go back and develop the whole package from scratch, I would recommend clever workarounds for as long as possible before immediately implementing new features. For example, triggers was not a good idea in hindsight. The implementation was extremely complicated to create and even more burdensome to maintain, and I now strongly suspect the most important functionality can be completely covered with global objects, cancel(), and cancel_if(), which are much simpler.

For custom names, it took a go at implementation to see this, but it turned out to have less technical debt than I originally expected.

Thank you so much for implementing custom names, @wlandau!! I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments. I guess one can apply the approach below and use named vectors (to have a clear value-name correspondence), then construct the names for the targets programmatically before the call to drake_plan() (I tried to do so within the call, but I didn't find a way to refer to the names resulting from a previous target). But even that might turn out to be error prone.

x_arg1 <- c(a = 1L, b = 2L)
x_arg2 <- c(c = 3L, d = 4L)
x_names <- paste0("xtarget_", paste(names(x_arg1), names(x_arg2), sep = "_"))

y_arg  <- c(e = 5L, f = 6L)
y_names <- paste0("ytarget_", do.call(paste, c(expand.grid(names(y_arg), x_names), sep = "_")))

final_names <- paste0("finaltarget_", x_names)

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2,
      .names = !!x_names
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget,
      .names = !!y_names
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget, .names = !!final_names)
  )
)

which gives the same result as the drake's standard naming convention, with the argument names switched out:

drake::drake_plan(
  xtarget = target(
    f(x, y),
    transform = map(
      x = !!x_arg1,
      y = !!x_arg2
    )
  ),
  ytarget = target(
    f(w, xtarget),
    transform = cross(
      w = !!y_arg,
      xtarget
    )
  ),
  ztarget = target(
    g(ytarget),
    transform = map(ytarget)
  ),
  finaltarget = target(
    h(ztarget),
    transform = combine(ztarget, .by = xtarget)
  )
) 

I like the flexibility, but I wonder whether it might also be more error prone as one now has to think through and double check the order of arguments and expansions to provide a proper name vector. Especially, when changing the code (e.g., swapping of arguments), this could easily lead to working with "incorrectly" named targets.

I share your concerns, and this was part of my initial resistance. I recommend avoiding .names unless target names get truly unappealing or unmanageable.

Personally, I am only interested in modifying drake's standard naming convention by plugging in more sensible names for target's arguments.

Fundamental changes to behavior get tricky because drake needs to be as back-compatible with previous versions as possible. And relying on vector/list names is another dimension to behavior that could easily get confusing and complicated for users to learn, even when explained properly. Also, it is too easy to forget when your vector has names and when it does not, and this would make plans brittle.

This is all true! I think I will stay with using .data for now. Seems like the most robust solution. Thank you!!

I will keep using .names for now to test it out.
One inconvenience is as @januz noticed, that the names of the previous targets are not accessible inside the following transformations which would enable more elegant construction of new names.

Here, I'm trying to prepend the previous names:

plan <- drake::drake_plan(

    # Read haplotype seqs and clean their names 
    haps = target(
        read_haps(path = drake::file_in(path)),
        transform = map(
            path = c(
                "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
                "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
            ),
            .names = c("abqr_001",
                       "csnv_t5_tf001",
                       "csnv_t5_tf01",
                       "csnv_t10_tf01")
        )
    ), 

    # Align the haps using Muscle, ClustalO and ClustalW with default settings
    aln = target(
        msa::msa(inputSeqs = haps, method = methods),
        transform = cross(
            haps,
            methods = c("Muscle", "ClustalOmega", "ClustalW"),
            .names = outer(
                c("Muscle", "ClustO", "ClustW"),
                haps,
                paste,
                sep = "_"
            ) %>% as.vector()
        )
    )
)

#> Error in outer(c("Muscle", "ClustO", "ClustW"), haps, paste, sep = "_") : 
#> object 'haps' not found

Of course, I can just copy the first .names vector and paste it in place of haps inside the second target. However, if haps was of length 40 instead of 4 then pasting the whole vector again would make the plan less readable.

You can work with external variables to support your plan, such as the hap_names vector below.

library(drake)

hap_names <- c(
  "abqr_001",
  "csnv_t5_tf001",
  "csnv_t5_tf01",
  "csnv_t10_tf01"
)

plan <- drake_plan(
  haps = target(
    read_haps(path = drake::file_in(path)),
    transform = map(
      path = c(
        "/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes",
        "/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"
      ),
      .names = !!hap_names
    )
  ), 
  aln = target(
    msa::msa(inputSeqs = haps, method = methods),
    transform = cross(
      haps,
      methods = c("Muscle", "ClustalOmega", "ClustalW"),
      .names = as.vector(outer(
        c("Muscle", "ClustO", "ClustW"),
        !!hap_names,
        paste,
        sep = "_"
      ))
    )
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   Muscle_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "Muscle"),
#>   ClustO_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalOmega"),
#>   ClustW_abqr_001 = msa::msa(inputSeqs = abqr_001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "Muscle"),
#>   ClustO_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf001 = msa::msa(inputSeqs = csnv_t5_tf001, method = "ClustalW"),
#>   Muscle_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "Muscle"),
#>   ClustO_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t5_tf01 = msa::msa(inputSeqs = csnv_t5_tf01, method = "ClustalW"),
#>   Muscle_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "Muscle"),
#>   ClustO_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalOmega"),
#>   ClustW_csnv_t10_tf01 = msa::msa(inputSeqs = csnv_t10_tf01, method = "ClustalW"),
#>   abqr_001 = read_haps(path = drake::file_in("/home/lejno/Desktop/aBayesQR-nf/out_bwa_001/haplotypes")),
#>   csnv_t5_tf001 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf001/haplotypes")),
#>   csnv_t5_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t5_tf01/haplotypes")),
#>   csnv_t10_tf01 = read_haps(path = drake::file_in("/home/lejno/Desktop/cliqueSNV-nf/out_t10_tf01/haplotypes"))
#> )

Created on 2020-04-24 by the reprex package (v0.3.0)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

wlandau-lilly picture wlandau-lilly  Â·  7Comments

bart1 picture bart1  Â·  7Comments

htlin picture htlin  Â·  4Comments

wlandau picture wlandau  Â·  8Comments

matthiasgomolka picture matthiasgomolka  Â·  8Comments