Nextflow: Add branch operator [was choice operator under dsl2]

Created on 25 Jun 2019 · 24Comments · Source: nextflow-io/nextflow

Channel.create() will not be supported under DLS-2.

I'd like to hear is how the old code like below should be rewritten, do you think, under DLS-2.

source = Channel.from 'Hello world', 'Hola', 'Hello John'
queue1 = Channel.create()
queue2 = Channel.create()

source.choice( queue1, queue2 ) { a -> a =~ /^Hello.*/ ? 0 : 1 }

queue1.subscribe { println it }

My ideas are:

1) If you want to use Channels.

source = Channel.from 'Hello world', 'Hola', 'Hello John'

source
    .choice  { a -> a =~ /^Hello.*/ ? 0 : 1 }
    .set { queue1, queue2 }

queue1.subscribe { println it }

2) If you use pipe notation

source = Channel.from 'Hello world', 'Hola', 'Hello John'

source
    .choice  { a -> a =~ /^Hello.*/ ? 0 : 1 } | >(proc0) >(proc1)

process proc0 {
    input: val(greeting)
    ...
}

process proc1 {
    input: val(greeting)
    ...
}

I would appreciate if you joined the discussion.

landsl2

Source

pachiras

👍1

All 24 comments

You may be right. The deprecation may be reconsidered.

pditommaso on 12 Jul 2019

Current options include:

def english = { it =~ /^Hello.*/ }
def not_english = { ! english(it) }

source | filter(english) | proc1
source | filter(not_english) | proc2

or, if you don't mind the repetition:

source | filter { it =~ /^Hello.*/ }    | proc1
source | filter { !(it =~ /^Hello.*/) } | proc2

... but I agree that it would be nice to have choice() back as a feature in DSL2.

robsyme on 2 Aug 2019

I've fully read this only now, and I really like the idea of the choice returning a pair of channels:

    .choice  { a -> a =~ /^Hello.*/ ? 0 : 1 }
    .set { queue1, queue2 }

Interesting also the idea of operator composition, however it's not possible to implement that syntax.

It could be possible to write something like this

    .choice  { a -> a =~ /^Hello.*/ ? 0 : 1 } | ( proc0 & proc1 )

Hower in the current implementation ( proc0 & proc1 ) execute both proc0 and proc1 with the same input.

pditommaso on 8 Aug 2019

adding @micans because he is interested in the topic

pditommaso on 8 Aug 2019

👍1

separate falls in the same class I think. create() allows a different way of organising the code, separating all channel code from all process code (see below). I am not 100% convinced by the argument that create() is misused; there are also cases where it is useful, and I am now running a Pipeline That Shall Not Be Named where the authors do all sorts of hell-raising tricks with most of the NF language, e.g.

    beforeScript "rsync -az ${outdir}/* ."
    afterScript "rsync -az ./* ${outdir}/"

(Bad code can be written in any language). I'll end with a small example of the channel/process separation; would this still be supported in DSL2 in some other way? create() can be avoided by rearranging the code but the point is that by grouping all the channel logic together the flow logic of the program is easier to grasp (imagine this scaling to 10 processes with as least as many channels between them).

This use case is quite possibly not on anyone else's radar and I don't want to make it into a huge thing. I encountered it only very recently and think it's cool.

// This is a variant on http://nextflow-io.github.io/patterns/index.html#_problem_19
params.includeB = false

ch_input = Channel.create()
ch_BC = Channel.create()

ch_input.flatMap().map { f -> [f.text.trim(), f] }.view()
  .tap { ch_AC }
  .until { !params.includeB }
  .set { ch_AB }

ch_AC.until { params.includeB }.mix(ch_BC).set{ ch_C }

// Above is all the channel plumbing, below are all the process definitions.

process processA {        // Create a bunch of files, each with its ow sample ID.
  output: file('*.txt') into ch_input
  script: 'for i in {1..7}; do echo "sample_$i" > f$i.txt; done'
}

process processB {
  input:  set val(sampleid), file(thefile) from ch_AB
  output: set val(sampleid), file('out.txt') into ch_BC
  script: "(echo 'B process'; cat $thefile; md5sum $thefile) > out.txt"
}

process processC {
  publishDir "results"
  input: set val(sampleid), file(a) from ch_C.view()
  output: file('*.out')
  script: "(echo 'C process'; cat $a) > ${sampleid}.out"
}

micans on 8 Aug 2019

Well, this is escalating on a debate on Channel.create, that's a bit off-topic here. Would you mind to open a separate issue to discuss this point copying the comment above?

pditommaso on 8 Aug 2019

Ah sorry, I got excited and did not read the title which was about choice syntax. Yes will do. I like the channel syntax:

    .choice  { a -> a =~ /^Hello.*/ ? 0 : 1 }
    .set { queue1, queue2 }

Perhaps it would be useful to bear in mind a future generalisation towards three channels? As for the DSL2 syntax I have to catch up. I assume it will be tough to capture the full expressiveness of current channel syntax in DSL2, and additionally things like when and Publishdir. Is there a matrix of what things (1) fall definitely in the scope of DSL2, (2) fall definitely outside its scope (3) grey area?

micans on 9 Aug 2019

Indeed, I was thinking the same. The above syntax only allows a binary choice. It would be better to have something more general.

I think it should be possible instead of something like:

    .choice  foo: { a ->  a =~ /^Hello.*/ }, bar: { a -> !(a =~ /^Hello.*/) }

Pros: does not even need the set because it could be created by the choice itself
Cons: you need to repeat the condition.

I think some concepts from switch and pattern matching can be applied here
http://cr.openjdk.java.net/~briangoetz/amber/switch-rehab.html

pditommaso on 9 Aug 2019

I think the con of repeating the condition is a pretty big one, and this will become more onerous if there are more cases. That code looks painful to me. A link to some form of switch is natural (that article is a lot to wade through!), I like that. The following crazy idea is probably not going to fly but I'll ask anyway; would it be possible to pre-declare the channels in a new choice implementation, so something like this with two closures.

.choice { new_channel1, new_channel2}  { a -> a =~ /^Hello.*/ ? new_channel1 : new_channel2 }

The idea would be to avoid having to refer to the channels by index (0, 1, 2 ..) but be able to use the names themselves, like you have in your example .choice foo: { a -> a =~ /^Hello.*/ }, bar: { a -> !(a =~ /^Hello.*/) } (with the downside of repeated code).

micans on 9 Aug 2019

I agree that condition repetition is painful.

What you are proposing is possible as shown below:

.choice({ new_channel1; new_channel2})  { a -> a =~ /^Hello.*/ ? new_channel1 : new_channel2 }

.choice('new_channel1', 'new_channel2')  { a -> a =~ /^Hello.*/ ? new_channel1 : new_channel2 }

which is not the best for a DSL syntax pint of view.

pditommaso on 9 Aug 2019

The first one looks nice. Does the DSL syntax point of view apply to the second only or to both? The first example is not currently possible, right? It would be extensible to more cases, and presumably work with a switch-type syntax.

Sorry to keep editing this. Something like the first option is possible now already, but the channels have to be created first with Channel.create(), so what I mean is whether it is possible to implement this syntax without having to use Channel.create().

micans on 9 Aug 2019

I think we're all looking ascant at the possibility of condition repetition. Particularly because the proposal:

source.choice foo: { a ->  a =~ /^Hello.*/ }, 
              bar: { a -> !(a =~ /^Hello.*/) }

Isn't a whole heap clearer than what is already possible with existing syntax (thanks to the excellent channel auto-forking):

source | filter { it =~ /^Hello.*/ }    | foo
source | filter { !(it =~ /^Hello.*/) } | bar

I'd be careful about generalisation of the syntax to more than two channels though. In the simple case of two output channels and a closure that _has_ to return a binary, type checking ensures that all of the elements of source will end up in exactly one of the output channels. With multiple conditions, we stray away from the type system guarantees and the semantics become more complicated,

Can an item end up in more than one output channel or just the first channel that fits?
What happens if an item doesn't match any of the conditions?
Do we need semantics for a fallthrough conditions for non-matching items?

If we have >2 output channels and multiple conditions to test, perhaps using multiple filter conditions results in more readable code?

Note that this is not an argument against choice in cases of two possible output channels and a closure that returns a boolean.

robsyme on 9 Aug 2019

The first one looks nice. Does the DSL syntax point of view apply to the second only or to both? The first example is not currently possible, right? It would be extensible to more cases, and presumably work with a switch-type syntax.

The first it's a bit better, but still found noisy the need to specify round and curly brackets. Also, the need to return an integer to match a list of possible channels it's a bit quirky.

Considering @robsyme comment on the need to have a more general solution, I think it could be possible to write something like that:

ch.choice { 
   x: <bool expression>; 
   y: <bool expression>;
   ... ; 
   otherwise: it // optional fallback clause
}

With some Groovy AST magic, it could be possible to convert it behind the scene to:

ch.choice {
  if( <x bool expression> ) {
   return [x: it]
  }
  else if ( <y bool expression> ) {
      return [y: it]
  }
  ... 

  else { 
   return [otherwise: it]  
  }
}

Quite neat, isn't it?

pditommaso on 15 Aug 2019

👍1

That is quite neat, but I lost track a little bit. In this case, are x and y channel names, without the need to pre-declare them? The occurrence of it suggests that users can have an expression with it inside to allow transformation as well, but that's only in the otherwise case. Also in a switch-type syntax it is more customary to have the cases first, no?

ch.choice { 
   <bool expression>: x; 
   <bool expression>: y;
   ... ; 
   default: z // non-optional fallback clause
}

I've changed to a non-optional default fallback clause here, as otherwise there may be a danger that items fall between the cracks. The argument could be null if that were the intention. Hope this makes sense, not sure I understand the [x: it] syntax; I interpret it as sending it to channel x. Is there a reason in your example to put the boolean expression last?

micans on 15 Aug 2019

It makes a lot of sense but it cannot be implemented :/

We need to move in the space of code parsable by the groovy compiler and transform it to a valid statement(s).

The trick here is to use the label statement (yes java and groovy have goto labels!) and use it to identify the code to be evaluated. Therefore only a string constant can be used on the left of the : symbol, otherwise the compiler break.

Same for default, I agree that's better but it's a reserved keyword and therefore cannot be used. If you a better alternative to otherwise is welcome.

The [x: it] syntax, is a trick to return the pair channel name and value to bind to the channel as a map object.

pditommaso on 15 Aug 2019

👍1

I'm not sure what I was saying about switch syntax actually, it did not make sense. Your explanation does! Just to check: there is no issue with the need for pre-declaring channels with channel.Create()? I will also mention separate here for which there is currently a similar need to use channel.Create().

I see the point about otherwise, but return [otherwise: it] does not make sense to me, it has become a channel name suddenly. Could you not have a fallback clause like this:

ch.choice { 
   x: <bool expression>; 
   y: <bool expression>;
   ... ; 
   z: true;        // natural way of coding a fallback clause/channel
}

micans on 15 Aug 2019

The z: true; is smart, I like. I've almost implemented this.

The cool thing is that it also allows the specification of an optional arbitrary code block for each clause to customise the resulting value. For example:

ch.choice { 
   x: <bool expression>
   return <stament>  

   y: <bool expression>
   .. 
   return <stament>

   z: true
   .. 
   return <stament>
}

Using this syntax, we have basically the same semantics of separate for which there's the same problem. Two birds with a stone!

pditommaso on 15 Aug 2019

👍1

I think it would be nice to have a new name for this implementation. Some possibilities:

fork
branch
switchTo (switch cannot be used)
switchOn
caseOf
transfer

Other ideas?

pditommaso on 16 Aug 2019

About

ch.newchoice { 
   x: <bool expression>
   return <stament>  

   y: <bool expression>
   .. 
   return <stament>

   z: true
   .. 
   return <stament>
}

Some thoughts. (1) is the <return> mandatory? Or could it be optional with it implied? (2) I see, separate would become

ch.newchoice {
   x: true
   return [it, it+1]

   y: true
   return [it*2, it*3]
}

(3) I remember from a discussion on gitter there was a case where changing an item in one channel would lead to the item be changed in another channel as well; e.g. a pointer or shallow copy was involved. I forgot which operator was involved. Is that something that needs consideration here? (4) I assume it is still a normal closure, so it could start of with ch.newchoice { myfile -> ...}.
(5) About the name ... I quite like branch, a lot more than the other proposals. It is fairly neutral, descriptive, and covers both choice and separate well. (6) I assume there is no need for Channel.create() any more as branch (or other name) is a new operator and thus that can be under its control.

micans on 16 Aug 2019

OK, as usual, it was more complicated than expected but I'm quite happy with this implementation. I've pushed on the master using the branch name.

I've added also some documentation that should answer the questions above. Regarding the shallow copy, it's unrelated to this .

pditommaso on 17 Aug 2019

That's great! (are you still on holiday!?). First question; what about separate? The branch documentation states: On the first expression that evaluates to a true value, the current item is bound to a named channel as the label identifier. This suggest items can only be bound to a single channel and hence this does not cover separate functionality (yet)?

micans on 18 Aug 2019

This suggest items can only be bound to a single channel and hence this does not cover separate functionality (yet)?

Ouch! I was missing the separate behaviour! Then will need another implementation to replace also separate.

pditommaso on 19 Aug 2019

I assumed that with all conditions set to true, you would get the separate behaviour; but I see now that would require evaluation of all conditions for all items. And that in turn will make the 'default' or 'otherwise'-type behaviour of a final true clause useless. I see the problem ...

micans on 19 Aug 2019

Yes, the code gets translated into a sequence of if( cond ) else if( cond ) .. etc

pditommaso on 19 Aug 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings