I would like a tool to optimize the parameters of an arbitrary command line program for a variable output by that tool (on its standard output, say). For example, I want to run genome assemblies varying the value of the parameter _k_ to optimize the _N50_ output by the assembler. It's easy to do this optimization now using a grid search, running the tool for all possible values of that parameter, but I'd like to do something a bit smarter and run fewer assemblies: for example, assemble at _k_=64, 128, 192 and, based on the results of those assemblies, pick new values of _k_ at which to assemble. The real power would come in optimizing multiple variables. Is optimization a feature that you would consider implementing in Nextflow?
I haven't yet tried this tool, but for an example that is not integrated with the job scheduler, see https://github.com/sseemayer/ParOpt#readme
This sounds like quite a cool proposal. However, I need to understand whether it can fit in the Nextflow model. How would you define the performance metrics in this context? Would it be enough to capture one or more outputs produced by the tool? I think not.
Also, once you have learned the optimal parameters, how would you apply them? Would you just use the output produced by the tool with the optimal parameters, or would you want to re-execute the tool, using a superset of those params, one or more times?
> How would you define the performance metrics in this context?
The target rule being optimized would output the metric on standard output.
> Also, once you have learned the optimal parameters, how would you apply them?
I'd be inclined to save the results of all the jobs run so far in a directory structure based on the parameter names and values. When the optimization is complete (the stopping criteria would have to be specified), Nextflow would output the optimal values and the path to the results.
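To make the two answers above concrete, here is a minimal Python sketch of a single trial. It is only an illustration and not part of Nextflow or ParOpt: `trial_dir` and `run_trial` are hypothetical names, the `name=value` directory layout is one possible convention, and the sketch assumes the metric is the last non-empty line the tool prints on standard output.

```python
import subprocess
from pathlib import Path


def trial_dir(base_dir, params):
    """Build a results path such as base/k=64/s=2 from parameter names and values."""
    path = Path(base_dir)
    for name in sorted(params):  # sorted so the layout does not depend on dict order
        path = path / f"{name}={params[name]}"
    return path


def run_trial(command_template, params, base_dir):
    """Run one job in its own directory, keep its stdout, and return the metric."""
    workdir = trial_dir(base_dir, params)
    workdir.mkdir(parents=True, exist_ok=True)
    # Each argument may contain placeholders such as {k}, filled in from params.
    argv = [arg.format(**params) for arg in command_template]
    result = subprocess.run(
        argv, cwd=workdir, capture_output=True, text=True, check=True
    )
    (workdir / "stdout.txt").write_text(result.stdout)
    # Assumption: the metric is the last non-empty line of standard output.
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    return float(lines[-1])
```

An optimizer would call `run_trial` once per parameter set and, at the end, report the best parameters together with `trial_dir(base_dir, best_params)` as the path to the results.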
It sounds feasible and I like the idea, but it would require a non-trivial effort. Any chance of a contribution on your side?
I was considering proposing a topic like this for the Hackseq hackathon in Vancouver before ASHG 2016 in October. See http://www.hackseq.com . I hadn't yet decided on an implementation platform. I wanted to check in with you to see if you thought it was a good fit for Nextflow. Any chance you'll be in Vancouver, Canada for ASHG 2016? Abstract deadlines are tomorrow for both ASHG 2016 and Hackseq. I'm helping organize Hackseq and live in Vancouver. Would you be interested in co-supervising this project, in person or remotely?
Yes, I think it could definitely fit in Nextflow. Maybe having something in the DSL like this:
process foo {
input:
file bam from ..
optimise:
val x from range( .. )
val y from range( .. )
"""
tool-cmd-line $x $y $bam
"""
}
Though it's unlikely I can work on it in the short run, I will try to give it a try at my earliest convenience, and I can certainly advise if you manage to propose it for Hackseq.
Unfortunately I think there is very little chance that I can join this hackathon. I'm going to the OBF Codefest and BOSC conference in Orlando this year, which makes it very difficult for me to also come to Vancouver in October. But I will seriously consider it for next year.
Alternatively, if you could manage to isolate the optimisation logic in a running Java (or Groovy) skeleton, I could try to integrate it in Nextflow much more quickly.
It doesn't need to be a parallel implementation, nor does it need to execute real tasks. It only needs to implement the logic to converge to the optimal parameters. What do you think?
I'm not particularly savvy in Java or Groovy. I'd probably implement it in one of C++, Python, Ruby or R. I planned to test out ParOpt and see if it more or less does what I need. I was going to start by having it submit just one job at a time to the scheduler, and block until that job is complete. ParOpt is only 236 lines of Python, so the optimization logic itself is not that complex. Managing the scheduler and parallel execution will be the trickier task, as I'm sure you could guess.
OK, in that case I will take a look at the ParOpt code.
I was giving this feature request a try and I'm starting to think that, to some extent, it's already possible to run such an optimisation with Nextflow. At the end of the day, what you are proposing is to run a command multiple times over a finite set of parameters and then pick the best result.
Currently you could do something like this:
process foo {
input:
file 'reads.fa' from reads
each k from 64, 128, 192
output:
file 'k*/contigs.fa' into contigs
"""
abyss-pe -C k$k in=reads.fa
"""
}
This runs one assembly for each k in the specified range (all using the same read file). Of course it is still necessary to pick the best result, which can be done with a separate process. Note that when specifying multiple each statements, a new task is executed for each combination of the provided values. Thus, if I'm not wrong, it's the same behaviour as a grid search.
In light of this, can we say that the goal of this feature request is to have a built-in mechanism in the process scope that allows one to pick the best result given a set of parameters, instead of having to use a separate process/logic? Does that make sense?
Nice example, by the way. :wink: Yes, Nextflow already implements a grid search, and a second rule bar could depend on the output of foo and select the optimal value of _k_.
Where it gets fun is with multiple parameters, say _s_ and _n_. A grid search would examine all possible combinations of _s_ and _n_. A line search would pick a value for _s_, optimize _n_, then optimize _s_, and go back and forth until the stopping condition. Even that simple algorithm would be a good start, and it's easy to extend to multiple parameters. Two other multivariate optimization algorithms that are easy to describe and implement are twiddle and amoeba.
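As an illustration of the line search described above, here is a minimal Python sketch. It assumes each parameter has a finite list of candidate values and that larger metric values are better. The function name `line_search`, the choice of starting point and the `max_sweeps` cap are assumptions made for this example, not anything that exists in Nextflow.

```python
def line_search(metric, space, max_sweeps=10):
    """Maximise metric(params) by optimising one parameter at a time."""
    cache = {}

    def evaluate(params):
        # Each distinct parameter set stands for one job, so run it only once.
        key = tuple(sorted(params.items()))
        if key not in cache:
            cache[key] = metric(params)
        return cache[key]

    # Assumption: start from the middle candidate of every parameter.
    current = {name: values[len(values) // 2] for name, values in space.items()}
    for _ in range(max_sweeps):
        before = dict(current)
        for name, values in space.items():
            # Hold the other parameters fixed and scan this one.
            current[name] = max(values, key=lambda v: evaluate({**current, name: v}))
        if current == before:  # stopping condition: a full sweep changed nothing
            break
    return current, evaluate(current), len(cache)


def toy_metric(params):
    # Hypothetical stand-in for a real job, peaking at s=5, n=5.
    return -(params["s"] - 5) ** 2 - (params["n"] - 5) ** 2


space = {"s": [1, 2, 3, 4, 5], "n": [5, 6, 7, 8, 9]}
best, score, jobs = line_search(toy_metric, space)
print(best, score, jobs)
```

On this toy metric the search evaluates fewer parameter sets than the 25 a full grid would need. For a metric where the parameters interact strongly, a line search can stop at a point that is not the global optimum, which is where methods like twiddle or amoeba may do better.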
Even in the univariate case there's room for improvement. Let's say for example the peak NG50 occurs at _k_=152. Rather than test all values of _k_ between 1 and 256, it could use a binary-search-like pattern that runs the following search pattern, with the best NG50 shown at each step in bold:
It finds the optimal value for _k_ within a tolerance of 8 using 10 assemblies, whereas a grid search requires 32 assemblies.
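A rough Python sketch of a search in this spirit follows. It is not necessarily the exact pattern described above: it assumes the metric has a single peak, starts from the coarse grid _k_=64, 128, 192 mentioned at the top of the thread, and then repeatedly halves the step around the best value seen so far. The names `coarse_to_fine_search` and `toy_ng50` are hypothetical.

```python
def coarse_to_fine_search(metric, lo, hi, step, tolerance):
    """Maximise metric(k) for integer k in [lo, hi], assuming a single peak."""
    cache = {}

    def evaluate(k):
        # Each distinct k stands for one (expensive) run, so never repeat it.
        if k not in cache:
            cache[k] = metric(k)
        return cache[k]

    # Coarse pass over lo, lo + step, ... up to hi.
    best = max(range(lo, hi + 1, step), key=evaluate)
    while step > tolerance:
        step //= 2
        neighbours = [k for k in (best - step, best + step) if lo <= k <= hi]
        # Listing best first keeps the current value on ties.
        best = max([best] + neighbours, key=evaluate)
    return best, len(cache)


def toy_ng50(k):
    # Hypothetical stand-in for an assembly run, peaking at k=152.
    return -(k - 152) ** 2


print(coarse_to_fine_search(toy_ng50, lo=64, hi=192, step=64, tolerance=8))
```

On this toy metric the sketch returns _k_=152 after 9 evaluations. A real NG50 curve is noisy and may have several local peaks, so the single-peak assumption would need checking before relying on a scheme like this.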
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any thoughts on including some simple optimization in Nextflow? It's okay if the answer is not a priority.
Well, four years after this issue was opened, I don't think replying that it's not a priority is a reasonable answer 😬.
The reality is that this is one of my favorite enhancements for the NF language, but I haven't had any time to look at it so far. I really encourage the community to draft a proposal. Happy to give advice on a possible implementation.
I have explored using Nextflow for a grid-search process in the context of ML; however, it relies completely on the facilities provided by the wrapped tool, in this case the H2O.ai library.
I am guessing that to implement this within Nextflow, there needs to be some kind of feedback provided by the tool; in the case of ML it's a metric like AUC, accuracy, etc. We could always explicitly add a sed/grep script to the process via afterScript and use that for the feedback. This is inspired by the way nf-core extracts version info from a process.
In addition to this metric, syntactically, we could have something like an optimize option for the basic param types such as integers, floats and strings.
process foo {
input:
file 'reads.fa' from reads
k from 64, 128, 192, optimize: true
j from "a", "b", "c", optimize: true
output:
file 'k*/contigs.fa' into contigs
"""
abyss-pe -C k$k in=reads.fa
"""
afterScript 'SOME_BASH_COMMAND_OR_SCRIPT', optimization: 'max', strategy: 'brute_force'
}
Then based on the variables j and k and the output of afterScript, Nextflow could selectively launch newer processes as per the grid search strategy. The values provided for the j and k variables would be used to create the search-space.
We could also have a stopping metric like maxGridSearch 10 to limit the overall number of process instances launched as part of the grid search.
To begin with, we could implement a brute-force strategy and keep only the foo instance with the maximum afterScript value as the final version. Later on, we could add other strategy options such as random or evolutionary.
I could be off the tempo here, but this is just a suggestion for this feature.
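To show how the pieces suggested above could fit together, here is a small Python sketch of the selection logic only. The option names `optimization`, `strategy` and the `max_evals` cap mirror the proposed `optimization: 'max'`, `strategy: 'brute_force'` and `maxGridSearch` settings, but the function `search` itself is hypothetical and nothing here is existing Nextflow behaviour.

```python
import itertools
import random


def search(metric, space, optimization="max", strategy="brute_force",
           max_evals=None, seed=0):
    """Score parameter combinations and return (best_score, best_params)."""
    names = list(space)
    # The search space is every combination of the provided values.
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*space.values())]
    if strategy == "random":
        # Visit the same combinations in a shuffled, reproducible order.
        random.Random(seed).shuffle(candidates)
    elif strategy != "brute_force":
        raise ValueError(f"unknown strategy: {strategy}")
    if max_evals is not None:
        # Stopping metric: cap the number of process instances launched.
        candidates = candidates[:max_evals]
    scored = [(metric(params), params) for params in candidates]
    pick = max if optimization == "max" else min
    return pick(scored, key=lambda pair: pair[0])


def toy_metric(params):
    # Hypothetical stand-in for the value reported by afterScript.
    return params["k"] + {"a": 0, "b": 1, "c": 2}[params["j"]]


space = {"k": [64, 128, 192], "j": ["a", "b", "c"]}
print(search(toy_metric, space))
```

With `strategy="random"` and a small `max_evals`, the same function gives a simple random search over the grid. An evolutionary strategy would need feedback between rounds and does not fit this one-pass shape.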