Galaxy: Workflows should not run with datasets from failed collections

Created on 19 Feb 2020 · 18Comments · Source: galaxyproject/galaxy

Currently, when a tool run that produces a collection fails, then the collection itself is shown as being in an error state, but collection elements that have already been generated get discovered and are shown as ok/green.
Now when, in a workflow, the collection serves as input for another tool that workflow keeps running with those ok datasets, ignoring the collection's failed status.
IMO, this shouldn't happen. If a tool fails this should be respected for individual datasets and collections alike, and should cause subsequent steps to be suspended.

Example to reproduce:

Generate a "partially failed" collection with the NCBI Accession Download tool with e.g. two accession numbers, one real, the other one non-existing, like:
```
AY278554.2
AY278554.23
```
Leave How to handle download failures at its default Abort with error on first failure.
Feed the collection into another tool, like Fasta Width formatter.

The second tool will operate happily on the one successfully downloaded fasta dataset, ignoring the error state of the containing collection.

kinbug

Source

wm75

Most helpful comment

As you said @wm75 all outputs of a single tool execution need to be failed if the tool quit with a non-zero exit code. That is already the current behavior if you're not discovering an output collection (output discovery being a relatively rare case).

You don't need to worry about this @pcm32, this won't affect you. What you guys are discussing is already what Galaxy does. There are exceptions to this (if you're not mapping over an input collection) but there really is only one correct behavior and that is what we're doing and have been doing since we have the collection concept in Galaxy.

mvdbeek on 20 Feb 2020

❤1 👍1

All 18 comments

Leave _How to handle download failures_ at its default _Abort with error on first failure_.

Is the tool actually failing (i.e non-zero exit code)?

mvdbeek on 19 Feb 2020

Yes, that's the point

wm75 on 19 Feb 2020

Alright, that sounds bad, I'll have a look.

mvdbeek on 19 Feb 2020

I disagree, I think that the current behaviour is the correct one, and we use it actively. There are certain steps where we can allow certain elements of a collection to be failed and we are interested in the workflow to continue to run with the rest of the elements of the collection that are not in a failed state. If I'm understanding correctly the request.

pcm32 on 19 Feb 2020

There is actually a tool that specifically filters failed datasets from a collection (under collection operations) I understand. That tool would never run if the current behaviour wouldn't exist.

pcm32 on 19 Feb 2020

@pcm32 This is bad and surprising in most circumstances though. If needed you can filter out failed datasets with the Filter Failed Datasets tools: https://usegalaxy.org/root?tool_id=__FILTER_FAILED_DATASETS__

Think about it the other way, if one dataset failed and we silently continue we may produce incorrect results, ~~our worst nightmare~~ that would be pretty bad.

mvdbeek on 19 Feb 2020

There is actually a tool that specifically filters failed datasets from a collection (under collection operations) I understand. That tool would never run if the current behaviour wouldn't exist.

No, I think that part you're misunderstanding.

mvdbeek on 19 Feb 2020

Tool A is connected to tool B in the workflow, tool A produces a collection with 5 items, out of which 4 failed because parameters didn't produce an acceptable solution for the tool. Why shouldn't the remaining element in the collection that did succeed not be allowed to be processed by Tool B? Or am I understanding the issue in the wrong way.

pcm32 on 19 Feb 2020

Step A produces an output dataset for every sick patient. Step B prepares a report for the doctor. Operation successful, patient dead, as we say in German ;).

mvdbeek on 19 Feb 2020

If you do choose explicitly that failed datasets are OK, you can use https://usegalaxy.org/root?tool_id=__FILTER_FAILED_DATASETS to continue, but continuing is a bad choice by default.

mvdbeek on 19 Feb 2020

no, I don't continue with failed datasets, only with those that are not failed within the collection. I don't see why the fact that you might have failed datasets within a collection means that all the elements in your collection should be considered to be in a defective state.

pcm32 on 19 Feb 2020

We don't wait on the collection to complete to move to the next step in general - that is actually good - most workflow engines tout this as an advantage. I think the problem isn't the execution - it is that there is no visual indication the next collection is unpopulated. I think there is a very specific bug in setting populated_state here that would bridge these different opinions if fixed. Perhaps I'm wrong though...

jmchilton on 19 Feb 2020

👍2

For instance, we have a step that calculates clusters for different resolutions, but the tool doesn't always produce a result for all resolutions (because of particularities of the incoming dataset), however among the resolution values there are feasible solutions (elements in the collection that didn't fail). We have steps that go after the clustering (for instance finding markers), that can perfectly run with those clustering results that do not fail.

pcm32 on 19 Feb 2020

We don't wait on the collection to complete to move to the next step in general - that is actually good - most workflow engines tout this as an advantage.

Completely agree!

pcm32 on 19 Feb 2020

Ok, I realize this is more complicated and controversial than I first thought. A couple of thoughts to test my own understanding of the problem and to help distinguish different usage scenarios and desired outcomes:

Generating a collection with failed datasets in it is only possible by triggering multiple job runs of a single tool the outputs of which are forming the collection. Is that correct?

If so, then there should be separate job states indicating success/failure of each individual job that was run, right? In this case, it seems ok to me, too, to allow further work with the successful jobs. After all this is just the same as with anything else in your history. There may be failed and ok datasets in one history, but you can still use the ok ones, even in a workflow.
My use case is a single job run for a tool that produces a collection of datasets. In this case, my view is that this single run can either succeed or fail and I don't think that Galaxy should try to know better than the tool wrapper. If the tool wrapper indicates a failure, it's a failure of the job as a whole no matter whether individual datasets may seem usable or not.
In that case, of a single job producing the collection there are actually never individual failed datasets in the collection, there can only be missing datasets that were not produced because the tool run aborted early.

Is that a distinction that makes sense to you? @pcm32?
and is this a behavior that could be implemented, @jmchilton @mvdbeek ?

wm75 on 20 Feb 2020

Yes, case 1 is what you would get when your tool receives as input a collection, it processes each element as a separate job internally, and produces as a result a new collections with those consecutive results. In the Galaxy UI this looks as a box with the multiplexed connector in the input and a multiplexed connection coming out as well. This is the use case I'm after to protect because we rely on this for our Single Cell Expression Atlas data production (yes, it runs on Galaxy).

I think that the fact that the job scheduler doesn't wait for all the jobs in a collection to be done by default to execute subsequent jobs is also important to protect; otherwise this could significantly slow down many workflows if for large collections you force the wait for subsequent execution until all elements are ready. This is something that people advertising for other workflows environments would emphasise, giving Galaxy a fame of being "slower" than other environments.

Maybe the tool interface, when operating in workflow mode, could have a switch that says "Require all elements of incoming collections to be valid", which could be used to switch from one behaviour to the other. Or maybe this is just a problem of visualisation as @jmchilton suggests.

pcm32 on 20 Feb 2020

Yes, case 1 is what you would get when your tool receives as input a collection, it processes each element as a separate job internally, and produces as a result a new collections with those consecutive results. In the Galaxy UI this looks as a box with the multiplexed connector in the input and a multiplexed connection coming out as well. This is the use case I'm after to protect because we rely on this for our Single Cell Expression Atlas data production (yes, it runs on Galaxy).

Great, then we agree on this point :smile:

I think that the fact that the job scheduler doesn't wait for all the jobs in a collection to be done by default to execute subsequent jobs is also important to protect

I also agree here, for the case of separate jobs populating the collection. For the single tool producing the collection I think Galaxy waits for the single job to finish anyway, right?

wm75 on 20 Feb 2020