Nextflow: Nextflow should not stage files that have the same name

Created on 3 Oct 2017 · 9 Comments · Source: nextflow-io/nextflow

I collect files from multiple subdirectories and work on them in a single process. Nextflow does not complain if two files have the same basename, which leads to silent data loss. It seems that when it stages them, the second symlink overwrites the first one in the working directory.

To reproduce, run mkdir subdir1 subdir2 && echo hello > subdir1/file && echo world > subdir2/file and then run this workflow:

c = Channel.from([
  [file('subdir1/file'), file('subdir2/file')]])

process p {
  publishDir '.'

  input: file(x) from c
  output: file('concatenated')

  "cat $x > concatenated"
}

The intention was to get an output file that contains hello\nworld\n. Instead, I get world\nworld\n.
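The result can be reproduced outside Nextflow if one assumes, as guessed above, that staging amounts to symlinking each input under its basename. The following is only a sketch of that assumed mechanism with plain `ln -sf`, not Nextflow's actual staging code; the `src` and `stage` scratch directories are made up for illustration.

```shell
#!/bin/sh
# Sketch only: mimics the suspected staging behaviour with plain symlinks.
src=$(mktemp -d)     # stands in for the directory holding subdir1/ and subdir2/
stage=$(mktemp -d)   # stands in for the task working directory
mkdir "$src/subdir1" "$src/subdir2"
echo hello > "$src/subdir1/file"
echo world > "$src/subdir2/file"

# Both inputs are linked under the same basename; with -f the second
# link silently replaces the first.
ln -sf "$src/subdir1/file" "$stage/file"
ln -sf "$src/subdir2/file" "$stage/file"

# The command still names the input twice, so the surviving link is read twice.
result=$(cd "$stage" && cat file file)
printf '%s\n' "$result"
```

Under that assumption the command sees the name `file` twice while only one link survives, which would explain why the second file's content appears twice instead of the first file being merely missing.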

To give a little bit of context: in the actual pipeline, the process works with multiple FASTQ files that come from the same individual but were sequenced in different runs. They are stored in different directories, but the file (base)names follow the standard Illumina scheme <sample-name>_S<sample-index>_L<lane-index>_R1_001.fastq.gz. Since the sample name is identical (the files come from the same individual), a collision occurs when, by chance, the other run of that sample used the same sample index and the same lane.
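A check outside Nextflow can catch this kind of collision before the pipeline is launched. The following is a minimal sketch, not part of Nextflow; the function name `duplicate_basenames` and the scratch directory layout are hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print every basename that occurs more
# than once among the regular files under the given directories.
duplicate_basenames() {
    find "$@" -type f | sed 's|.*/||' | sort | uniq -d
}

# Recreate the layout from the report in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/subdir1" "$workdir/subdir2"
echo hello > "$workdir/subdir1/file"
echo world > "$workdir/subdir2/file"

dupes=$(duplicate_basenames "$workdir/subdir1" "$workdir/subdir2")
if [ -n "$dupes" ]; then
    # A launcher script could refuse to start the pipeline at this point.
    echo "basename collision: $dupes" >&2
fi
```

For the two run directories of one sample, an empty result means no two input files share a basename.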

All 9 comments

Yes, you are right. However, I think a warning message should be reported. Does that sound good?

In my case, that would not have helped so much since I launch Nextflow more or less automatically for hundreds of samples (one Nextflow instance per sample). I inspect the log output only when there's a failure.

In my opinion, giving an error message and exiting would be the right strategy. I cannot imagine a situation at the moment where it would be ok to continue with a silently dropped input file.

I also think that, especially in fully automated analyses, this should "break" things. Somewhat silent failures are difficult to spot.

Agree, name collisions when staging out files or directories should raise an error and stop the execution.

I agree with Francesco. In case of a collision it is better to stop. Consider that the file name is used in many aspects of Nextflow (for creating ids etc.). I think that it should be in the hands of the user to manage the exception.

I vote for making name collisions an error (i.e. stop the pipeline) as well. Possibly make it an option in nextflow.config?

OK, there's consensus to return an error. I was wondering if it could make sense to use a warning to avoid any potential breaking changes.

But I don't see any valid use case, and it should be managed as an error condition. It can only result in an invalid computation.

For the same reason, I would avoid making this an option.

Included in version 0.26.0-beta5

Great, thanks!
