Process tasks can produce temporary files that are not used by downstream tasks and that in some cases consume a significant amount of storage.
There should be a directive that automatically deletes all files produced by a task that are not needed by downstream processes.
That's something we would be very interested in.
I don't know if it's something you can also manage, but a case we often face is a simple chain of processes A->B->C->D where process A produces an output used only by B, B produces an output used only by C, and so on. In this case you would need to wait for B to finish before deleting the files produced by A.
More generally, would it be possible to have a directive that deletes files as soon as they are no longer needed (i.e. not declared as an input in a downstream process)?
The goal of this issue is much simpler, i.e. to delete all files not declared in the process output declaration.
What you are proposing would require an analysis of the dependency graph. Though possible, it is surely much more complex. Also, I guess it would break the _resume_ mechanism when another task outside that chain, let's say _Z_, stops the pipeline execution.
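To make the "analysis of the dependency graph" concrete, here is a minimal sketch in plain Groovy of the bookkeeping it would need for the A->B->C->D chain. Everything in it is hypothetical (the `consumers` map, the `A.out` style file names, the `onProcessFinished` closure); it is not how Nextflow works internally, only an illustration of the idea.

```groovy
// Hypothetical sketch in plain Groovy; NOT Nextflow internals.
// Map each intermediate output to the processes that declare it as an input.
def consumers = [
    'A.out': ['B'],
    'B.out': ['C'],
    'C.out': ['D'],
]

// Mutable copy: per file, the consumers that have not finished yet.
def pending = consumers.collectEntries { file, procs -> [(file): procs.toSet()] }

// Called when a process finishes; returns the files that just became deletable.
def onProcessFinished = { String name ->
    def freed = []
    pending.each { file, procs ->
        // A file is freed only when its last remaining consumer completes.
        if (procs.remove(name) && procs.isEmpty()) {
            freed << file
        }
    }
    return freed
}

assert onProcessFinished('B') == ['A.out']   // A's output is free once B is done
assert onProcessFinished('D') == ['C.out']
assert onProcessFinished('B') == []          // nothing left waiting on B
```

Note that this only covers the bookkeeping. It does not address the resume concern: once `A.out` is deleted, a resumed run that needs to re-execute B would no longer find its input.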
I'm interested in this issue, especially in the form of an option for the scratch directive to automatically remove the temporary folder after the results have been transferred back to the work directory.
Ditto for @mfoll's request - temporary file deletion is one of the few shackles that's still keeping me glued to Snakemake.
This issue has been indirectly solved by #230, which deletes the temporary files created in the scratch folder. Thus, if you want to keep only the process output files, use process.scratch = true in your pipeline.
It is true that this is not what @mfoll is proposing, but NF has not been designed to handle such a pattern. It must also be noted that in NF the pipeline work directory is meant to hold intermediate results, not the final pipeline output. As such, the entire workdir content can be deleted once the computation has terminated. This makes NF very different from other pipeline tools, for example Snakemake, which use the current folder to hold both the tasks' temporary files and the final pipeline output.
For this reason I'm closing this issue. Feel free to comment if you need further help/explanations on this.
I see what you're saying. The scratch deletion is definitely useful - I just frequently use pipelines that have processes which temporarily need the output from another process. For example, a common operation for me is downloading a large .sra file from NCBI, extracting its even bigger .fastq files, aligning the .fastq files to a compressed .bam file, and then doing a bunch of operations on the .bam file. Thus, it'd be really nice to delete the massive .sra and .fastq files as soon as they are no longer needed in the pipeline.
I definitely appreciate Nextflow's unique directory structure (I actually really like the approach) - it'd just be nice if there was some way to mark the output of a process as "temporary", and have it deleted once all of its dependent processes finished their execution.
Why won't setting scratch = true work in your use case?
Maybe I don't understand the scratch option, but I thought it only applies to an individual process? In my case, I have to break up that combination of operations (download .sra, extract .fastq's, and align to .bam) into at least two different processes because the first two are single threaded, while the alignment is multithreaded.
When using scratch = true, a task is executed in a temporary directory under /tmp. When the task completes, the output files are copied into the task workdir and the tmp folder is deleted. It can be set for a specific process or for the whole pipeline.
This is considered a best practice when using a shared file system.
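For reference, a minimal sketch of what this could look like in nextflow.config. The process name ALIGN is only a placeholder for illustration, and the `withName` selector syntax assumes a reasonably recent Nextflow version.

```groovy
// nextflow.config

// Option 1: run every process in a scratch directory.
process.scratch = true

// Option 2: enable it only for a specific process.
// 'ALIGN' is a hypothetical process name, not one from this thread.
process {
    withName: 'ALIGN' {
        scratch = true
    }
}
```

Either way, only the files declared in the process output block are copied back to the task work directory. Files passed between processes still live in the work directory until it is cleaned up.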