We have workflows that can specify s3 paths for reference index files. This works nicely, except that every task that needs the files downloads a fresh copy. This is a bit annoying when working with a handful of files, but a roadblock when working with hundreds or thousands.
Could Nextflow detect when exact duplicate remote paths are specified and use a common cache? Then these references would only be downloaded once per workflow, which makes a lot more sense.
Thanks!
Phil
Yes, this is something that definitely needs to be improved. Related both to #265 and #397.
Aha! I _thought_ that this was already in an issue somewhere - #265 is the one I was thinking of. Didn't find it because you created it instead of me!
That's why I ask users to create the issues :)
Related to this - I noticed that when NF needs to download the files, the processes are started one by one (not in parallel), at least on our k8s cluster. Could that be an NF issue, or is it more likely related to our k8s node configuration?
So this could potentially work in a similar way to how the singularity image downloads currently work. The first time a remote URL download is encountered, it is downloaded to a special directory inside work (configurable to be elsewhere?). Then this is softlinked to the task work directory to be used. Every time a remote file is found this cache directory is checked first to see if the file has already been downloaded and it is used directly instead of downloading it again.
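The lookup-then-link flow proposed above could be sketched roughly as follows in plain Groovy. This is only an illustration of the idea, not Nextflow's actual implementation: the function name `stageRemoteFile` and the `cacheDir`/`taskDir` parameters are hypothetical, it only handles URL schemes the JVM understands natively (http, ftp, file), and it does no locking between concurrent processes.

```groovy
import java.nio.file.Files
import java.nio.file.LinkOption
import java.nio.file.Path
import java.nio.file.StandardCopyOption
import java.security.MessageDigest

// Hypothetical helper: download `url` once into `cacheDir`, then symlink it into `taskDir`.
Path stageRemoteFile(String url, Path cacheDir, Path taskDir) {
    Files.createDirectories(taskDir)

    // Key the cache entry on the full URL so identical remote paths share one copy,
    // while same-named files from different locations do not collide.
    String key = MessageDigest.getInstance('MD5').digest(url.getBytes('UTF-8')).encodeHex().toString()
    String name = url.substring(url.lastIndexOf('/') + 1)
    Path cached = cacheDir.resolve(key).resolve(name)

    // Check the cache first; download only on a miss.
    if (!Files.exists(cached)) {
        Files.createDirectories(cached.parent)
        // Download to a temporary file and move it into place, so a partial
        // download is never mistaken for a complete cache entry.
        Path tmp = Files.createTempFile(cached.parent, name, '.part')
        new URL(url).withInputStream { input ->
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING)
        }
        Files.move(tmp, cached, StandardCopyOption.ATOMIC_MOVE)
    }

    // Soft-link the cached copy into the task work directory.
    Path link = taskDir.resolve(name)
    if (!Files.exists(link, LinkOption.NOFOLLOW_LINKS)) {
        Files.createSymbolicLink(link, cached.toAbsolutePath())
    }
    return link
}
```

Calling this twice with the same URL and different task directories would produce two symlinks pointing at a single downloaded copy.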
It would be great if this worked with all remote file types: s3 but also http, ftp etc.
Personally I would like this to be the default 😉 Then have a new config option to turn it off or change the behaviour.
Done!
Brilliant - thanks! Will this be default behaviour, or do we need to modify pipeline code at all?
No changes at all, downloaded files are stored in the pipeline $WORKDIR/stage path instead of the task work dir.
Fixed a synchronisation issue when two or more processes access the same foreign file in a concurrent manner.
Included in version 19.02.0-edge.
I'm still having issues with this. I have a text file with one ftp path per line and am using splitText to turn each line into a file object, see code below:
    Channel
        .fromPath("${baseDir}/ftp_wgetlist.txt")
        .splitText()
        .view { it.strip() }
        .map { ['Species A', file(it.strip())] }
        .set { species_A_variants_ch }
These files get consumed by some downstream process.
Now this downloads the files again almost every time I rerun with -resume (not quite sure what is the determining factor), so I have multiple copies of each of the files in work/stage.
Is this because I am dynamically reading in the ftp paths from the ftp_wgetlist.txt file?
Or is there another reason they don't seem to get cached?
I am on nextflow version 19.04.0.5069.
I'm starting to think that there's a problem with this feature.
After reading the latest blog post on the resume function, it might also be explained by us using a shared file system on our cluster. Will try again with cache='lenient' and see if that fixes it.
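For reference, the lenient cache mode mentioned above can be enabled for all processes from the configuration file. This is a minimal fragment, assuming a `nextflow.config` next to the pipeline script. In lenient mode the cache keys are built from input file paths and sizes and ignore timestamps, which is why it is suggested for shared file systems. Whether it resolves the repeated downloads described here is not confirmed in this thread.

```groovy
// nextflow.config
// Apply the 'lenient' cache mode to every process in the pipeline.
process.cache = 'lenient'
```

The same setting can be applied to a single process by adding the directive `cache 'lenient'` inside its definition.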