Nextflow: Cache remotely pulled files

Created on 4 May 2018 · 14 comments · Source: nextflow-io/nextflow

We have workflows that can specify S3 paths for reference index files. This works nicely, except that every task that needs the files downloads a fresh copy. This is a bit annoying when working with a handful of files, but a roadblock when working with hundreds or thousands.

Could Nextflow detect when identical remote paths are specified and use a common cache? The references would then be downloaded only once per workflow, which makes a lot more sense.

Thanks!

Phil

ewels

👍4

All 14 comments

Yes, this is something that definitely needs to be improved. Related both to #265 and #397.

pditommaso on 4 May 2018

Aha! I _thought_ that this was already in an issue somewhere - #265 is the one I was thinking of. Didn't find it because you created it instead of me!

ewels on 4 May 2018

That's why I ask users to create the issues :)

pditommaso on 4 May 2018

Related to this: I noticed that when NF needs to download the files, the processes are started one by one (not in parallel), at least on our k8s cluster. Could that be an NF issue, or is it more likely related to our k8s node configuration?

wikiselev on 4 May 2018

So this could potentially work in a similar way to how Singularity image downloads currently work. The first time a remote URL is encountered, the file is downloaded to a special directory inside the work directory (configurable to be elsewhere?). It is then symlinked into the task work directory to be used. Every time a remote file is found, this cache directory is checked first; if the file has already been downloaded, it is used directly instead of being downloaded again.

It would be great if this worked with all remote file types: s3 but also http, ftp etc.
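The lookup described above can be sketched roughly as follows. This is a minimal Python illustration of the idea only, not Nextflow's actual implementation: the function name `stage_remote_file`, its parameters, the URL-hash cache layout, and the pluggable `download` callable are all assumptions made for the example.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def stage_remote_file(
    url: str,
    cache_dir: Path,
    task_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return a symlink in task_dir that points at a shared cached copy of url.

    Hypothetical helper: `download(url, dest)` stands in for whatever
    fetches s3/http/ftp content and writes it to `dest`.
    """
    # Key the cache entry on the exact URL so identical remote paths share one copy.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    name = url.rstrip("/").rsplit("/", 1)[-1]
    cached = cache_dir / key / name

    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        # Download under a temporary name and rename afterwards,
        # so an interrupted download is never mistaken for a cached file.
        tmp = cached.parent / (name + ".part")
        download(url, tmp)
        os.replace(tmp, cached)

    # Link the cached file into the task directory instead of copying it.
    task_dir.mkdir(parents=True, exist_ok=True)
    link = task_dir / name
    if not link.is_symlink():
        link.symlink_to(cached.resolve())
    return link
```

With this shape, two tasks that ask for the same URL trigger a single call to `download` and each receive a symlink to the same cached file. A real implementation would also need locking so that concurrent tasks do not download the same file at once.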

ewels on 2 Oct 2018

Personally I would like this to be the default 😉 Then have a new config option to turn it off or change the behaviour.

ewels on 2 Oct 2018

Done!

pditommaso on 30 Jan 2019

🎉1 😄1

Brilliant - thanks! Will this be default behaviour, or do we need to modify pipeline code at all?

ewels on 30 Jan 2019

No changes at all, downloaded files are stored in the pipeline $WORKDIR/stage path instead of the task work dir.

pditommaso on 30 Jan 2019

👍4 🎉1

Fixed a synchronisation issue that occurred when two or more processes accessed the same foreign file concurrently.

pditommaso on 2 Feb 2019

Included in version 19.02.0-edge.

pditommaso on 6 Feb 2019

👍3 🚀1

I'm still having issues with this. I have a text file with one FTP path per line and am using splitText to turn each line into a file object, see code below:

     Channel
         .fromPath("${baseDir}/ftp_wgetlist.txt")
         .splitText()                               // emit one line (one FTP path) at a time
         .view{ it.strip() }                        // print each path with the trailing newline removed
         .map{ ['Species A', file(it.strip())] }    // pair a label with the remote file object
         .set {species_A_variants_ch}

These files get consumed by some downstream process.
However, the files are downloaded again almost every time I rerun with -resume (I'm not sure what the determining factor is), so I end up with multiple copies of each file in work/stage.
Is this because I am dynamically reading in the FTP paths from the ftp_wgetlist.txt file?
Or is there another reason they don't seem to get cached?
I am on Nextflow version 19.04.0.5069.

tobsecret on 25 Jun 2019

👍1

I'm starting to think that there's a problem with this feature.

pditommaso on 25 Jun 2019

👍2

Having read the latest blog post on the resume feature, I think this might also be explained by our use of a shared file system on our cluster. I will try again with cache='lenient' and see if that fixes it.

tobsecret on 1 Aug 2019
