Nextflow: Google Cloud Storage built-in support

Created on 24 Jan 2017 · 22 Comments · Source: nextflow-io/nextflow

What do I need to do to have Nextflow's file methods work for Google Cloud storage (gs://) like it works for S3 buckets (s3://)?


The good news is that Nextflow (NF) support for Google Storage is already implemented; the bad news is that it's not yet stable enough to be released. I will post progress on this integration here.

@pditommaso Thanks! We are very interested in using it and will be trying it out soon. What are your specific development / testing needs to get it to a stable release?

Good, I will prepare a new snapshot including it.

The Google Storage implementation already includes a set of unit tests and integration tests that are all green. I did get some weird error messages when downloading large files, though I'm starting to think they were related to a temporary hiccup in our local network.

It would be great if you could test it in a real usage scenario.

I've uploaded a new snapshot that includes support for Google Storage. You can find the preliminary documentation here.

To use this version you will need to define the following environment variable:

export NXF_VER=0.24.0-SNAPSHOT

Then use NF as usual.

@brandon-white is there any feedback on the Google storage support?

@pditommaso Sorry for leaving this one quiet for so long. I downloaded nextflow today using the usual process (curl -s https://get.nextflow.io | bash) and tried to execute a test flow from the prelim docs with the NXF_VER environment variable set and got an error indicating that it couldn't find a handler for the gs protocol.

Note that there were some indicated errors downloading capsules on first execution, but some of those files may have already been in ~/.nextflow from earlier executions of the tool; it's not clear to me if these were the cause of the problem. I've included that log for completeness.

Script

file('gs://[redacted bucket name]/').list().each { println it }

Log output

Jul-07 19:11:47.678 [main] DEBUG nextflow.cli.Launcher - $> ./nextflow list-bucket.nf
Jul-07 19:11:47.760 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.24.0-SNAPSHOT
Jul-07 19:11:47.767 [main] INFO  nextflow.cli.CmdRun - Launching `list-bucket.nf` [gloomy_kalam] - revision: 8edca819a7
Jul-07 19:11:48.032 [main] DEBUG nextflow.Session - Session uuid: f6bd686c-b509-4dd9-9fbe-84b82c25c7ee
Jul-07 19:11:48.032 [main] DEBUG nextflow.Session - Run name: gloomy_kalam
Jul-07 19:11:48.034 [main] DEBUG nextflow.Session - Executor pool size: 16
Jul-07 19:11:48.045 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 0.24.0-SNAPSHOT build 4234
  Modified: 20-03-2017 08:34 UTC 
  System: Linux 4.8.0-56-generic
  Runtime: Groovy 2.4.10 on OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11
  Encoding: UTF-8 (UTF-8)
  Process: 29153@frontend001 [192.168.128.14]
  CPUs: 16 - Mem: 102.2 GB (55.1 GB) - Swap: 0 (0)
Jul-07 19:11:48.067 [main] DEBUG nextflow.Session - Work-dir: /scratch/ihaque/nextflow-gs/work [ext2/ext3]
Jul-07 19:11:48.067 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /scratch/ihaque/nextflow-gs/bin
Jul-07 19:11:48.152 [main] DEBUG nextflow.Session - Session start invoked
Jul-07 19:11:48.160 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
Jul-07 19:11:48.160 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
Jul-07 19:11:48.233 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Jul-07 19:11:48.255 [main] DEBUG nextflow.Session - Session aborted -- Cause: Cannot a find a file system provider for scheme: gs
Jul-07 19:11:48.314 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.IllegalArgumentException: Cannot a find a file system provider for scheme: gs
        at nextflow.file.FileHelper.getOrCreateFileSystemFor(FileHelper.groovy:559)
        at nextflow.file.FileHelper.getOrCreateFileSystemFor(FileHelper.groovy)
        at nextflow.file.FileHelper.asPath(FileHelper.groovy:266)
        at nextflow.file.FileHelper.asPath(FileHelper.groovy:246)
        at nextflow.file.FileHelper$asPath$0.call(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
        at nextflow.Nextflow.file(Nextflow.groovy:161)
        at nextflow.Nextflow.file(Nextflow.groovy)
        at nextflow.Nextflow$file.callStatic(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:206)
        at _nf_script_7da46590.run(_nf_script_7da46590:1)
        at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:322)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:156)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:223)
        at nextflow.cli.Launcher.run(Launcher.groovy:410)
        at nextflow.cli.Launcher.main(Launcher.groovy:564)

Download/install errors

CAPSULE: Downloading dependency io.nextflow:nxf-commons:pom:0.24.0-20170320.083746-18
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact io.nextflow:nxf-commons:pom:0.24.0-20170320.083746-18 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency io.nextflow:nxf-commons:pom:0.24.0-20170320.083746-18
CAPSULE: Downloading dependency io.nextflow:nxf-httpfs:pom:0.24.0-20170318.132332-17
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact io.nextflow:nxf-httpfs:pom:0.24.0-20170318.132332-17 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency io.nextflow:nxf-httpfs:pom:0.24.0-20170318.132332-17
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-json:pom:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-json:pom:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:pom:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy:pom:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy:pom:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:pom:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-templates:pom:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-templates:pom:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:pom:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-xml:pom:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-xml:pom:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:pom:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-nio:pom:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:pom:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-nio:pom:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:pom:2.4.10
CAPSULE: Downloading dependency io.nextflow:nextflow:pom:0.24.0-20170320.083617-18
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact io.nextflow:nextflow:pom:0.24.0-20170320.083617-18 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency io.nextflow:nextflow:pom:0.24.0-20170320.083617-18
CAPSULE: Downloading dependency io.nextflow:nxf-commons:jar:0.24.0-20170320.083746-18
CAPSULE: Downloading dependency io.nextflow:nxf-httpfs:jar:0.24.0-20170318.132332-17
CAPSULE: Downloading dependency io.nextflow:nextflow:jar:0.24.0-20170320.083617-18
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:jar:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-json:jar:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy:jar:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-templates:jar:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-xml:jar:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-nio:jar:2.4.10 in local (file:/home/ihaque/.m2/repository) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:jar:2.4.10
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-xml:jar:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-json:jar:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy:jar:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-templates:jar:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Transfer failed: capsule.org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact org.codehaus.groovy:groovy-nio:jar:2.4.10 in https://oss.sonatype.org/content/repositories/snapshots (https://oss.sonatype.org/content/repositories/snapshots) (for stack trace, run with -Dcapsule.log=verbose)
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-json:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-templates:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-xml:jar:2.4.10
CAPSULE: Downloading dependency org.codehaus.groovy:groovy-nio:jar:2.4.10
N E X T F L O W  ~  version 0.24.0-SNAPSHOT                   
Launching `gcslist.nf` [nostalgic_fermat] - revision: fd06c2de94
ERROR ~ Cannot a find a file system provider for scheme: gs

 -- Check script 'gcslist.nf' at line: 1 or see '.nextflow.log' file for more details

Google storage support has been stripped from that version and postponed to a future release. If you are willing to test it I can assemble a special build for that.
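
For anyone hitting the same error, the Capsule "Transfer failed" lines above are noisy enough to hide the real cause. The following is only a minimal sketch, assuming the log is in the default .nextflow.log file and that the message wording matches the one quoted above; the function name missing_provider is made up for illustration and is not part of Nextflow.

```shell
#!/usr/bin/env bash
# Return 0 if the given Nextflow log reports a missing file system provider
# for the given URI scheme (default: gs); return non-zero otherwise.
# Both the helper name and the defaults are illustrative assumptions.
missing_provider() {
    local log_file="${1:-.nextflow.log}"
    local scheme="${2:-gs}"
    grep -q "file system provider for scheme: ${scheme}" "$log_file"
}

# Example use: report the problem only when the log actually contains it.
if missing_provider .nextflow.log gs 2>/dev/null; then
    echo "This build has no handler for gs:// paths"
fi
```

If the check succeeds, the build itself lacks the gs handler, and re-downloading the dependencies is unlikely to help.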

Yes, I've a little bandwidth (and a small test project) to give it a try, if you have a moment to spin a build.

I will upload it during the week-end. I will post here the coordinates.

Thanks, I'll take a look when it's up.

Quick question that I wasn't able to answer looking at the Process or S3 docs: how does Nextflow handle staging remote files in/out of the work directory? Will a GS/S3 file implicitly get copied into the local work directory for the lifetime of the entire pipeline, only for the lifetime of the process requesting the file (implying multiple transfers if multiple procs need the file), or neither (staging needs to be done manually if needed)?

Currently only for the lifetime of the process requesting the file, but this is something we are planning to improve (see #265).

I've uploaded a new build including the Google Storage support. To use it, you will need to define the following variables:

export NXF_MODE=gcp
export NXF_VER=0.26.GCP-SNAPSHOT

Then run nextflow info to download the required dependencies. It should print:

  Version: 0.26.GCP-SNAPSHOT build 4465
  Modified: 08-07-2017 12:45 UTC (14:45 CEST)

When using Nextflow from within a Google Compute instance, no additional authentication steps are necessary. In all other cases the following vars are required:

GOOGLE_PROJECT_ID=<your project id>
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials/file.json
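
A quick pre-flight check can catch a missing variable before the pipeline starts. This is a rough sketch rather than anything shipped with Nextflow: the function name check_gcp_env is hypothetical, and it only verifies that the two variables listed above are set and that the credentials file is readable, not that the credentials are valid.

```shell
#!/usr/bin/env bash
# Check the two variables named above before launching Nextflow from
# outside a Google Compute instance. Prints one message per problem to
# stderr and returns non-zero if anything is missing.
# The function name is an illustrative assumption.
check_gcp_env() {
    local status=0
    if [ -z "${GOOGLE_PROJECT_ID:-}" ]; then
        echo "GOOGLE_PROJECT_ID is not set" >&2
        status=1
    fi
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        status=1
    elif [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "Credentials file is not readable: $GOOGLE_APPLICATION_CREDENTIALS" >&2
        status=1
    fi
    return "$status"
}
```

It could be chained in front of a run, for example: check_gcp_env && nextflow run list-bucket.nf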

Your feedback is welcome.

An update here: I've so far tested the ability to list files from GS and process that list of files for downstream processing. It works well! It's replaced the chunk of code that executed gsutil ls and parsed that output.

I have not yet tested the ability to copy files in/out from GCS. Are there particular use cases you'd want to see before merging?

That sounds good. It would be interesting to test publishing a process output directly to a GS bucket, or having a pipeline that uses a GS path as its work directory (but this would require deploying an Ignite cluster in a Google cluster, which is currently not as easy as on AWS).

@pditommaso I've tested the publishDir directive for GCS (both copy and move modes) and confirmed that it works correctly with this build. Here's the relevant code:

/*
 * Test publishDir directive for google cloud storage.
 * Should publish process output files to google cloud directory
 */

process gcpublish {
        scratch true
        publishDir params.gcOutputDir, mode: 'move', saveAs: {filename -> "${params.testName}.${filename}"}

        input:
                val params.testName
        output:
                file 'chunk_*' into letters

        """
        printf '${params.testName}' | split -b 1 - chunk_
        """
}

Everything seems to be working well!

Unfortunately, for our use case we are also using scratch and don't want to copy large files from the node to the local /work directory in order to publish them, so this still isn't quite the solution we need. However, this build is working as expected (given that publishDir always requires output files to be written to /work, for all storage systems).

Thanks for testing this. I will merge it when #265 is solved.

Does this only support Google Storage, or can this feature be used to run compute jobs on Google Cloud in the same way as described in the docs for Amazon Cloud?

It's complete and live! Check out the blog post.

Right!

I've read the blog post and the current documentation, and was wondering if it's possible to use the local gcloud machine for the work-dir, but then use a bucket for the publishDir? If not, is there a reason for this?

Thanks!

You can use the machine's local storage only with the local executor. This works as long as you don't need to scale your execution across many machines, or if you want to scale vertically, i.e. by using a machine type with many CPUs and a lot of memory.

In all other cases, you will need to use a storage accessible from the remote computing nodes as explained in the docs.
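
As a rough illustration of the single-machine case described above, the snippet below writes a small nextflow.config that keeps the work directory on local disk and passes a bucket path through params.gcOutputDir, the parameter used by the publishDir example earlier in this thread. It is a sketch under assumptions: the helper name write_local_gcs_config, the /tmp/nf-work path and the bucket URL in the usage note are placeholders, and it presumes a Nextflow build with gs:// support.

```shell
#!/usr/bin/env bash
# Write a minimal nextflow.config for a single-machine run:
# tasks run with the local executor, intermediate files stay on local disk,
# and the publish target is handed to the pipeline as params.gcOutputDir.
# Helper name and default paths are illustrative assumptions.
write_local_gcs_config() {
    local bucket_dir="$1"                 # bucket path to publish into
    local out="${2:-nextflow.config}"     # config file to create
    cat > "$out" <<EOF
// Run tasks on this machine and keep the work directory on local disk.
process.executor = 'local'
workDir = '/tmp/nf-work'
// Consumed by a process that declares: publishDir params.gcOutputDir
params.gcOutputDir = '${bucket_dir}'
EOF
}
```

For example, write_local_gcs_config gs://my-bucket/results would create nextflow.config in the current directory, where my-bucket stands for your own bucket name.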

Please consider using the community channels for general questions:

https://groups.google.com/forum/#!forum/nextflow
https://gitter.im/nextflow-io/nextflow
