Following @pditommaso's screencast, I am using nextflow cloud create to set up a cluster on AWS EC2. So far so good: several worker nodes are instantiated along with the master node, and I can ssh into the master node without problems. However, when I run a test with ./nextflow run examples/blast.nf -with-docker, I get the following error from Ignite:
N E X T F L O W ~ version 0.29.0
Pulling nextflow-io/examples ...
downloaded from https://github.com/nextflow-io/examples.git
Launching `nextflow-io/examples` [curious_goldberg] - revision: 27afa1c086 [master]
[warm up] executor > ignite
ERROR ~ org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder.setShared(Z)V
-- Check script 'blast.nf' at line: 52 or see '.nextflow.log' file for more details
The config used for setting up the cluster:
cloud {
imageId = 'ami-054c4e0bad8549c37' //a clone of the AMI used in the screencast, to have it available in local aws region
subnetId = 'subnet-57eba230'
sharedStorageId = 'fs-d21be5eb' //EFS volume
sharedStorageMount = '/mnt/efs'
instanceType = 't2.micro'
userName = 'radsuchecki'
}
Log file from the master node: .nextflow.log
It looks like the issue is that version 0.29.0 is using the wrong version of Ignite:
$ NXF_VER=0.29.0 NXF_MODE=ignite nextflow info -d | grep ignite
NXF_MODE=ignite
/Users/pditommaso/.nextflow/capsule/deps/io/nextflow/nxf-ignite/0.29.0/nxf-ignite-0.29.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-aws/2.4.0/ignite-aws-2.4.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-slf4j/2.4.0/ignite-slf4j-2.4.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-core/2.4.0/ignite-core-2.4.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/gridgain/ignite-shmem/1.0.0/ignite-shmem-1.0.0.jar
$ NXF_VER=0.29.1 NXF_MODE=ignite nextflow info -d | grep ignite
NXF_MODE=ignite
/Users/pditommaso/.nextflow/capsule/deps/io/nextflow/nxf-ignite/0.29.1/nxf-ignite-0.29.1.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-aws/1.6.0/ignite-aws-1.6.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-slf4j/1.6.0/ignite-slf4j-1.6.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/apache/ignite/ignite-core/1.6.0/ignite-core-1.6.0.jar
/Users/pditommaso/.nextflow/capsule/deps/org/gridgain/ignite-shmem/1.0.0/ignite-shmem-1.0.0.jar
If you update to the latest (0.29.1) it should be fine.
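To check which Ignite version a given release resolves without reading through the whole jar list, something like the following may help. It is only a sketch: ignite_core_version is a hypothetical helper name, not part of nextflow, and it assumes the jar paths keep the ignite-core-<version>.jar naming shown above.

```shell
# Hypothetical helper (not part of nextflow): read `nextflow info -d` style
# output on stdin and print the version of the ignite-core jar, if any.
# Assumes the jar file is named ignite-core-<version>.jar as in the listing above.
ignite_core_version() {
  sed -n 's|.*/ignite-core-\([0-9][0-9.]*\)\.jar$|\1|p'
}
```

Usage would then be along the lines of NXF_VER=0.29.1 NXF_MODE=ignite nextflow info -d | ignite_core_version, which for the listings above should report 1.6.0 for 0.29.1 and 2.4.0 for 0.29.0.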
This gets things going. However, I am not sure whether something is wrong with my setup: following on from the above, it appears the nodes are not joining the cluster, e.g. from .node-nextflow.log on one of the worker nodes:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=aaf276db, name=nextflow, uptime=00:15:00:011]
^-- H/N/C [hosts=1, nodes=1, CPUs=2]
^-- CPU [cur=0%, avg=0.22%, GC=0%]
^-- Heap [used=131MB, free=92.58%, comm=297MB]
^-- Non heap [used=55MB, free=-1%, comm=56MB]
^-- Public thread pool [active=0, idle=16, qSize=0]
^-- System thread pool [active=0, idle=16, qSize=0]
^-- Outbound messages queue [size=0]
May-14 07:10:01.146 [scheduler-agent] DEBUG nextflow.scheduler.SchedulerAgent - === Waiting for master node to join..
even though
[main] DEBUG nextflow.daemon.IgGridFactory - Apache Ignite config > joining IPs: 172.31.14.62, 172.31.3.109, 172.31.5.179, 172.31.0.253
That's not good. Open all ports for connections in the default security context.
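For reference, one way to do this from the command line, assuming "security context" here means the EC2 security group the nodes were launched in. This is a sketch, not the method used in the screencast: open_intra_group_traffic is a hypothetical helper, the group ID is a placeholder you need to replace, and it allows all protocols and ports only between members of the same group, which is a narrower variant than opening the ports to everyone.

```shell
# Hypothetical helper: allow all traffic between instances that share one
# security group, so cluster nodes can reach each other on any port.
# $1 is the security group ID (placeholder; use the group your nodes are in).
# With DRY_RUN=1 the aws command is only printed, so it can be reviewed first.
open_intra_group_traffic() {
  sg_id="$1"
  ${DRY_RUN:+echo} aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg_id}]"
}
```

Running DRY_RUN=1 open_intra_group_traffic sg-0123 (with your own group ID in place of sg-0123) prints the command without changing anything; dropping DRY_RUN applies the rule.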