Nextflow: Improve AWS batch docs and provide Cloudformation stack

Created on 1 Apr 2019 · 11 Comments · Source: nextflow-io/nextflow

New feature

AWS Batch is available to virtually everyone, and it is not only more convenient than local execution on a dedicated EC2 instance, it is also far more cost-efficient. Of all the ways to run Nextflow non-locally, one can reasonably guess that AWS Batch will become the dominant option.

The documentation on using AWS Batch with Nextflow is minimal, and blogs like this suggest that most users do have trouble figuring out the configuration steps on the AWS end of things. Granted, the Nextflow-end configuration is super simple, but for wide adoption users must be able to set up both Nextflow and AWS.
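
For context, the Nextflow side really is only a few lines. A minimal sketch of a `nextflow.config` for AWS Batch, assuming a job queue, an S3 bucket, and a region already exist on the AWS side. All names below are hypothetical placeholders, and the CLI path assumes the AWS CLI was installed at that location in the host AMI:

```groovy
// nextflow.config - illustrative sketch only; every value is a placeholder
process.executor  = 'awsbatch'                          // run each process as an AWS Batch job
process.queue     = 'my-batch-queue'                    // hypothetical Batch job queue name
workDir           = 's3://my-bucket/work'               // hypothetical S3 bucket for intermediate files
aws.region        = 'us-east-1'                         // assumed region
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'  // assumed AWS CLI location on the host AMI
```

Everything this file points to (the queue, its compute environment, the IAM roles, the bucket, the AMI) has to be created on the AWS side first, and that is the part users struggle with.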

Suggest implementation

  • Improve documentation on Nextflow's docs pages by drawing from blogs such as this or this. Even better, provide a CloudFormation stack that automatically sets up the requirements on the AWS end of things (I believe, but am not 100% sure, that WDL/Cromwell provides a CloudFormation stack for the same purpose).

All 11 comments

There are some details about a CloudFormation template here:

https://docs.opendata.aws/genomics-workflows/orchestration/nextflow/nextflow-overview/

This is awesome. Is that referenced from Nextflow's docs and I just missed it? Thanks for pointing me to this resource!

Not that I know of, but it may be. I saw it on Twitter and via a contact at AWS.

Sweet! So the "resources" stack worked, but the "All-in-one" stack did not and was rolled back. Is there any Nextflow dev curating this resource?

I think you can report it in this GitHub repo.

OK, done. In general, are there plans to further streamline this specific backend support? This is one of the areas where getting feature-parity with WDL would help tremendously (I heard AWS consultants specifically recommending Cromwell since they mainly look at the execution side of things, not necessarily the flexibility of the language). A quality AWS genomics workflow in the referenced Github repo is a great and necessary first step - just need some further polishing. The setup steps and the general architecture seem a little more complicated than Cromwell's, though. Also, Cromwell's architecture, with a server running and managing requests via REST, seems elegant and enabling.

This is one of the areas where getting feature-parity with WDL would help tremendously

To what extent is Cromwell better than NF for Batch execution?

A quality AWS genomics workflow in the referenced Github repo is a great and necessary first step - just need some further polishing

I'm not sure I understand which repo you are referring to.

1) Documentation. The repo I was referring to is the one you linked to above, here. I think this can be improved by:

  • generally clarifying the workflow. Reading through the Nextflow document, IMO it is not sufficiently clear which steps are captured by the CloudFormation templates and which need to be done manually.
  • certain details are unnecessary for the implementation of the workflow and make the document less concise, e.g. the full code of the entrypoint script. This could be provided but just as an external link, keeping the document lean and focused.
  • the Cromwell architecture schematic is a) simpler to understand and b) uses official AWS symbols - it just looks more polished and pro than Nextflow's. I know it's nitpicky, but there are people who base their choices on things like how polished and professional a schematic looks.
  • the Cromwell docs have an Overview and an Example page. This is useful. Nextflow has only the Overview page.
  • the Cromwell AIO CloudFormation stack works, Nextflow's doesn't (in my hands) - that is the bug I originally referred to in this issue and reported to the GitHub repo referenced above.
  • Nextflow's official docs need to be tightly integrated (IMO) with the opendata.aws docs as far as AWS Batch is concerned. I get that the Nextflow docs deal with Nextflow-specific configuration, and that the AWS Batch configuration is separate from Nextflow. However, unlike other backends (e.g. academic clusters) that are typically set up and maintained by institutional DevOps, AWS Batch is there for everyone - meaning every Nextflow end user. So every Nextflow end user needs to be able to seamlessly set up AWS Batch to run with Nextflow, and the more painless the experience is, the better.

2) Actual things that could be improved.

  • there is a note in the docs to make a manual change to enable expandable scratch space. As an end user, I would always want expandable scratch space (possibly with an upper limit, though), and I can't see drawbacks to it (especially, again, if an upper limit could be set) - this should be the default configuration: one note less, one manual step less, less time from decision to execution.
  • there are resources that are shared by all users. For example, the Nextflow Docker container is provided as a Dockerfile. Not sure if that would be possible with the actual permissions, but could it be made available as an actual Docker image in a public registry?
  • it is not clear why the AWS CLI should be installed through miniconda. This is at odds with the Nextflow documentation, which only mentions that the AWS CLI should be present. But in general this is also, I think, an unnecessary detail that could be delegated to more specific docs. In general, users won't have the AWS CLI in their containers (as the docs acknowledge). So the best option would be just to provide a host AMI with everything in place, period, without getting into details invisible to users. If the AMI has the right setup, it will work no matter whether the AWS CLI is also present (or absent) in the tool containers. This way, one can drop all the discussion regarding the AWS CLI from the main "initial setup" page. Plus, having a pre-configured AMI provided in the public Amazon marketplace would allow the docs to just refer to that, also dropping the need to explain the custom AMI (which can still be done in the more detailed docs, but not on the "initial setup" page). The more turn-key this looks, the better.
  • (less important) Having a server running like Cromwell's, with a REST API to start/stop/monitor jobs, seems useful. It would allow complex configurations and interfacing with additional tools.

At the end of the day, most suggestions here relate to a more turn-key setup (I think there is ample room to streamline it) and clearer docs. It makes more of a difference than it may seem for busy users who want to spend their time on bioinformatics and not on DevOps :-)

I completely agree, though I have no control over those docs. It would make sense to leave the same comment in that repo.

OK, I left a note asking 1) whether they are open to feedback and 2) what the membership rules are for that repo (I can probably help with some docs, but it would be great if the developers of the tool being documented also got access ;-) to correct inaccuracies).

Thanks
