Note: I was going to hold this back for Cylc9; however, it is becoming apparent that it might be worth considering in the 8.x timeframe.
Note: This functionality might be best located in another repository. I'm putting it in Cylc Flow for discussion; we can move it if approved.
It is now possible to import arbitrary Python modules in the flow.cylc file via Jinja2:
flow.cylc
#!Jinja2
{% import "numpy" as np %}
{# ... #}
Or, preferably from within Python files loaded into Jinja2:
lib/python/process_parameters.py
import numpy as np
# ...
Which is great, provided that the modules you want are actually installed in the Cylc environment. If you maintain your own Cylc environment then you can maintain your own Conda recipes and build environments which cater to the needs of your workflows 🤮. This is terrible for portability, reproducibility and just general niceness.
Surely the dependencies should lie with the system just like in normal Python projects?
At the moment this is an issue which hits users who are doing more advanced Jinja2 work, especially as Cylc7 is Python2.
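As a rough illustration of the problem, a workflow author currently has no way to declare what they need, only to discover what is missing. A minimal sketch of such a check, using only the standard library (the helper name `missing_modules` is made up for this example and is not part of Cylc):

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None for a top-level module that is not installed
    return [name for name in names if find_spec(name) is None]


# "json" ships with Python; the second name is deliberately bogus
print(missing_modules(["json", "surely_not_installed_xyz"]))
```

This only reports the gap; filling it is what the proposal below is about.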
When Cylc9 arrives and workflow definitions start being written in Python the ability to add to the Cylc environment will become much greater.
flow.py
"""My Cylc9 workflow."""
# we have at least one community loading data from excel spreadsheets
# which defines the workflow, not the nicest solution but that's the accepted
# interchange format in their area
from excel import OpenExcel
data = OpenExcel('data.xls')
# loading data from netcdf has also cropped up
from netCDF4 import Dataset
rootgrp = Dataset("test.nc", "w", format="NETCDF4")
# many people use these troublesome twins for everything data
import numpy as np
import pandas as pd
from cylc.flow import Flow
# ...
We add a plugin to Cylc which allows the Cylc environment to be extended with a virtual environment. We install required dependencies into this environment, e.g. with pip or conda.
When necessary, Cylc would re-invoke the cylc run command inside this virtual environment. This environment would not be available to jobs (circa #3712); it's just for processing the workflow definition.
Whilst this plugin could be implemented in different ways, I think the most logical candidate might be pipenv, which is a system for spinning up environments with virtualenv and installing packages with pip.
Pipenv would take care of the creation and management of virtual environments for us, making this plugin pretty simple to write.
Here's a quick example of how pipenv could be used:
$ # activate your cylc8 environment (with pipenv installed in it)
$ conda activate cylc8
$ # configure pipenv to work with the "parent" environment
$ cd my-workflow/
$ pipenv --python $(command -v python3)
$ pipenv --site-packages
$ # you now have a virtualenv which "extends" the parent environment
$ pipenv run python3 -c 'from cylc.flow import __version__; print(__version__)'
8.0a3.dev
$ # install workflow definition dependencies using a pip-like interface
$ pipenv install cowsay
$ pipenv run python3 -c 'from cowsay import cow; cow("moo")'
  ___
< moo >
  ===
        \
         \
           ^__^
           (oo)\_______
           (__)\       )\/\
               ||----w |
               ||     ||
$ # the lockfile shows what's installed
$ cat Pipfile.lock
{
    "_meta": {
        "hash": {
            "sha256": "7f086388cf5c03c7072a870415dc29f71218d8aee191fe6507d33176521a4af8"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.7"
        },
        "sources": [
            {
                "name": "pypi",
                "url": "https://pypi.org/simple",
                "verify_ssl": true
            }
        ]
    },
    "default": {
        "cowsay": {
            "hashes": [
                "sha256:7ec3ec1bb085cbb788b0de1e762941b4469faf41c6cdbec08a7ac072a7d1d6eb",
                "sha256:debde99bae664bd91487613223c1cb291170d8703bf7d524c3a4877ad37b4dad"
            ],
            "index": "pypi",
            "version": "==2.0.3"
        }
    },
    "develop": {}
}
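Because the lockfile is plain JSON, a plugin (or a curious user) could inspect it without calling pipenv at all. A small sketch, assuming only the `default`/`develop` layout visible in the lockfile above; the function name `locked_versions` is hypothetical:

```python
import json


def locked_versions(lock_text, section="default"):
    """Map package name to pinned version for one section of a Pipfile.lock."""
    lock = json.loads(lock_text)
    return {
        name: entry.get("version", "")
        for name, entry in lock.get(section, {}).items()
    }


# a cut-down copy of the lockfile shown above
sample = '{"default": {"cowsay": {"index": "pypi", "version": "==2.0.3"}}, "develop": {}}'
print(locked_versions(sample))
```

In practice the text would come from reading the workflow's `Pipfile.lock` rather than an inline string.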
The Cylc Pipfile plugin would re-invoke any cylc (run|restart|get-config) commands based on the presence of a Pipfile file.
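The re-invocation logic could be quite small. The sketch below is only an illustration of the idea, not a proposed implementation: `build_command` and `WRAPPED_COMMANDS` are made-up names, and the `PIPENV_ACTIVE` check assumes pipenv sets that variable inside `pipenv run` (worth verifying against the pipenv version in use).

```python
import os
from pathlib import Path

# the commands named above; anything else runs in the plain Cylc environment
WRAPPED_COMMANDS = {"run", "restart", "get-config"}


def build_command(argv, workflow_dir, environ=None):
    """Return the argv to execute for `cylc <argv...>`."""
    environ = os.environ if environ is None else environ
    command = ["cylc", *argv]
    needs_env = (
        bool(argv)
        and argv[0] in WRAPPED_COMMANDS
        and (Path(workflow_dir) / "Pipfile").is_file()
        # assumption: set by pipenv, used here to avoid re-invoking forever
        and not environ.get("PIPENV_ACTIVE")
    )
    return ["pipenv", "run", *command] if needs_env else command
```

The caller would then execute the result with the workflow directory as the working directory (for example via `subprocess.run(cmd, cwd=workflow_dir)`), since pipenv locates the Pipfile relative to where it is run.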
Poetry is another tool with a very similar feature set to Pipenv; it too can work with virtual environments, create lock files, etc. It also has a streamlined system for PyPI publishing and other niceness.
[...] pipenv so the actual "work" involved here is not as great as the length of the discussion might make it seem.
[...] pipenv not work out.
Users would still be coupled to the same Python version as the "parent" Cylc environment.
Python modules would get installed once for each workflow which uses them.
[...] npm/yarn.
[...] pipenv --rm, this represents a management overhead.
This would restrict users to installing stuff into Python virtual environments.
[...] pip rather than conda.
Is pip installation sufficient?
What implementation should we go for (virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)?
Does this seem sensible? Can anyone see any potential issues I've not thought of?
It sounds good to me, +1. Can't think of any potential issues right now.
What other purposes might it be desirable to extend the Cylc environment for? Does this approach work for those purposes?
Not sure if directly related to extending Cylc environment, but maybe run tasks or sub-workflows with containers?
With Nextflow you can run an entire workflow with Docker, while also using Conda.
nextflow run <your script> -with-docker [docker image]
Every time your script launches a process execution, Nextflow will run it into a Docker container created by using the specified image. In practice Nextflow will automatically wrap your processes and run them by executing the docker run command with the image you have provided.
nextflow.preview.dsl=2

process foo {
    conda 'numpy pandas matplotlib'

    output:
      path 'foo.txt'

    script:
      """
      your_command > foo.txt
      """
}

process bar {
    container = 'image_name'

    input:
      path x

    output:
      path 'bar.txt'

    script:
      """
      another_command $x > bar.txt
      """
}

workflow {
    data = channel.fromPath('/some/path/*.txt')
    foo()
    bar(data)
}

docker {
    enabled = true
}
The conda directive will create a new environment, install the packages, and run the process. The container directive specifies the container image to be used, which could have pip, conda, Alpine Linux + Python apk packages, etc.
Same with Airflow using the Docker operator
with DAG('docker_dag', default_args=default_args, schedule_interval="5 * * * *", catchup=False) as dag:
    t1 = BashOperator(
        task_id='print_current_date',
        bash_command='date'
    )
    t2 = DockerOperator(
        task_id='docker_command',
        image='centos:latest',
        api_version='auto',
        auto_remove=True,
        command="/bin/sleep 30",
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge"
    )
    t3 = BashOperator(
        task_id='print_hello',
        bash_command='echo "hello world"'
    )
    t1 >> t2 >> t3
It also has PythonVirtualenvOperator, but it works differently. It creates a venv to execute whatever Python function you give it, then destroys the environment. Similar to a container, but using venv.
generate_data_one = PythonVirtualenvOperator(
    task_id='generate_data_to_gcs_one',
    python_callable=data_generator,  # a python function
    requirements=[
        'google-cloud-storage==1.28.1',
        'DateTime==4.3',
    ],
    dag=dag,
    system_site_packages=False,
)
# after the task runs, the venv is destroyed
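The create/run/destroy pattern can be approximated with nothing but the standard library. This is not how Airflow implements the operator; it is a minimal sketch of the same idea, and `run_in_temporary_venv` is a made-up name. No packages are installed here, so the requirements step is left out.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path


def run_in_temporary_venv(code, system_site_packages=True):
    """Run a Python snippet in a throwaway venv and return its stdout."""
    # the temporary directory, and the venv inside it, is removed on exit
    with tempfile.TemporaryDirectory() as tmp:
        builder = venv.EnvBuilder(
            system_site_packages=system_site_packages,  # "extend" the parent env
            with_pip=False,  # nothing is installed in this sketch
            symlinks=(os.name != "nt"),
        )
        builder.create(tmp)
        if os.name == "nt":
            python = Path(tmp) / "Scripts" / "python.exe"
        else:
            python = Path(tmp) / "bin" / "python"
        result = subprocess.run(
            [str(python), "-c", code],
            capture_output=True, text=True, check=True,
        )
        return result.stdout


# inside a venv, sys.prefix differs from sys.base_prefix
print(run_in_temporary_venv("import sys; print(sys.prefix != sys.base_prefix)"))
```

With `system_site_packages=True` the throwaway environment can still see the parent's packages, which is the same "extension" behaviour as `pipenv --site-packages` in the example further up.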
Even snakemake supports a mix of Conda/containers
# snakemake --use-conda --use-singularity
container: "docker://continuumio/miniconda3:4.4.10"

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"
Is pip installation sufficient?
I think it's probably the easiest to start with.
What implementation should we go for (virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)
Probably the easiest one, just to prove it works. Then either extend the plug-in to support others, or have separate plug-ins if necessary.
Just my 0.02 cents :+1:
Bruno
Does this seem sensible?
Yep
What other purposes might it be desirable to extend the Cylc environment for? Does this approach work for those purposes?
I guess the main intent here is to allow install and import of arbitrary Python packages during construction of the workflow definition (e.g. to automatically create a workflow based on the content of a netcdf data file, as you noted).
I guess another use, since the scheduler environment is also used to execute job submissions, xtriggers, and event handlers, would be to make those a lot more extensible and customizable. I guess that connects with @kinow's suggestion above.
But we need to keep containers in mind too, especially for job (and sub-workflow) execution.
Is pip installation sufficient?
What implementation should we go for (virtualenv, pipenv, heck even conda is almost an option - just a very heavyweight one which would require duplicating the whole env)
It seems reasonable to me to (at least initially) stick with Python and go as lightweight as possible.
since the scheduler environment is also used to execute job submissions, xtriggers, and event handlers
Yep this is what I had in mind.
A simple but inevitable use case: pipenv install cylc-xtriggers!
Especially with #3497 it will almost certainly be necessary to add Python modules to the environment in order to build effective async xtrigger functions.
Not sure if directly related to extending Cylc environment, but maybe run tasks or sub-workflows with containers?
I think that the job execution environment is a different topic. Tasks should define their own environments which shouldn't interact with the "suite environment".
But +1 for docker support, this is something multiple people have asked us about and for pretty obvious reasons. I think this is something else we could handle with a plugin which would be installed on the job hosts and get run in the job environment itself. It's a somewhat tricky plugin to write since the jobscript is in bash (add another argument to the "re-write it in Python" list), but once we've got that interface worked out it should be fairly simple to do something like this:
[runtime]
    [[foo]]
        script = command to run in docker container
        platform = my-platform
        [[[job]]]
            docker image = user/image-name
Need to do some more thinking on this though...
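To make the idea slightly more concrete, the core of such a plugin might amount to wrapping the task script in a `docker run` call. This is purely a sketch: `docker_wrap` is a hypothetical helper, the `docker image` setting does not exist in Cylc, and a real version would also need to deal with volumes, environment variables and the job's messaging back to the scheduler.

```python
import shlex


def docker_wrap(script, image):
    """Return the argv that runs a task script inside a container."""
    # --rm removes the container once the script exits
    return ["docker", "run", "--rm", image, "bash", "-c", script]


cmd = docker_wrap("echo hello", "user/image-name")
# shlex.join gives a copy-and-paste friendly form for logs
print(shlex.join(cmd))
```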
Note #3712 is quite important for this one, as otherwise this "suite environment" would leak into the local job execution environment, providing an unreliable proxy for job dependency installation.
Would this solution work for deployment in production environments, which might be walled off from internet access for package installation?
pipenv uses pip underneath so any cache or local package repositories you can configure pip to install from should work.
Options for offline installation with pip:
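One such option is to install from a local directory of wheels, using pip's `--no-index` and `--find-links` flags. A minimal sketch of building that command; the helper name `offline_install_command` and the wheel directory path are made up for illustration, and the directory is assumed to have been filled beforehand (e.g. with `pip download` on a machine that does have internet access):

```python
import sys


def offline_install_command(packages, wheel_dir):
    """Build a pip command that installs only from a local wheel directory."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                    # never contact a package index
        "--find-links", str(wheel_dir),  # look for archives here instead
        *packages,
    ]


print(offline_install_command(["cowsay"], "/opt/wheels"))
```

Whether pipenv passes such settings through unchanged would need checking; the point is only that pip itself supports this mode.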
I think anything more than that would have to be considered as beyond the scope of Cylc, though the development of this plugin would leave behind an interface permitting alternative implementations.
Added Poetry to the issue above as I've just learned that it can manage per-project virtual environments too, and I thought it was just a packaging tool.
I've not tested the "environment extension" in Poetry, but it's likely achievable via a similar route.
Also note that both Pipenv and Poetry support storing the virtual environments inside the project itself (like yarn, npm) which is nice.