This was discussed in a comment on a separate issue, but I figured it merited its own feature request, so I'll repeat here:
I can't provide Kedro as a library to AWS Glue, because its dependency list includes libraries that rely on C extensions, and those break on Glue.
One thought this raises for me is the possibility of a version of Kedro that is essentially a pure-Python 'Kedro Core' library with no io or contrib.io datasets built in (besides the core AbstractDataSet), leaving each of those to be pip-installed separately as io plugins based on one's needs.
That way I could provide this hypothetical Kedro core library to Glue without worrying that it will choke on trying to include pandas or numpy (I can't use any of those io DataSets in Glue anyway).
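To make the idea concrete, here is a minimal sketch of what such a split could look like. This is not Kedro's actual code: the `AbstractDataSet` below is a simplified stand-in for the real class, and `InMemoryDataSet` and `PandasCSVDataSet` are hypothetical names used only for illustration. The point is that the core module imports nothing outside the standard library, while the plugin only touches pandas when data is actually loaded.

```python
from abc import ABC, abstractmethod
from typing import Any


# --- would live in the hypothetical pure-Python "core" package ---
class AbstractDataSet(ABC):
    """Simplified stand-in for the core dataset interface (assumption)."""

    @abstractmethod
    def load(self) -> Any:
        ...

    @abstractmethod
    def save(self, data: Any) -> None:
        ...


class InMemoryDataSet(AbstractDataSet):
    """Hypothetical dataset with no third-party dependencies."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


# --- would live in a separately installed io plugin package ---
class PandasCSVDataSet(AbstractDataSet):
    """Hypothetical plugin dataset; pandas is only needed when it is used."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def load(self) -> Any:
        import pandas as pd  # lazy import: the core never pulls in pandas

        return pd.read_csv(self._filepath)

    def save(self, data: Any) -> None:
        # Assumes `data` is a pandas DataFrame.
        data.to_csv(self._filepath, index=False)
```

With a layout along these lines, an environment like Glue would only need the core package, and anyone who wants the pandas-backed dataset would install the plugin (and pandas) alongside it.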
Since then, I took this thought a step further by forking Kedro and coarsely removing the non-core functionality (branch here) that causes Kedro to depend on pandas, numpy, and other libraries that I considered not part of the 'core' Kedro runtime context/catalog/pipeline/node machinery. By providing my forked "Kedro core" branch to AWS Glue, I have been able to deploy my Kedro project and run it in Glue successfully 🎉
This opens up the opportunity for Kedro to handle a purely PySpark pipeline use case and to allow for simple deployment to AWS Glue, a good choice for running Spark in the cloud without having to manage one's own cluster.
I've also been using the AWS CDK library, and thought Kedro could take a similar approach to CDK's: provide a 'core' library and ship every other use-case-specific 'io' plugin as a separate small library that can be installed as needed. For example, see https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#hello_world_tutorial_add_bucket
Good to hear back from you @sarchila! That is excellent news, thank you for sharing! This was added to our backlog a while ago, with a view to deliver in 2020. We welcome any contributions in this space if you are interested. :)
We're on our way to resolving this issue! We're launching these datasets in the next release: https://github.com/quantumblacklabs/kedro/tree/develop/kedro/extras/datasets
We will give users time to switch over to them. The major release following this will have the io and contrib dependencies removed from Kedro.
Great news @yetudada - thanks so much for your team's responsiveness on this issue 🙌
@sarchila this issue can finally be closed. Commit ecd7277 has addressed this change. Thank you so much for submitting this request!