Kedro: [KED-1242] Kedro 'core' library without included io DataSets or contrib.io

Created on 3 Dec 2019 · 4 Comments · Source: quantumblacklabs/kedro

Description

This was discussed in a comment on a separate issue, but I figured it merited its own feature request, so I'll repeat here:

I can't provide Kedro as a library to AWS Glue, because its dependency list includes libraries that break on Glue due to their reliance on C extensions.

One thought this raises for me is the possibility of a version of Kedro that is essentially a pure-Python 'Kedro Core' library with no io or contrib.io datasets built in (besides the core AbstractDataSet), leaving each of those to be pip-installed separately as io plugins, depending on one's needs.

That way I could provide this hypothetical Kedro core library to Glue without worrying that it will choke on pandas or numpy (I can't use any of those io DataSets in Glue anyway).
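To make the proposed split concrete, here is a minimal sketch, assuming a core package that ships only an abstract dataset interface and a separately installed plugin that implements it. The `AbstractDataSet` below is an illustrative stand-in, not Kedro's own class, and `JSONFileDataSet` and `roundtrip` are hypothetical names; the plugin deliberately uses only the standard library, so nothing here pulls in pandas or numpy.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any


class AbstractDataSet(ABC):
    """Illustrative stand-in for a core dataset interface (not Kedro's own class)."""

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)

    @abstractmethod
    def _load(self) -> Any:
        ...

    @abstractmethod
    def _save(self, data: Any) -> None:
        ...


class JSONFileDataSet(AbstractDataSet):
    """Hypothetical plugin dataset; it would live in its own installable package."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def _load(self) -> Any:
        with open(self._filepath, encoding="utf-8") as handle:
            return json.load(handle)

    def _save(self, data: Any) -> None:
        with open(self._filepath, "w", encoding="utf-8") as handle:
            json.dump(data, handle)


def roundtrip(data: Any) -> Any:
    """Save and then load through the plugin dataset, using a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = JSONFileDataSet(os.path.join(tmp, "example.json"))
        dataset.save(data)
        return dataset.load()
```

With this layout, the core interface has no third-party dependencies at all; only the packages that implement a given dataset would need to declare pandas, numpy, or similar.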

Since then I took the idea a step further by forking Kedro and coarsely removing the non-core functionality (branch here) that causes Kedro to depend on pandas, numpy, and other libraries that I considered not part of the 'core' Kedro runtime context/catalog/pipeline/node machinery. By providing my forked "Kedro core" branch to AWS Glue, I have been able to deploy my Kedro project and run it in Glue successfully 🎉

Context

This opens up the opportunity for Kedro to handle a purely PySpark pipeline use case and to allow simple deployment to AWS Glue, a good choice for running Spark in the cloud without having to manage one's own cluster.

Possible Implementation

I've also been using the AWS CDK library, and thought Kedro could take a similar approach to CDK's: provide a 'core' library and ship every other use-case-specific 'io' plugin as a separate small library that can be installed as needed. For example, see https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#hello_world_tutorial_add_bucket
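One way the "install plugins as needed" idea could work on the core side is lazy resolution: the core only imports a plugin module when a dataset of that type is actually requested. The following is a rough sketch under that assumption, not how Kedro or CDK implement it. `PLUGIN_REGISTRY`, `resolve_dataset`, and the module name `hypothetical_kedro_io_pandas` are all made up for illustration; the `"json"` entry points at a standard-library class purely so the example runs without extra installs.

```python
import importlib

# Hypothetical registry mapping a dataset type name to "module:attribute".
# Nothing listed here is imported until it is requested.
PLUGIN_REGISTRY = {
    "json": "json:JSONDecoder",  # standard library, always importable
    "csv": "hypothetical_kedro_io_pandas:CSVDataSet",  # made-up, not installed
}


def resolve_dataset(type_name: str):
    """Import the plugin module on demand and return the requested class."""
    module_name, _, attribute = PLUGIN_REGISTRY[type_name].partition(":")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Dataset type '{type_name}' requires the optional package "
            f"that provides '{module_name}'; install it separately."
        ) from exc
    return getattr(module, attribute)
```

Under this design, an environment such as Glue that never requests the `"csv"` type would never attempt to import its C-extension dependencies, while a missing plugin fails with a clear message only at the point of use.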

Opportunity Roadmap

All 4 comments

Good to hear back from you @sarchila! That is excellent news, thank you for sharing! This was added to our backlog a while ago, with a view to deliver in 2020. We welcome any contributions in this space if you are interested. :)

We're making progress on this issue! We're launching these datasets in the next release: https://github.com/quantumblacklabs/kedro/tree/develop/kedro/extras/datasets

We will give users time to switch to these datasets. The major release after that will remove the io and contrib dependencies from Kedro.

Great news @yetudada - thanks so much for your team's responsiveness on this issue 🙌

@sarchila this issue can finally be closed. Commit ecd7277 has implemented this change. Thank you so much for submitting this request!
