Kedro: [KED-1242] Kedro 'core' library without included io DataSets or contrib.io

Created on 3 Dec 2019 · 4 Comments · Source: quantumblacklabs/kedro

Description

This was discussed in a comment on a separate issue, but I figured it merited its own feature request, so I'll repeat it here:

I can't provide Kedro as a library to AWS Glue because its dependency list includes libraries that rely on C extensions, and those break on Glue.

One thought this raises for me is the possibility of having a version of Kedro that is essentially a pure-Python 'Kedro Core' library with no io or contrib.io datasets built in (besides the core AbstractDataSet), leaving each of those to be pip-installed separately as io plugins based on one's needs.

That way I could provide this hypothetical Kedro core library to Glue without worrying that it will choke on trying to include pandas or numpy (I can't use any of those io DataSets in Glue anyway).
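
To make the idea concrete, here is a minimal sketch of the split, assuming a pure-Python core that exposes only an abstract dataset interface. `CoreAbstractDataSet` and `JSONLinesDataSet` are hypothetical stand-ins I made up for illustration, not Kedro's actual classes; the point is only that neither the core class nor this dataset needs pandas or numpy.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class CoreAbstractDataSet(ABC):
    """Hypothetical stand-in for the single dataset class a 'core' package keeps."""

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)

    @abstractmethod
    def _load(self) -> Any:
        """Read and return the data."""

    @abstractmethod
    def _save(self, data: Any) -> None:
        """Persist the data."""


class JSONLinesDataSet(CoreAbstractDataSet):
    """Hypothetical plugin dataset that relies only on the standard library."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def _load(self) -> List[Dict[str, Any]]:
        with open(self._filepath, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def _save(self, data: List[Dict[str, Any]]) -> None:
        with open(self._filepath, "w", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")


def roundtrip_example() -> List[Dict[str, Any]]:
    """Save two records through the plugin dataset and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = JSONLinesDataSet(os.path.join(tmp, "records.jsonl"))
        dataset.save([{"id": 1}, {"id": 2}])
        return dataset.load()
```

In this arrangement a pandas-backed dataset would live in its own package and subclass the same abstract class, so only the environments that install that package would pull in the C-extension dependencies.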

Since then I ended up taking my thought a step further by forking Kedro and coarsely removing the non-core functionality (branch here) that causes Kedro to depend on pandas, numpy, and other libraries that I considered not part of the 'core' Kedro runtime context/catalog/pipeline/node machinery. By providing my forked "Kedro core" branch to AWS Glue, I have been able to deploy my Kedro project and run it in Glue successfully 🎉

Context

This opens up the opportunity for Kedro to handle a purely PySpark pipeline use case and allows for simple deployment to AWS Glue, a good choice for running Spark in the cloud without having to manage one's own cluster.

Possible Implementation

I've also been using the AWS CDK library, and thought Kedro could use an approach similar to CDK's: providing a 'core' library and shipping every other use-case-specific 'io' plugin as a separate small library that can be installed as needed. For example, see https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#hello_world_tutorial_add_bucket
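
One way a core library could stay importable when a plugin's heavy dependency is absent is to defer the import until a dataset actually needs it. The helper below is a rough sketch of that idea only; `require_optional` and the package name `kedro-io-pandas` are hypothetical and do not exist in Kedro.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, plugin_name: str) -> ModuleType:
    """Import a module on demand, pointing to the (hypothetical) plugin if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install the plugin that provides it, e.g.: pip install {plugin_name}"
        ) from exc
```

A plugin dataset could call `require_optional("pandas", "kedro-io-pandas")` inside its load method, so that importing the core package on Glue never touches pandas.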

Opportunity Roadmap

sarchila

All 4 comments

Good to hear back from you @sarchila! That is excellent news, thank you for sharing! This was added to our backlog a while ago, with a view to delivering it in 2020. We welcome any contributions in this space if you are interested. :)