Kedro: [KED-1273] Using transformers to specify Python objects in DataSets

Created on 7 Dec 2019 · 8 Comments · Source: quantumblacklabs/kedro

I just wanted to say first off, I love the software so far. Thank you for releasing this!

I do, however, want to talk about datasets.

Kedro has structured all of its datasets around the pandas dataframe, and I'm not in love with this.

the issue

First off, pandas is a very large package to pull in just for IO, and it unfortunately suffers from a volatile API. For me, it's not obvious that CSV files ought to be loaded into, or written from, a pandas DataFrame, and I really wouldn't want to include pandas as a dependency for my project if I'm not using it at all, especially if I'm dockerizing the package. Returning a file descriptor, numpy array, pyarrow table, or something else might be a better choice depending on the use case. The name CSVLocalDataSet certainly doesn't imply anything about the pandas library; I could easily imagine passing it a nested list instead.

the proposal

My proposal would be to consider adding some modularity around which type of Python object you'd like to get in and out of your datasets. Namely, pandas DataFrame support should be a Transformer class inheriting from AbstractTransformer and compatible with each of these dataset types (with an optional pandas dependency on import if possible). I think it would be great to have each of the datasets return more natural file descriptors, and then have numpy, pandas, dask, etc. transformers which specify how you want to save and load these things to and from a CSV.

This gives some nice freedom to users to choose how to handle the dataset objects. It also promotes a nice way of thinking about datasets and transformations from various python objects, abstracting the python data classes out of the fileIO.
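To make the proposal concrete, here is a minimal sketch of the split being described: a dataset that only moves raw text, and a separate transformer that decides which Python object the user gets. The class names `RawCSVDataSet` and `ListOfRowsTransformer` are hypothetical and are not Kedro's API; the in-memory buffer stands in for a real file, and only the standard library is used.

```python
import csv
import io
from typing import Any, Callable, List


class RawCSVDataSet:
    """Hypothetical dataset that only moves CSV text in and out of storage."""

    def __init__(self) -> None:
        self._buffer = ""  # stands in for a file on disk

    def load(self) -> io.StringIO:
        # Hand back a file-like object instead of a DataFrame.
        return io.StringIO(self._buffer)

    def save(self, text: str) -> None:
        self._buffer = text


class ListOfRowsTransformer:
    """Hypothetical transformer: CSV text <-> list of lists."""

    def load(self, load: Callable[[], io.StringIO]) -> List[List[str]]:
        return list(csv.reader(load()))

    def save(self, save: Callable[[str], None], rows: List[List[Any]]) -> None:
        out = io.StringIO()
        csv.writer(out, lineterminator="\n").writerows(rows)
        save(out.getvalue())


dataset = RawCSVDataSet()
transformer = ListOfRowsTransformer()
transformer.save(dataset.save, [["id", "value"], [1, 2.5]])
rows = transformer.load(dataset.load)
```

A pandas or numpy transformer would follow the same shape and would be the only place where those libraries get imported.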

Thoughts?

Feature Request Discussion

dasturge

👍2

Most helpful comment

That makes sense, but I guess what I might suggest is two things:

1. make the pandas dependency optional: i.e. only on import of pandas io classes (for smaller kedro docker containers when possible)
2. change the naming conventions on the datasets to reflect use of pandas

What's awkward for me is creating a numpy or pure Python CSV loader. In extending the kedro io classes, I end up writing:

NumpyCSVLocalDataSet, ListCSVLocalDataSet, or DaskArrayCSVLocalDataSet, while the |X| element (the Python object type) is hidden in the names of the standard kedro io classes. It was a bit awkward to figure out how I ought to name the new datasets I had to create, when the name CSVLocalDataSet describes them just as naturally.

I understand this is getting verbose, but once you integrate fsspec it sounds like you'll be eliminating the fs aspect, so you'd end up with:

CSVDataSet

So, perhaps naming it PandasCSVDataSet or DataFrameCSVDataSet instead is more clear, and offers a natural naming convention for users implementing their own datasets for CSV io.
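As an illustration of the kind of dataset I mean, here is a rough sketch of a pure Python CSV dataset whose name states the object type. It is a standalone class written only for this comment: it mirrors the `_load`, `_save` and `_describe` hooks that Kedro's `AbstractDataSet` asks subclasses to implement, but it deliberately does not inherit from it so that it runs with the standard library alone. The `object_type` key is an assumption of mine.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class ListCSVLocalDataSet:
    """Hypothetical dataset: local CSV file <-> list of lists, no pandas."""

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def _load(self) -> List[List[str]]:
        with self._filepath.open(newline="") as f:
            return list(csv.reader(f))

    def _save(self, data: List[List[Any]]) -> None:
        with self._filepath.open("w", newline="") as f:
            csv.writer(f).writerows(data)

    def _describe(self) -> Dict[str, Any]:
        # The object type is explicit here as well as in the class name.
        return {"filepath": str(self._filepath), "object_type": "list"}


with tempfile.TemporaryDirectory() as tmp:
    data_set = ListCSVLocalDataSet(str(Path(tmp) / "example.csv"))
    data_set._save([["a", "b"], [1, 2]])
    loaded = data_set._load()
    description = data_set._describe()
```

Note that values come back as strings, since the `csv` module does no type inference.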

dasturge on 19 Dec 2019

👍3

All 8 comments

Oops, I wanted to put a discussion tag on there

dasturge on 7 Dec 2019

Thank you so much for raising this @dasturge. I'm going to answer this one by breaking your issue into two parts:

1. pandas as a dependency
2. Python object modularity for the datasets

On 1. you'll be glad to know that we have it on our backlog to remove pandas and numpy as core dependencies in Kedro. The issue evolved out of a request to create a Kedro-Glue plugin and users not being able to do this because of our dependency on pandas and numpy for our built-in datasets (issue #57). So we're implementing a version of #178 soon.

On 2. we'll have a discussion on this one. We're looking at ways to limit the huge number of DataSets in the Data Catalog, and this might be a solution. One change you'll see in kedro next year is the use of fsspec to abstract file storage, giving us a single CSVDataSet and eventually deprecating CSVLocalDataSet, CSVS3DataSet and CSVGCSDataSet. Even in this system, though, we would still have CSVDataSet, CSVDaskDataSet and so on. So I'll circle back to you on this.

Let me tag this issue with a ticket so we can add it to our backlog for discussion.

yetudada on 11 Dec 2019

👍1

Thank you very much for using Kedro and contributing with opening an issue here, @dasturge ! We're really glad to see more people joining the conversation about shaping up Kedro's future!

The point raised here is very related to the point raised in https://github.com/quantumblacklabs/kedro/issues/31 and my comment there is also relevant for the current issue. I will try to rephrase the problem though, so we can have more context in the current discussion.

The Dataset abstraction is made purposefully this simple, due to the very large landscape of different object formats (X = {pandas, numpy, pyarrow, ...}), file formats (Y = {CSV, Parquet, Excel, Avro, ...}) and storage options (Z = {local disk, AWS S3, Azure BlobStorage, GCS, ...}). So essentially the problem space is defined to be of at least a cubic complexity (|X| * |Y| * |Z|). If we consider each of those variables to be independent from each other, we can very easily turn that into a |X| + |Y| + |Z| problem, meaning that we create one solution per category and then we just combine them in the DataCatalog by configuration as per requirement for the project at hand. However the reality is that these are not independent - e.g. the way you load a CSV file into a Pandas DataFrame is done in one way (pandas.read_csv() for X=pandas and Y=CSV) and if you want to load it into a python list of lists, that'd be a totally different way for X=list and Y=CSV. So you have to count |X| * |Y| solutions rather than |X| + |Y| solutions and no matter how you represent them, they will always be in a quadratic order O(|X| * |Y|).
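A small sketch of why the count is |X| * |Y|, assuming a toy setup with two object types and two file formats. The registry and function names are made up for illustration and are not part of Kedro; each (object type, file format) pair needs its own loader because the parsing logic differs per pair.

```python
import csv
import json
from typing import Any, Callable, Dict, List, TextIO, Tuple


def csv_to_rows(stream: TextIO) -> List[List[str]]:
    return list(csv.reader(stream))


def csv_to_records(stream: TextIO) -> List[Dict[str, str]]:
    return [dict(record) for record in csv.DictReader(stream)]


def json_to_records(stream: TextIO) -> List[Dict[str, Any]]:
    # Assumes the JSON document is an array of flat objects.
    return json.load(stream)


def json_to_rows(stream: TextIO) -> List[List[Any]]:
    records = json.load(stream)
    if not records:
        return []
    header = list(records[0])
    return [header] + [[record[key] for key in header] for record in records]


# One entry per (object type X, file format Y) pair.
LOADERS: Dict[Tuple[str, str], Callable[[TextIO], Any]] = {
    ("rows", "csv"): csv_to_rows,
    ("records", "csv"): csv_to_records,
    ("rows", "json"): json_to_rows,
    ("records", "json"): json_to_records,
}

object_types = {x for x, _ in LOADERS}
file_formats = {y for _, y in LOADERS}
```

With two object types and two formats the registry already holds four loaders, and adding a third object type would require one new loader per existing format.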

Recently we found a way to cheat a little by using fsspec, which collapses multiple storage options into one file system option. Since Z is quite well represented by fsspec, the complexity of the problem de facto decreased from cubic to quadratic: it made |Z| ~= O(1), so now we have O(|X| * |Y| * 1). This is only possible because almost all instances of X * Y and almost all of Z have connections to a third variable with only one instance, a Python bytestream. Luckily, the connection of Z to a bytestream comes for free through fsspec, and bytestreams generally have connections to most instances of the set X * Y (bear in mind that if pandas.read_csv() and similar functions did not already accept a bytestream as input, this would not have been achieved as easily as it is now).
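The bytestream idea can be sketched with the standard library alone. The helper `load_csv_rows` below is a hypothetical name: it is written once against a binary file-like object and then works for any storage backend that can hand over such a stream. Here the two "backends" are an in-memory buffer and a local file; a stream obtained from `fsspec.open(url, "rb")` is assumed to fit the same interface but is not exercised in this sketch.

```python
import csv
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, List


def load_csv_rows(stream: BinaryIO) -> List[List[str]]:
    """Parse CSV rows from any binary stream, wherever it came from."""
    text = io.TextIOWrapper(stream, encoding="utf-8", newline="")
    try:
        return list(csv.reader(text))
    finally:
        text.detach()  # leave the caller's stream open


payload = b"a,b\n1,2\n"

# Storage option 1: an in-memory buffer.
rows_from_memory = load_csv_rows(io.BytesIO(payload))

# Storage option 2: a file on local disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"
    path.write_bytes(payload)
    with path.open("rb") as f:
        rows_from_disk = load_csv_rows(f)
```

The loader never learns which storage option produced the stream, which is what lets the Z dimension drop out of the count.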

What you are suggesting is that we use transformers to decouple X from Y. Unfortunately that is hard to do in a generic way, because we would not be able to leverage all the optimisations that functions like pandas.read_csv() provide, which would make loading larger datasets much slower. Also, in order to do that, we would need a new common intermediary format which all instances of X and Y can work with. We haven't found such a format yet. As far as I am aware, Apache Arrow aims to be exactly that for a representative subset of file formats and objects, but it is only applicable to tabular data. If we naively implement transformers between each pair of X and Y, we'll still end up with |X| * |Y| transformations, which is no better than just making |X| * |Y| datasets.
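For illustration only, here is what a common intermediary format could look like in a toy setting. A list of dictionaries plays the role of the intermediary, so that |Y| readers and |X| builders cover every combination. All names are hypothetical, and the sketch also shows the cost: every load materialises generic Python records first, so format-specific fast paths are bypassed.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List, TextIO

Records = List[Dict[str, Any]]  # the hypothetical intermediary format


def read_csv(stream: TextIO) -> Records:
    return [dict(record) for record in csv.DictReader(stream)]


def read_json(stream: TextIO) -> Records:
    # Assumes the JSON document is an array of flat objects.
    return json.load(stream)


def to_rows(records: Records) -> List[List[Any]]:
    if not records:
        return []
    header = list(records[0])
    return [header] + [[record[key] for key in header] for record in records]


def to_columns(records: Records) -> Dict[str, List[Any]]:
    if not records:
        return {}
    return {key: [record[key] for record in records] for key in records[0]}


READERS: Dict[str, Callable[[TextIO], Records]] = {"csv": read_csv, "json": read_json}
BUILDERS: Dict[str, Callable[[Records], Any]] = {"rows": to_rows, "columns": to_columns}


def load(file_format: str, object_type: str, text: str) -> Any:
    """Go through the intermediary: file format -> records -> object type."""
    return BUILDERS[object_type](READERS[file_format](io.StringIO(text)))


columns = load("csv", "columns", "a,b\n1,2\n3,4\n")
rows = load("json", "rows", '[{"a": 1, "b": 2}]')
```

Two readers plus two builders serve four combinations here, but only because both object types are simple tabular containers that the intermediary can represent.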

That's why we don't plan on changing the way we load data into Python objects just yet. However, if you find a neat way to do it, like the way fsspec solved the other side of the dependency, please feel free to share that with us.

idanov on 19 Dec 2019

@dasturge Absolutely, I agree that the current naming convention is obfuscating the object type and it isn't very clear what you load your data into. The Pandas dependency can also be avoided by moving all datasets to the contrib module and this way people can install the datasets they actually care about rather than kedro depending on them as you pointed out.
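One common way to keep a heavy dependency optional is to import it only when the dataset that needs it is actually used. The sketch below shows that general pattern and is not Kedro's implementation: `require`, `PandasCSVDataSet` and the `my_project[...]` extra name are all hypothetical.

```python
import importlib
from types import ModuleType
from typing import Any


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this dataset; "
            f"install it with: pip install 'my_project[{extra}]'"
        ) from exc


class PandasCSVDataSet:
    """Hypothetical dataset whose name states the object type it returns."""

    def __init__(self, filepath: str) -> None:
        # No pandas import here, so constructing the class stays cheap.
        self._filepath = filepath

    def load(self) -> Any:
        pd = require("pandas", "pandas")  # imported only on first use
        return pd.read_csv(self._filepath)


data_set = PandasCSVDataSet("data/example.csv")  # works without pandas installed
```

Only a call to `data_set.load()` would trigger the pandas import, so projects that never load a DataFrame would not need pandas in their image.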

Both of your suggestions are great and will definitely be taken into consideration before releasing 0.16.0. Thanks a lot for your help, welcome to the community and we hope to see you here more often in 2020 🎉

idanov on 24 Dec 2019

@dasturge The first part of this query is on its way. We've created kedro.extras.datasets and will start deprecating all io and contrib.io datasets to eventually remove their dependencies from Kedro.

The second part is that all of these new datasets are using fsspec for file storage abstraction. We have one CSVDataSet that allows you to load data from local file storage, S3, GCS and many more. You can check out the full registry of options here: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/registry.html

yetudada on 5 Feb 2020

@dasturge The new datasets are out, and commit ecd7277 has finally deleted contrib.io and some datasets from io. Thank you so much for submitting this request!

yetudada on 13 Mar 2020

Hi @dasturge, I'm going to close this issue. In kedro 0.16.0 you will find a modular structure for your datasets, meaning it's possible to tell whether you're working with pandas or numpy. All dependencies related to datasets have also been moved out of the core library; see the docs about this change.

We hope that you continue to use Kedro! Let us know if you have any more thoughts on this by reopening this issue or creating a new one.

yetudada on 20 May 2020
