Kedro: [KED-1081] Make the folder /data/ as abstract data folder

Created on 27 Sep 2019  ·  13 Comments  ·  Source: quantumblacklabs/kedro

Description

We usually have to deal with very large text datasets (i.e. >100 GB),
so storing them in the /data/ folder is not possible.

Feature Request

Most helpful comment

Hi @arita37

Absolute paths are already supported by all filepath based AbstractDataSets.

Please have a look here to see how to define such datasets in the DataCatalog: make sure to change filepath from a relative path (data/...) to an absolute one (/data).

Working with big datasets typically involves working with cloud solutions. For that case, Kedro also supports many cloud-based datasets (see here and here).
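
To illustrate the relative-versus-absolute point, here is a minimal sketch in plain Python (no Kedro imports). The catalog entries are written as dictionaries that mirror what a catalog file might contain. The entry names, the /mnt/bigdisk mount point, the project root, and the resolve_filepath helper are all hypothetical, and the type value is an assumption about which dataset class your Kedro version provides.

```python
from pathlib import PurePosixPath

# Hypothetical catalog entries; names, type and paths are illustrative only.
catalog = {
    "small_sample": {"type": "CSVLocalDataSet", "filepath": "data/01_raw/sample.csv"},
    "big_corpus": {"type": "CSVLocalDataSet", "filepath": "/mnt/bigdisk/corpus.csv"},
}


def resolve_filepath(project_root: str, filepath: str) -> PurePosixPath:
    """Return where a filepath points when the project runs from project_root."""
    # Joining onto an absolute path discards project_root (pathlib semantics),
    # so an absolute filepath escapes the project's own data/ folder.
    return PurePosixPath(project_root) / filepath


for name, entry in catalog.items():
    print(name, "->", resolve_filepath("/home/user/my-project", entry["filepath"]))
```

The relative entry stays inside the project folder, while the absolute entry points at the external disk regardless of where the project lives.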

All 13 comments

Hello,
Thanks for the reply and details, this is useful.

This is more of a general design point (not just about the path).

Let me clarify my point:
1) Separation of source code (versionable in git) from data (not versionable in git).
Having the dataset mixed with the code is not always good practice in software development.
That's why I propose treating data/ as a folder of "abstract datasets" (i.e. represented by .yml files).

Hello @arita37

Thank you for the feedback! Let me address the comments.

  1. We don't encourage anyone to store the data files in git. The data directory is listed in the .gitignore file, so by default the files are simply ignored.
  2. You can already split your catalog into multiple files: https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#loading
  3. What's the purpose of the suggested CLI folder? Currently we assume that all additional CLI commands are put into the kedro_cli.py file, which is part of every created project and is easily editable. All the commands added there become available via kedro COMMAND ....
  4. We'll check it out later, thank you.

But none of the points above seem to relate to the original question: an abstract /data/ directory.

If you have massive files and complex ways of accessing them, you have a few options:

  1. Kedro DataSets, which can have arbitrarily complex logic behind them. We have a lot of standard and contrib data sets available, and you can easily implement your own, as the interface is simple. Any contribution back to Kedro is welcome!
  2. If you want to make it really transparent for the code, like reading from a file that does all the magic inside, you can use FUSE. Basically, you write a script (even in Python) to handle file system events on the directory.
  3. All sorts of SSHFS/GFS/FTPFS: virtual file systems that do not store the data locally but serve as a transparent proxy. They are usually based on FUSE.
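
As a rough illustration of the first option, the sketch below shows the shape a custom data set could take for a text file too large to hold in memory. It assumes the usual pattern of implementing _load, _save and _describe; DataSetSketch is a hypothetical stand-in base class so the example runs without Kedro installed, and LazyTextDataSet is an invented name, not an existing Kedro class.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator


class DataSetSketch:
    """Hypothetical stand-in for a data set base class (not Kedro's own)."""

    def load(self) -> Any:
        return self._load()

    def save(self, data: Any) -> None:
        self._save(data)


class LazyTextDataSet(DataSetSketch):
    """Hypothetical data set that streams a large text file line by line."""

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def _load(self) -> Iterator[str]:
        # Yield lines lazily so the whole file never has to fit in memory.
        with self._filepath.open(encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")

    def _save(self, data: Iterable[str]) -> None:
        with self._filepath.open("w", encoding="utf-8") as handle:
            for line in data:
                handle.write(line + "\n")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}


# Usage: the file could just as well live on an external disk or mounted share.
with tempfile.TemporaryDirectory() as tmp:
    data_set = LazyTextDataSet(str(Path(tmp) / "corpus.txt"))
    data_set.save(["first line", "second line"])
    lines = list(data_set.load())
```

In a real project the class would inherit from Kedro's own abstract data set class instead of the stand-in, and the filepath would come from the catalog.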

Does it help somehow? It would be nice to know more about your use-cases and requirements.

Hello,
Thanks for replying

Question
We are evaluating Kedro against MLflow,
which is a pipeline for ML. That's why having the data folder (as in the docs example) alongside the code seemed strange from a production point of view.

1) It is not clear what happens when the project is packaged in Docker:
which folders are kept and which are removed?

2) Model storage and serialization are not clear.
The same goes for stateful pre-processing (e.g. clustering).
How is this handled? Or do we have to do it manually?
In theory, there should be a dedicated folder for model storage (in the same style as data).

Hi @arita37, I'll leave @Flid to handle the bulk of this conversation but I'll just make a few comments about MLflow.

  • They solve orthogonal problems and can be used together: Kedro focuses on development experience, code organisation and data abstraction, while MLflow provides tracking and better support for versioning
  • We have a Medium post going out soon about how our teams use both together, leveraging MLflow's tracking ability
  • However, I definitely recommend checking out #113 to see how this team has used Kedro & MLflow together

  1. https://github.com/quantumblacklabs/kedro-docker/blob/53eb98201048fd4e2eed74bfb1738ab97ac5ad7a/kedro_docker/template/.dockerignore#L13 - data is not copied into a Docker image.
  2. data/06_models is supposed to store the models. But we don't enforce it, we just recommend it. How you serialise the model is completely up to you; Kedro has a few data sets, like PickleLocalDataSet, that might be useful for general-case scenarios.
  3. Not sure I understand the question. If you are interested in the concept of a node itself, here's the description of what it is and how it works: https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes.html. If by "data split" you mean splitting into training and test parts, it's not something Kedro is responsible for. Kedro helps you organise the pipeline code and give it a structure with added benefits. The actual logic is out of scope; other frameworks (including MLflow, which you've mentioned) should be used.
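
For the model storage point, here is a minimal sketch of pickle-based serialisation into a models folder, using only the standard library. It is an illustration of the idea rather than PickleLocalDataSet's actual implementation; save_model, load_model and the dictionary standing in for a fitted clustering model are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_model(model: Any, models_dir: Path, name: str) -> Path:
    """Pickle a model into models_dir and return the file path."""
    models_dir.mkdir(parents=True, exist_ok=True)
    target = models_dir / f"{name}.pkl"
    with target.open("wb") as handle:
        pickle.dump(model, handle)
    return target


def load_model(path: Path) -> Any:
    """Load a previously pickled model."""
    with path.open("rb") as handle:
        return pickle.load(handle)


# Usage: a temporary directory stands in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    model = {"centroids": [[0.0, 1.0], [2.0, 3.0]]}  # placeholder for fitted state
    path = save_model(model, Path(tmp) / "data" / "06_models", "clustering")
    restored = load_model(path)
```

The same approach covers stateful pre-processing steps: the fitted object is an output of one node and an input of another, and the storage location is decided in the catalog rather than in the node code.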

Thanks for your prompt reply.

1) Noted.
However, from a software engineering perspective,
data/ should be abstract (i.e. virtual)...
Suppose the code references
some CSV in the data/ folder: it runs fine locally...
But when it is transferred to Docker, it fails
because data/ is not transferred.

The framework should try to enforce better "best practices"... (at least by showing them in the docs).

  1. Well, I agree it doesn't play nicely with Docker in all cases. Still, you can run the container with an additional volume, mounting your host data under the data directory inside.
    I'm not sure what you mean by "virtual"; none of the definitions I can find seems relevant. But anyway, you have the full power of *nix: soft and hard links, FUSE, Docker volumes, virtual file systems. Kedro doesn't try to cover everything. It's actually the Unix way: build a tool that does one small thing, does it well, and integrates easily with other tools, isn't it? :)
  2. Versioning is also supported by Kedro! It's not perfect yet and we are working on it, but I think it can already do what you need.
  3. Re-reading the initial question with this information: Kedro is not a platform for organising multiple machines into a cluster, it's a framework for organising code into pipelines, and you can then run the pipeline nodes in many ways. For example, check out kedro-airflow. The same can be done for other platforms: because a Kedro pipeline is essentially a Python script in the end, you can run nodes in isolation, even on different machines. Just make sure you pass the data in between and connect it all together, which is what we do for Airflow.
  4. ⬆️
  5. I didn't, sorry, can't help.
  6. Thank you :)
  7. Again, there are lots of possible pipeline structures for multiple different use-cases. Kedro is a general purpose framework for creating pipelines. It doesn't enforce the internal node structure, doesn't provide data science tools or anything to use from inside the nodes. Kedro is a logical glue between the nodes. That's why it can usually be easily used with any DS frameworks, anything you can call from Python.
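
The "logical glue" idea in the last point can be sketched in a few lines of plain Python. This is a simplified illustration, not Kedro's implementation: the Node tuple, run_sequentially, and the two example functions are hypothetical. Each node is an ordinary function, and the runner only passes named data from one node to the next.

```python
from typing import Callable, Dict, List, Tuple

# (function, input name, output name): a hypothetical, simplified node description.
Node = Tuple[Callable, str, str]


def clean(raw: List[str]) -> List[str]:
    """Drop empty lines and normalise case and whitespace."""
    return [line.strip().lower() for line in raw if line.strip()]


def count_words(cleaned: List[str]) -> int:
    """Count whitespace-separated words across all lines."""
    return sum(len(line.split()) for line in cleaned)


def run_sequentially(nodes: List[Node], data: Dict[str, object]) -> Dict[str, object]:
    """Run nodes in order, storing each output under its name."""
    for func, input_name, output_name in nodes:
        data[output_name] = func(data[input_name])
    return data


result = run_sequentially(
    [(clean, "raw_text", "cleaned_text"), (count_words, "cleaned_text", "word_count")],
    {"raw_text": ["  Hello World ", "", "Kedro pipelines"]},
)
print(result["word_count"])
```

Because the functions know nothing about where their inputs come from, the same nodes could be run one at a time on different machines, as long as the named data is passed between them.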

What is missing for production use:

1) A model folder (as an abstract folder) and model lifecycle management (versioning, tags).
Data and models are completely different concepts (it's like saying code is like data...)

2) Clear separation of Train and Inference by the framework:
Train: dataset --> Process --> Train --> Model Storage + Statistics
Inference: Model Load, Dataset --> Inference --> Results + Statistics

  1. I still think it's a big deal, as you are not forced to use the provided structure and keep data in one place: you can use any virtual FS or soft/hard links. It seems to be a sensible setup for most simple use-cases, without over-complicating things for everyone.
  2. Again, Kedro does not tell you how to process your data and train your models. It's completely up to you how you split your pipeline into data cleaning, training, inference, validation, whatever. With the feature we are about to release in a few days, you'll even be able to make them separate pipelines and run them individually.
  3. Do you mean that the lengthy model training produces some results periodically, and you wish to see the intermediate results? Anyway, it's a particular framework producing this, so it seems to be out of scope for Kedro, but we need to think about how to connect them nicely.
  4. Indeed it's a good idea to do that. And even better, it's already supported 😄 Check out the kedro jupyter convert command. It creates the nodes from the tagged code. It doesn't integrate these nodes into pipelines, of course, but still.

3) The idea was: since AbstractDataset and connectors were developed,
why not develop an AbstractModel to manage the model lifecycle?
An AbstractModel would be common to all kinds of machine learning cycles (esp. handling of model drift).

4) There are automatic ML tools which already normalize the code.
As the end goal seems to be to convert Jupyter notebooks to runnable code in Docker,
why not add more pragma tags to allow better conversion:
#PIPELINE: mypipeline_name

Hi @arita37, I want to see if I can create some actionable items from this issue and then close it with the appropriate tasks.

So I have a query about your 3rd point:
You've spoken about an AbstractModel, what would you like this class to do for you? And which frameworks would be best to work with this?

And to your 4th point:
We support a workflow that lets users use Jupyter Notebooks for what they're good at: exploratory data analysis and initial pipeline development. However, we do encourage users to move from Jupyter Notebooks to Python scripts, using node-tagged cells and the kedro jupyter convert command. When you refer to better conversion, what do you mean?
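
To make the tagged-cell idea concrete, the sketch below pulls the source of code cells carrying a given tag out of a notebook's JSON, using only the standard library. It is not how kedro jupyter convert is implemented; tagged_sources and the sample notebook are hypothetical, and the only assumption is the standard notebook layout where tags live under each cell's metadata.

```python
import json
from typing import List


def tagged_sources(notebook_json: str, tag: str = "node") -> List[str]:
    """Return the source of every code cell that carries the given tag."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if tag in cell.get("metadata", {}).get("tags", []):
            source = cell.get("source", "")
            # Notebook files may store source as a string or a list of lines.
            sources.append("".join(source) if isinstance(source, list) else source)
    return sources


# Hypothetical two-cell notebook: only the first cell is tagged.
sample = json.dumps({
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["node"]},
         "source": ["def clean(df):\n", "    return df.dropna()\n"]},
        {"cell_type": "code", "metadata": {}, "source": "df.head()"},
    ]
})
extracted = tagged_sources(sample)
```

A pragma such as the proposed #PIPELINE comment could be handled the same way, by scanning the extracted source for the marker, though that would be an extension and not something the thread confirms exists.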

@arita37 It would be great to get more input from you when you have time. For now, I'll close this issue but I'll be happy to re-open it when you're ready.
