We usually have to deal with very large text datasets (i.e. >100 GB),
so storing them in the /data/ folder is not possible.
Hi @arita37
Absolute paths are already supported by all filepath based AbstractDataSets.
Please have a look here to see how to define such datasets in the DataCatalog: make sure to change filepath from a relative path (data/...) to an absolute one (/data).
Working with big datasets typically involves cloud solutions. For that case, Kedro also supports many cloud-based datasets (see here and here).
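As a rough illustration of the advice above, the sketch below rewrites relative filepath values in catalog-style entries to absolute ones. The entry names, the /mnt/bigdisk root and the helper absolutize are hypothetical, and the dataset type strings are used only as labels. The dictionary merely mirrors the shape of a catalog entry (type plus filepath); nothing is loaded by Kedro here.

```python
from pathlib import PurePosixPath

# Hypothetical catalog entries, shaped like catalog.yml (type + filepath).
catalog = {
    "reviews": {"type": "CSVLocalDataSet", "filepath": "data/01_raw/reviews.csv"},
    "model": {"type": "PickleLocalDataSet", "filepath": "/mnt/bigdisk/models/model.pkl"},
}


def absolutize(entries, root):
    """Return a copy of entries with relative filepaths placed under root."""
    result = {}
    for name, entry in entries.items():
        path = PurePosixPath(entry["filepath"])
        if not path.is_absolute():
            # Relative paths (data/...) are moved under the external root.
            path = PurePosixPath(root) / path
        result[name] = {**entry, "filepath": str(path)}
    return result


# /mnt/bigdisk is an assumed mount point for the large datasets.
resolved = absolutize(catalog, "/mnt/bigdisk")
for name, entry in resolved.items():
    print(name, "->", entry["filepath"])
```

Entries that already use an absolute path are left untouched, so the same helper can be applied to a mixed catalog.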
Hello,
Thanks for the reply and details, this is useful.
This is more of a general design question (not just about the path).
Let me clarify my point:
1) Separation of source code (versionable in git) from data (not versionable in git).
Having the dataset mixed with the code is not always good practice for software development.
That's why I propose treating data/ as a folder of "abstract datasets" (i.e. represented by .yml files).
Hello @arita37
Thank you for the feedback! Let me address the comments.
The data directory is listed in the .gitignore file, which is why its files are ignored by default.
kedro_cli.py is a file in every created project and is easily editable. All the commands added there become available via kedro COMMAND ....
But none of the points above seem to relate to the original question: an abstract /data/ directory.
If you have massive files and complex ways of accessing them, you have a few options:
Does it help somehow? It would be nice to know more about your use-cases and requirements.
Hello,
Thanks for replying.
Question:
We are evaluating Kedro against MLflow,
which is a pipeline for ML. That's why having the data folder (as in the doc example) together with the code looked strange from a production point of view.
1) It is not clear what happens when the project is packaged in Docker:
which folders are kept and which are removed?
2) Model storage and serialization are not clear.
The same goes for pre-processing steps with state (e.g. clustering).
How is this handled? Or do we have to do it manually?
In theory, there should be a dedicated folder for model storage (i.e. in the same style as data).
Hi @arita37, I'll leave @Flid to handle the bulk of this conversation but I'll just make a few comments about MLflow.
data is not copied into a Docker image.
data/06_models is supposed to store the models, but we don't enforce it, we just recommend it. How you serialise the model is completely up to you. Kedro has a few data sets, like PickleLocalDataSet, that might be useful for general-case scenarios.
Thanks for your prompt reply.
1) Noted.
However, from a software engineering perspective,
data/ should be abstract (i.e. virtual).
Suppose the code references
some CSV file in the data/ folder: it runs fine locally.
But when it is transferred to Docker, it fails
because data/ is not transferred.
The framework should try to enforce better practice (at least by showing it in the docs).
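One possible way to keep the same code working both locally and in a container is to resolve the data location at run time instead of hard-coding data/. The sketch below is plain Python, not a Kedro feature; the DATA_ROOT variable name and both helper functions are assumptions made for illustration.

```python
import os
from pathlib import Path


def data_root(env=None):
    """Return the data root: DATA_ROOT if set, otherwise the local data/ folder."""
    env = os.environ if env is None else env
    # DATA_ROOT is a hypothetical variable, e.g. pointing at a mounted volume.
    return Path(env.get("DATA_ROOT", "data"))


def dataset_path(relative, env=None):
    """Build a dataset path under the data root and fail early if it is missing."""
    path = data_root(env) / relative
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; mount the data volume or set DATA_ROOT"
        )
    return path


# Locally this falls back to data/; in a container DATA_ROOT would be set
# to wherever the data volume is mounted.
print(data_root({}))
print(data_root({"DATA_ROOT": "/mnt/data"}))
```

Failing early with an explicit message makes the missing data/ folder visible at start-up rather than in the middle of a pipeline run.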
data directory inside.
What is missing for production use:
1) A model folder (as an abstract folder) and model lifecycle management (versioning, tags).
Data and models are completely different concepts (otherwise it is like saying that code is data).
2) Clear separation of training and inference by the framework:
Train: Dataset --> Process --> Train --> Model Storage + Statistics
Inference: Model Load, Dataset --> Inference --> Results + Statistics
3) The idea was: since AbstractDataset and its connectors were developed,
why not develop an AbstractModel to manage the model lifecycle?
AbstractModel would be common to all kinds of machine learning cycles (especially for handling model drift).
4) There are automatic ML tools that already normalize the code.
The end goal seems to be converting Jupyter notebooks to runnable code in Docker,
so why not add more pragma tags to allow better conversion:
#PIPELINE: mypipeline_name
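To make points 1) to 3) more concrete, here is a minimal sketch of what such an AbstractModel could look like, with versioned storage and separate train and inference steps. This class does not exist in Kedro; the interface, the MeanModel toy model, the model.pkl file name and the folder layout are all hypothetical. Only the model's attributes are pickled, which is a simplification.

```python
import pickle
from abc import ABC, abstractmethod
from pathlib import Path


class AbstractModel(ABC):
    """Hypothetical interface for a model lifecycle; not a Kedro class."""

    @abstractmethod
    def fit(self, data):
        """Train the model and return self."""

    @abstractmethod
    def predict(self, data):
        """Return one prediction per input row."""

    def save(self, folder, version):
        """Store the model state under folder/version/model.pkl."""
        path = Path(folder) / version / "model.pkl"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(self.__dict__, f)
        return path

    @classmethod
    def load(cls, folder, version):
        """Rebuild a model of this class from a stored version."""
        model = cls.__new__(cls)
        with (Path(folder) / version / "model.pkl").open("rb") as f:
            model.__dict__.update(pickle.load(f))
        return model


class MeanModel(AbstractModel):
    """Toy model that always predicts the mean of the training values."""

    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def predict(self, data):
        return [self.mean for _ in data]


def train(data, folder, version):
    """Train: Dataset --> Train --> Model Storage + Statistics."""
    model = MeanModel().fit(data)
    model.save(folder, version)
    return {"n_rows": len(data), "mean": model.mean}


def infer(data, folder, version):
    """Inference: Model Load, Dataset --> Inference --> Results."""
    model = MeanModel.load(folder, version)
    return model.predict(data)
```

In a Kedro project the folder argument could point at data/06_models, the location recommended earlier in the thread, with one sub-folder per version.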
Hi @arita37, I want to see if I can create some actionable items from this issue and then close it with the appropriate tasks.
So I have a query about your 3rd point:
You've spoken about an AbstractModel, what would you like this class to do for you? And which frameworks would be best to work with this?
And to your 4th point:
We support a workflow that lets users use Jupyter Notebooks for what they're good at, exploratory data analysis and initial pipeline development, but we encourage users to move from Jupyter Notebooks to Python scripts using node-tagged cells and the kedro jupyter convert command. When you refer to better conversion, what do you mean?
@arita37 It would be great to get more input from you when you have time. For now, I'll close this issue but I'll be happy to re-open it when you're ready.