I am moving my image segmentation projects to Kedro, but Kedro does not support this data type. Could you please suggest how I can incorporate this kind of data in catalog.yml?
FYI, my data is organised in the folders below:
train
test
validate
Thanks in advance.
Hi @parulML, the easiest approach would be to create a custom dataset (guide provided in the link). That way you can use whatever approach you are already using to load and save images.
I think there are many ways to read and interpret images for learning, so it would be interesting to learn what you are using now and how it could contribute to Kedro.
Hi @parulML @Pet3ris, thanks for the amazing package! I have a similar question to @parulML's and I am still lost even after reading the guide. I would appreciate it if anyone could give a rough idea of how to write the function. Thanks in advance!
@penguinpompom if the Excel example in the tutorial is confusing, you could look at the AbstractDataSet implementation in core.py.
I've removed all the parts you can ignore:
import abc
import logging
from typing import Any, Dict


class AbstractDataSet(abc.ABC):
    """``AbstractDataSet`` is the base class for all data set implementations.
    All data set implementations should extend this abstract class
    and implement the methods marked as abstract.

    Example:
    ::

        >>> from kedro.io import AbstractDataSet
        >>> import pandas as pd
        >>>
        >>> class MyOwnDataSet(AbstractDataSet):
        >>>     def __init__(self, param1, param2):
        >>>         self._param1 = param1
        >>>         self._param2 = param2
        >>>
        >>>     def _load(self) -> pd.DataFrame:
        >>>         print("Dummy load: {}".format(self._param1))
        >>>         return pd.DataFrame()
        >>>
        >>>     def _save(self, df: pd.DataFrame) -> None:
        >>>         print("Dummy save: {}".format(self._param2))
        >>>
        >>>     def _describe(self):
        >>>         return dict(param1=self._param1, param2=self._param2)
    """

    # Must be implemented: return the data held by this data set.
    @abc.abstractmethod
    def _load(self) -> Any:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and"
            "it must implement the `_load` method".format(self.__class__.__name__)
        )

    # Must be implemented: persist the given data.
    @abc.abstractmethod
    def _save(self, data: Any) -> None:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and"
            "it must implement the `_save` method".format(self.__class__.__name__)
        )

    # Must be implemented: return the arguments that describe this data set.
    @abc.abstractmethod
    def _describe(self) -> Dict[str, Any]:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and"
            "it must implement the `_describe` method".format(self.__class__.__name__)
        )

    # Optional: override to report whether the data already exists.
    def _exists(self) -> bool:
        logging.getLogger(__name__).warning(
            "`exists()` not implemented for `%s`. Assuming output does not exist.",
            self.__class__.__name__,
        )
        return False
All you need to do is make a new class that inherits from AbstractDataSet and implements all the abstract methods (_load, _save and _describe); _exists is optional. You get to specify your own arguments, since Kedro leans on I/O routines available in other libraries and a data set is really just a wrapper around that functionality.
A simple example is the pickle local example: here.
Some tips as you do that:
__init__ function
load uses load_args and save uses save_args
If it is still confusing, don't hesitate to reach out.
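For anyone who wants something concrete, here is a minimal sketch of what such a class could look like for one folder of images (e.g. train). Everything specific in it is an assumption made for illustration: the name ImageFolderDataSet, the "pattern" and "overwrite" options, and the choice to return raw bytes instead of decoded images. The fallback base class is only a stand-in so the snippet runs without Kedro installed; it is not Kedro's real implementation.

```python
import abc
from pathlib import Path
from typing import Any, Dict, Optional

try:
    from kedro.io import AbstractDataSet
except ImportError:
    # Stand-in exposing the same abstract methods, so the sketch runs
    # without Kedro installed. NOT Kedro's real implementation.
    class AbstractDataSet(abc.ABC):
        @abc.abstractmethod
        def _load(self) -> Any: ...

        @abc.abstractmethod
        def _save(self, data: Any) -> None: ...

        @abc.abstractmethod
        def _describe(self) -> Dict[str, Any]: ...

        def load(self) -> Any:
            return self._load()

        def save(self, data: Any) -> None:
            self._save(data)


class ImageFolderDataSet(AbstractDataSet):
    """Hypothetical data set: a folder of image files <-> dict of raw bytes."""

    def __init__(
        self,
        path: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ):
        self._path = Path(path)
        # "pattern" (assumed option): glob selecting the files to read.
        self._load_args = {"pattern": "*.png", **(load_args or {})}
        # "overwrite" (assumed option): may existing files be replaced?
        self._save_args = {"overwrite": True, **(save_args or {})}

    def _load(self) -> Dict[str, bytes]:
        # Swap read_bytes() for whatever decoder you already use.
        files = sorted(self._path.glob(self._load_args["pattern"]))
        return {f.name: f.read_bytes() for f in files}

    def _save(self, data: Dict[str, bytes]) -> None:
        self._path.mkdir(parents=True, exist_ok=True)
        for name, payload in data.items():
            target = self._path / name
            if target.exists() and not self._save_args["overwrite"]:
                raise FileExistsError(str(target))
            target.write_bytes(payload)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            path=str(self._path),
            load_args=self._load_args,
            save_args=self._save_args,
        )

    def _exists(self) -> bool:
        return self._path.is_dir()
```

With a class like this in your project, a catalog.yml entry should be able to point at it by setting type to the full import path of the class, with the remaining keys (path, load_args, save_args) passed to __init__. You would then add one entry each for train, test and validate.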
@Pet3ris Got it thanks!
@parulML @penguinpompom I will close this issue as answered. Feel free to re-open if you still have trouble with this answer. Thank you!
The issue I still see here is how to handle a dataset where individual rows should only be loaded from disk as they are used. Where do I need to build in this logic? I can imagine having a JSON dataset and then letting the pipeline functions open the data from a filesystem path, but where does the actual data go in that case? How is the actual image data stored and handled by the data catalog?
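One possible way to get that behaviour, sketched under assumptions rather than taken from Kedro itself: have the custom data set's _load return a dict of zero-argument load functions instead of the images, so the catalog only ever holds file names and callables, and each file is read inside the node when it is actually needed. The helper names lazy_image_loaders and iter_images are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, Tuple


def lazy_image_loaders(
    folder: str, pattern: str = "*.png"
) -> Dict[str, Callable[[], bytes]]:
    """What a custom `_load` could return: file name -> function reading it.

    Nothing is read from disk here; only the paths are collected.
    """
    return {f.name: f.read_bytes for f in sorted(Path(folder).glob(pattern))}


def iter_images(
    loaders: Dict[str, Callable[[], bytes]]
) -> Iterator[Tuple[str, bytes]]:
    """Node-side helper: read each file only when the loop reaches it."""
    for name, load in loaders.items():
        yield name, load()
```

A node would then receive the dict of loaders as its input and iterate with iter_images, so at most one image's bytes need to be in memory at a time.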