Kedro: Creating custom dataset for image segmentation.

Created on 9 Jul 2019 · 6 Comments · Source: quantumblacklabs/kedro

I am moving my image segmentation projects to Kedro, but Kedro does not support this data type. Could you please suggest how I can incorporate this kind of data in catalog.yml?
FYI, my data is in the format below:
train
    image1
    image2
    image3

test
    image1
    image2
    image3

validate
    image1
    image2
    image3

Thanks in advance.

Feature Request Help Wanted Contrib good first issue

parulML

👍2

Most helpful comment

Hi @parulML, the easiest approach would be to create a custom dataset (guide provided in the link). That way you can use whatever approach you are already using to load and save images.

I think there are many ways to read and interpret images for learning, so it would be interesting to learn what you are using now and how it could contribute to Kedro.

Pet3ris on 15 Jul 2019

👍2

All 6 comments

Hi @parulML @Pet3ris, thanks for the amazing package! I have a similar question to @parulML's and I am still lost after reading the guide. I would appreciate it if anyone could give a rough idea of how to write the function. Thanks in advance!

penguinpompom on 29 Jul 2019

@penguinpompom if the Excel example in the tutorial is confusing, you could look at the AbstractDataSet implementation in core.py.

I've removed all the parts you can ignore:

import abc
import logging
from typing import Any, Dict


class AbstractDataSet(abc.ABC):
    """``AbstractDataSet`` is the base class for all data set implementations.
    All data set implementations should extend this abstract class
    and implement the methods marked as abstract.
    Example:
    ::
        >>> from kedro.io import AbstractDataSet
        >>> import pandas as pd
        >>>
        >>> class MyOwnDataSet(AbstractDataSet):
        >>>     def __init__(self, param1, param2):
        >>>         self._param1 = param1
        >>>         self._param2 = param2
        >>>
        >>>     def _load(self) -> pd.DataFrame:
        >>>         print("Dummy load: {}".format(self._param1))
        >>>         return pd.DataFrame()
        >>>
        >>>     def _save(self, df: pd.DataFrame) -> None:
        >>>         print("Dummy save: {}".format(self._param2))
        >>>
        >>>     def _describe(self):
        >>>         return dict(param1=self._param1, param2=self._param2)
    """

    @abc.abstractmethod
    def _load(self) -> Any:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and "
            "it must implement the `_load` method".format(self.__class__.__name__)
        )

    @abc.abstractmethod
    def _save(self, data: Any) -> None:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and "
            "it must implement the `_save` method".format(self.__class__.__name__)
        )

    @abc.abstractmethod
    def _describe(self) -> Dict[str, Any]:
        raise NotImplementedError(
            "`{}` is a subclass of AbstractDataSet and "
            "it must implement the `_describe` method".format(self.__class__.__name__)
        )

    def _exists(self) -> bool:
        logging.getLogger(__name__).warning(
            "`exists()` not implemented for `%s`. Assuming output does not exist.",
            self.__class__.__name__,
        )
        return False

All you need to do is make a new class that inherits from AbstractDataSet and implements all the abstract methods (_load, _save and _describe); _exists is optional. You get to specify your own arguments, since Kedro leans on the I/O routines available in other libraries and a data set is really just a wrapper for that functionality.
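
As a rough sketch of the advice above (not code from the thread), a data set for one of the image folders could look something like the following. The class name ImageFolderDataSet, its folder and pattern parameters, and the choice to return raw bytes are all assumptions made for illustration; replace read_bytes() with whatever image reader you already use. The try/except is only there so the sketch still runs where that Kedro import is unavailable.

```python
from pathlib import Path
from typing import Any, Dict

try:
    # Import path as shown in the AbstractDataSet docstring quoted above.
    from kedro.io import AbstractDataSet
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without that import.
    class AbstractDataSet:
        pass


class ImageFolderDataSet(AbstractDataSet):
    """Hypothetical data set: every file in one folder, keyed by file name."""

    def __init__(self, folder: str, pattern: str = "*") -> None:
        self._folder = Path(folder)
        self._pattern = pattern

    def _load(self) -> Dict[str, bytes]:
        # Reads each matching file in full; raw bytes stand in for decoded images.
        return {
            path.name: path.read_bytes()
            for path in sorted(self._folder.glob(self._pattern))
            if path.is_file()
        }

    def _save(self, data: Dict[str, bytes]) -> None:
        # Creates the folder if needed and writes one file per dictionary entry.
        self._folder.mkdir(parents=True, exist_ok=True)
        for name, content in data.items():
            (self._folder / name).write_bytes(content)

    def _describe(self) -> Dict[str, Any]:
        return dict(folder=str(self._folder), pattern=self._pattern)

    def _exists(self) -> bool:
        # Optional method: here the data set "exists" if its folder does.
        return self._folder.is_dir()
```

With the layout from the original question, one such data set per folder (train, test, validate) would be one way to split things up; a node would then receive a dictionary keyed by file name.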

A simple example is the pickle local example: here.

Some tips as you do that:

- Make sure you use the same arguments for the __init__ function.
- Make sure load uses load_args and save uses save_args.
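
The two tips above can be sketched as follows. This reads the first tip as "the keys of a catalog.yml entry are passed to __init__ as keyword arguments, so the names have to match", which is an interpretation rather than something the comment spells out. SketchJSONDataSet is a hypothetical class that wraps the standard-library json module purely to show load_args and save_args being forwarded to the underlying I/O calls; it is not an image loader and not a Kedro class.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

try:
    # Import path as shown in the AbstractDataSet docstring quoted above.
    from kedro.io import AbstractDataSet
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without that import.
    class AbstractDataSet:
        pass


class SketchJSONDataSet(AbstractDataSet):
    """Hypothetical wrapper around json.load / json.dump."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        # Copy the dictionaries so later changes by the caller have no effect.
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def _load(self) -> Any:
        # load_args are forwarded unchanged to the underlying reader.
        with self._filepath.open() as handle:
            return json.load(handle, **self._load_args)

    def _save(self, data: Any) -> None:
        # save_args are forwarded unchanged to the underlying writer.
        with self._filepath.open("w") as handle:
            json.dump(data, handle, **self._save_args)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=str(self._filepath),
            load_args=self._load_args,
            save_args=self._save_args,
        )
```

Under that reading, an entry with the keys filepath, load_args and save_args would be equivalent to calling SketchJSONDataSet(filepath=..., load_args={...}, save_args={...}) directly.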

If it is still confusing, don't hesitate to reach out.

Pet3ris on 29 Jul 2019

👍1

@Pet3ris Got it thanks!

penguinpompom on 30 Jul 2019

@parulML @penguinpompom I will close this issue as answered. Feel free to re-open if you still have trouble with this answer. Thank you!

lorenabalan on 7 Aug 2019

The issue I still see here is how to handle a dataset where individual rows should only be loaded from disk as they are used. Where do I need to build in this logic? I can imagine having a JSON dataset and then letting the pipeline functions open the data from a filesystem path, but where does the actual data go in that case? How is the actual image data stored and handled by the data catalog?
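
One possible shape for this, offered as a sketch and not as an answer given in the thread: let _load return a dictionary of zero-argument loader functions instead of the image data itself, so that a file is only read when a node calls its loader. The class name LazyImageFolderDataSet and its folder parameter are assumptions, and raw bytes again stand in for decoded images.

```python
from pathlib import Path
from typing import Any, Callable, Dict

try:
    # Import path as shown in the AbstractDataSet docstring quoted earlier.
    from kedro.io import AbstractDataSet
except ImportError:
    # Stand-in base class (assumption) so the sketch runs without that import.
    class AbstractDataSet:
        pass


class LazyImageFolderDataSet(AbstractDataSet):
    """Hypothetical data set that defers reading each file until requested."""

    def __init__(self, folder: str) -> None:
        self._folder = Path(folder)

    def _load(self) -> Dict[str, Callable[[], bytes]]:
        # Only the directory is listed here. Each value is the bound method
        # path.read_bytes, which reads the file when it is eventually called.
        return {
            path.name: path.read_bytes
            for path in sorted(self._folder.iterdir())
            if path.is_file()
        }

    def _save(self, data: Dict[str, bytes]) -> None:
        # Saving is eager in this sketch: one file per dictionary entry.
        self._folder.mkdir(parents=True, exist_ok=True)
        for name, content in data.items():
            (self._folder / name).write_bytes(content)

    def _describe(self) -> Dict[str, Any]:
        return dict(folder=str(self._folder))
```

In this arrangement the catalog only ever hands over file names and loaders; a node would iterate over the dictionary and call each loader as it needs that image, so the image data lives in memory only for as long as the node keeps a reference to it.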