Kedro: How to apply a pipeline to each of the partitions before merging them together?

Created on 23 Dec 2019 · 5 Comments · Source: quantumblacklabs/kedro

Hi, thank you for this great framework! I'm new to Kedro, but I've found it really intuitive and easy to use. Sorry for raising the question here. Is there a room or channel on Gitter, Telegram, or anywhere else where users can ask about Kedro usage? I didn't find one. If there is one, please let me know. It would be great to communicate with others.

I have read the documentation and found PartitionedDataSet, but I didn't find a way to apply an existing pipeline to each of the partitions (hundreds of CSV files in the same directory) before concatenating them together. Is there a way to accomplish this?

et2010

Most helpful comment

Hey @Pet3ris and @et2010! Happy new year to you both. We're working on alternative channels for discussing Kedro issues because we realise that Stack Overflow is not enough. We'll keep you updated on developments here. At the moment, this is going through an internal risk assessment.

yetudada on 9 Jan 2020

👍2

All 5 comments

Hi @et2010! Thanks very much for your question. PartitionedDataSet does not support your particular scenario at the moment since it behaves as a regular dataset from the pipeline execution perspective. You can, however, pass a dictionary that maps partition id to the data from one node to another. So for example, PartitionedDataSet -> node1 -> dict (using MemoryDataSet) -> node2 -> ...
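To make the PartitionedDataSet -> node1 -> dict -> node2 chain above more concrete, here is a minimal sketch of the two node functions in plain Python. It assumes node1 receives a dictionary mapping each partition id to a load function, as PartitionedDataSet hands partitions over lazily. The names `clean_partition`, `process_partitions`, `merge_partitions`, and the `value` column are hypothetical, and rows are modelled as plain dicts rather than DataFrames to keep the example self-contained. Note that this applies a function to each partition, not an existing Pipeline object.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Partition = List[Row]


def clean_partition(rows: Partition) -> Partition:
    """Hypothetical per-partition step: drop rows whose 'value' is missing."""
    return [row for row in rows if row.get("value") is not None]


def process_partitions(
    partitions: Dict[str, Callable[[], Partition]]
) -> Dict[str, Partition]:
    """node1: load each partition and apply the per-partition step.

    Assumes the input maps partition id -> load function.
    """
    return {
        partition_id: clean_partition(load())
        for partition_id, load in partitions.items()
    }


def merge_partitions(processed: Dict[str, Partition]) -> Partition:
    """node2: concatenate the processed partitions, ordered by partition id."""
    merged: Partition = []
    for partition_id in sorted(processed):
        merged.extend(processed[partition_id])
    return merged


# Stand-in for what a partitioned dataset would pass to node1 (made-up data).
fake_partitions = {
    "b.csv": lambda: [{"value": 3}, {"value": None}],
    "a.csv": lambda: [{"value": 1}, {"value": 2}],
}

processed = process_partitions(fake_partitions)
merged = merge_partitions(processed)
```

In a Kedro project, these two functions would presumably be wrapped with `node(...)` and chained through an intermediate dataset name that is left out of the catalog, so that it falls back to a MemoryDataSet; the dataset names used for that wiring would be up to you.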

Your use case, however, is quite interesting and may benefit others too, so we will consider this as the future extension of PartitionedDataSet functionality.

Re official channel: you can ask such questions here. However, since we are trying to grow our Stack Overflow community, SO is probably a better place for questions like that.