Kedro: [KED-1756] PartitionedDataSet to allow Empty Paths

Created on 31 May 2020 · 3 Comments · Source: quantumblacklabs/kedro

Description

IncrementalDataSet currently supports having an empty path, so how can we also let PartitionedDataSet do the same?

Context

Currently, if I'm using a PartitionedDataSet, I'm required to put something in the partitioned path. If the path is empty, an unavoidable exception is thrown when the pipeline runs.

In my opinion, verifying whether data is available should be done in a layer other than the DataSet layer.

There are many times where data not existing is a valid case, and if we require the partitions to be non-empty, then extra complexity goes into ensuring that data exists, to satisfy the DataSet.

Possible Implementation

  1. Remove the raised exception.
  2. Add an allow_empty option to __init__.
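
To make option 2 concrete, here is a minimal sketch of what an allow_empty flag could look like. It uses a toy stand-in class rather than kedro's actual PartitionedDataSet: the names ToyPartitionedLoader, EmptyPartitionsError and the allow_empty parameter are all assumptions for illustration, and the only behaviour borrowed from kedro is that load returns a dictionary of partition ids mapped to load callables.

```python
from pathlib import Path
from typing import Callable, Dict


class EmptyPartitionsError(Exception):
    """Hypothetical stand-in for the error raised when no partitions exist."""


class ToyPartitionedLoader:
    """Toy partitioned loader for illustration; not kedro's implementation."""

    def __init__(self, path: str, allow_empty: bool = False) -> None:
        self._path = Path(path)
        self._allow_empty = allow_empty  # hypothetical flag from option 2

    def load(self) -> Dict[str, Callable[[], str]]:
        # Each file in the directory is one partition, loaded lazily.
        files = []
        if self._path.is_dir():
            files = sorted(p for p in self._path.iterdir() if p.is_file())
        partitions = {p.name: p.read_text for p in files}
        if not partitions and not self._allow_empty:
            raise EmptyPartitionsError(f"No partitions found in '{self._path}'")
        return partitions
```

With allow_empty=True an empty directory yields an empty dictionary; with the default, the loader raises as before, so existing pipelines would keep their current behaviour.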

All 3 comments

Hi @tamsanh! Thank you for the feedback. You are absolutely right: as the documentation suggests, IncrementalDataSet, unlike PartitionedDataSet, does not raise a DataSetError if load returns no partitions.

The current design assumes that a partitioned dataset points to a "static" location, where there should be some data. If no data is returned, we treat that as an error and raise rather than silently pass. An incremental data load, however, is dynamic by nature and may result in no partitions being available.

To consider suppressing this error, we would need to better understand your use case: why and when would you still want to load an empty partitioned dataset? Can you please elaborate on that? Thanks!

Hello @DmitriiDeriabinQB, thanks for taking a peek at my issue.

In terms of use cases, I have a few that I've run into:

  1. Stateful kedro pipelines. This is written up in more detail under chronocoding, see https://github.com/quantumblacklabs/kedro/issues/341. To summarize: getting the previous state of a pipeline run can sometimes be useful. A native way of doing that is to use PartitionedDataSet to load chunks of metadata related to previous runs. If I use PartitionedDataSet, I would like to be able to use it without creating fake seed data, because the pipeline then needs extra logic to filter that fake seed data out. Furthermore, if I use another method of storing those metadata chunks locally, I run a greater risk of slowness due to unnecessary IO.

  2. Pipelines that have a push source. There are cases where kedro has no control over the origin source: data is written to flat files or tables, and kedro needs to load and operate on it. This was the case in the Twitter Streaming Kedro Pipeline I wrote a few weekends ago (https://www.youtube.com/watch?v=_9DgYDEb2Ag). Using IncrementalDataSet is ideal, but the data being pushed may not have incremental keys (thus prompting the first scenario with stateful saving). Furthermore, given the streaming nature, the data is likely to be deleted once it has been operated on. Thus, an empty path could be a common occurrence, and the pipeline would often throw an error (i.e. a false alarm), requiring extra complexity to account for it. Arguably, one could use IncrementalDataSet for that case, but if the data is prohibitively large, that could be untenable.
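
As a rough illustration of the first use case, here is a sketch of a node that recovers the previous run's state from partitions and falls back to an initial state when none exist yet. It assumes the node receives a dictionary of partition ids mapped to load callables, which is what PartitionedDataSet hands to a node on load. The state schema, the INITIAL_STATE name and the assumption that partition ids sort chronologically are all hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical state schema, used only for this sketch.
INITIAL_STATE: Dict[str, Any] = {"last_run": None, "rows_seen": 0}


def latest_state(
    partitions: Dict[str, Callable[[], Dict[str, Any]]]
) -> Dict[str, Any]:
    """Return the most recent saved state, or the initial state on a first run."""
    if not partitions:
        # First run: nothing saved yet, so no fake seed data is needed.
        return dict(INITIAL_STATE)
    newest = max(partitions)  # assumes partition ids sort chronologically
    return partitions[newest]()
```

The point is that the empty case is handled in one line inside the node, which is only possible if the dataset hands over an empty dictionary instead of raising.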

Really, all I'm looking for is a natively supported generic noop. Such a construct is ubiquitous in functional programming, and since kedro emulates many of FP's paradigms, supporting it is a natural fit. IncrementalDataSet currently supports this by returning an empty dictionary, and having the same on PartitionedDataSet would be useful in the same way. Honestly though, if we can support noop in another way, I'd be happy with that, too.

Thanks Tam, as always, for a great and thorough explanation 👍

I've added a ticket to consider this given your scenarios, which both make total sense. A noop object is also something we may consider, but for different scenarios. To me, we still need to call the node with an empty partitions dictionary in case the node has some special logic implemented for this.
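
For illustration, a node with such special logic might look like the sketch below. This assumes the node is called with an empty dictionary when no partitions exist, which is the behaviour being requested here rather than what PartitionedDataSet does today. The count_words function and its data shapes are made up for the example.

```python
from typing import Callable, Dict, List


def count_words(partitions: Dict[str, Callable[[], List[str]]]) -> Dict[str, int]:
    """Return one word count per partition id."""
    if not partitions:
        # Special-case logic lives in the node, not in the dataset.
        print("No new partitions; nothing to process.")
        return {}
    return {
        partition_id: sum(len(line.split()) for line in load())
        for partition_id, load in sorted(partitions.items())
    }
```

Returning an empty dictionary means a downstream partitioned save would simply write nothing, so the empty case propagates through the pipeline without any error.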
