Datasets: [Design discussion] More flexible feature decoding / Better Keras&Hub compatibility

Created on 11 Jun 2019 · 2 Comments · Source: tensorflow/datasets

We are working on making it easier to customize feature decoding and to plug a dataset into an existing Keras/tf.Hub model.

The current proposal and open questions are available here:

https://docs.google.com/document/d/117eomgl2bIqFkwud8eYI-aIaZAvYFDy3fYm-WfCLutA/edit#heading=h.x5946vtq57b5

We encourage everyone to participate and share their thoughts on this, either by commenting on the doc or by replying to this issue.

enhancement

All 2 comments

I can see the argument for allowing customizable decoding.

Everything else seems like a lot of effort to re-implement Dataset.map and shield users from the (in my opinion, very straightforward and intuitive) tf.data interface. For me, tfds is about downloading and serializing/deserializing datasets in a manner that is convenient, consistent and optimized for training in TensorFlow. Data augmentation, on the other hand, should be tied to the model one is training. The link between the two is tf.data.

I like seeing explicit calls to map - it tells me exactly where to look to understand the transformations being applied, and if the documentation doesn't sufficiently explain the inputs I can always go to the builder to look for info (on that note, I don't particularly like the fact that load allows you to batch datasets, though the option doesn't bother me much since I always ignore it).

I don't agree that using map adds boilerplate any more than a map kwarg.

dataset = tfds.load(..., map=map_fn)

looks pretty much the same as

dataset = tfds.load(...).map(map_fn)

though looking at the documentation for map I understand I have additional freedom in terms of other args.
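To make the comparison concrete, here is a minimal sketch of the explicit-map style with one of those extra arguments (num_parallel_calls). It is only an illustration: make_dataset is a hypothetical in-memory stand-in for tfds.load(...) so that nothing has to be downloaded, and map_fn, the feature names and the shapes are assumptions rather than anything taken from a real builder.

```python
import tensorflow as tf


def make_dataset():
    # Hypothetical stand-in for `tfds.load(...)`: a tiny in-memory dataset
    # of feature dictionaries with "image" and "label" keys.
    images = tf.zeros([8, 28, 28, 1], dtype=tf.uint8)
    labels = tf.range(8, dtype=tf.int64)
    return tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})


def map_fn(example):
    # Illustrative preprocessing: scale pixels to [0, 1] and return an
    # (input, target) tuple instead of a feature dictionary.
    image = tf.cast(example["image"], tf.float32) / 255.0
    return image, example["label"]


dataset = (
    make_dataset()
    # The explicit call exposes arguments that a single `map=` kwarg would hide.
    .map(map_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(1)
)
```

Batching is also done here, in the pipeline, rather than through an option on load, so every transformation is visible in one place.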

If there's a problem with users' understanding that map is generally required, I think this should be addressed via examples. I don't see any reason why every tfds builder/config isn't accompanied by an example workbook that demonstrates basic usage and visualization (and possibly simple model training). Presumably similar scripts are being written by the contributors to manually verify the datasets are producing the intended results, so why not spend another 5 minutes cleaning those up and require them as part of a PR?

If there's a problem with a lack of all-in-one common preprocessing functions to use in map, like those in Keras, I think this should be addressed by providing those functions.

Closing this. We've improved our docs by adding end-to-end Keras examples, and added a tfds.decode.SkipDecoding API. The transformation part has less value, as the previous comment pointed out.
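For readers who have not used it, a rough sketch of how tfds.decode.SkipDecoding can be combined with an explicit map is shown below. The helper names (load_with_raw_images, decode_and_resize), the "image" and "label" feature keys, the target size, and the assumption that images are stored PNG-encoded are all illustrative; check the builder of the dataset you actually load.

```python
import tensorflow as tf


def load_with_raw_images(name, split="train"):
    # Imported lazily so that defining this helper does not itself
    # require tensorflow_datasets to be installed.
    import tensorflow_datasets as tfds

    # With SkipDecoding, the "image" feature is returned as the encoded
    # bytes (a tf.string scalar) instead of a decoded uint8 tensor.
    return tfds.load(
        name,
        split=split,
        decoders={"image": tfds.decode.SkipDecoding()},
    )


def decode_and_resize(example, size=(64, 64)):
    # Assumption: images are PNG-encoded; use tf.io.decode_jpeg for JPEG data.
    image = tf.io.decode_png(example["image"], channels=3)
    image = tf.image.resize(image, size)
    return image, example["label"]
```

Usage would then look like load_with_raw_images(...).map(decode_and_resize), which keeps the decoding step in the same explicit pipeline as any other transformation.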
