Performance deteriorates dramatically when using tfds.Split.TRAIN.subsplit(tfds.percent[...]).
For example, when loading a sub-split of ImageNet defined as tfds.Split.TRAIN.subsplit(tfds.percent[:10]), data pipeline performance drops more than 10-fold. Specifically, performance drops from ~6000 images/sec to ~450 images/sec. This is impractical, and a workaround would be to shard the data immediately after it is created by TFRecordDataset.
The purpose of this feature enhancement would be to expose shard information to the end user.
split = subsplit(tfds.shards[:4]).num_shards >= 5) when adding a new dataset.
Hey @dynamicwebpaige, can I work on this issue?
Hi @Shashankjain12, thanks so much for your interest! I believe @Conchylicultor is currently working on this, but maybe not - let's wait for him to respond.
@Conchylicultor Are you working on this issue? If not, I am interested in working on it.
Can you please assign it to me and help me get started on solving it? Thanks :)
@ParthS007 Thanks for taking care of this. Note that this is not a trivial change and requires a good understanding of how subsplits work. As a first step, I would suggest familiarising yourself with the current subsplit system:
The current SplitBase class explains at a high level how the current system works: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/splits.py#L56
The sub-splits created by users are converted into instructions in dataset_builder:
https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L637
The instructions are then resolved inside dataset_utils.build_dataset() in https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_utils.py#L81
The idea here would be to add a new tfds.shards option that could be used in a subsplit, as in tfds.Split.TRAIN.subsplit(tfds.shards[:4]). The subsplit would then be resolved into new instructions such that only a subset of the shards is read, probably by modifying this function: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L651
Just be warned that the implementation may be quite difficult to understand. Do not feel obliged to do this if it turns out to be too difficult.
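As a rough illustration of the proposal above, here is a minimal sketch of resolving a shard slice into the subset of files to read. It is not the TFDS implementation: shard_filenames and select_shards are hypothetical helpers, and the file prefix and the "-00000-of-00010" naming pattern are assumptions made for the example.

```python
from typing import List


def shard_filenames(prefix: str, num_shards: int) -> List[str]:
    """Build hypothetical shard file names such as prefix-00000-of-00010."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def select_shards(filenames: List[str], shard_slice: slice) -> List[str]:
    """Keep only the shards covered by the slice, so the others are never opened."""
    return filenames[shard_slice]


# Assumed example: a train split stored in 10 shards, of which the first 4 are kept,
# mirroring what tfds.shards[:4] is proposed to mean.
all_files = shard_filenames("imagenet2012-train.tfrecord", 10)
selected = select_shards(all_files, slice(None, 4))
```

Because the selection happens on the file list, no records from the unselected shards need to be read and discarded, which is where a percent-based subsplit can lose throughput.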
@Conchylicultor I am looking into it and going through the code. I will comment here if I have any queries. Thanks for the help :)
Note that I have a big change coming up on this soon, to make things simpler and faster.
Once this change is in, there will still be many parameters that can be tweaked, in case you want to experiment with those. In the meantime, I would advise you to wait for those changes to land (I'll update this issue).
@pierrot0 Is #650 the beginning of the big change you mentioned? I'd love to test this experiment on imagenet:4.0.0; are there docs/examples of how to use this new functionality for sharding a dataset across multiple machines? Thanks for your work on this!
If you're using the nightly version, you could try tfds.load('imagenet:4.0.0') and it should generate/select the last version. The changes are mostly under the hood.
The new sub-split API uses strings instead:
tfds.load('imagenet:4.0.0', split='train[40%:]')
We're going to add documentation as it gets deployed on more datasets.
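To make the meaning of a string such as 'train[40%:]' concrete, here is a small sketch that turns a split string into absolute example bounds. It only illustrates the slicing semantics: resolve_split is a hypothetical helper, the example count of 1000 is made up, and the rounding TFDS actually applies may differ from the integer division assumed here.

```python
import re
from typing import Optional, Tuple

# Matches strings like "train[40%:]", "train[:10%]" or "train[100:200]".
_SPLIT_RE = re.compile(r"^(?P<name>\w+)\[(?P<start>\d+%?)?:(?P<stop>\d+%?)?\]$")


def _to_index(bound: Optional[str], num_examples: int, default: int) -> int:
    """Convert an optional bound ("40%" or "100") into an absolute index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        # Assumption: percentages are rounded down to a whole example.
        value = int(bound[:-1]) * num_examples // 100
    else:
        value = int(bound)
    return min(value, num_examples)


def resolve_split(spec: str, num_examples: int) -> Tuple[str, int, int]:
    """Return (split name, start index, stop index) for a split string."""
    match = _SPLIT_RE.match(spec)
    if match is None:
        raise ValueError(f"Unrecognised split string: {spec!r}")
    start = _to_index(match.group("start"), num_examples, 0)
    stop = _to_index(match.group("stop"), num_examples, num_examples)
    return match.group("name"), start, stop


# With a hypothetical split of 1000 examples, 'train[40%:]' covers the last 60%.
name, start, stop = resolve_split("train[40%:]", 1000)
```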
@Conchylicultor thank you! I'll give this a try on nightly + looking forward to the release. :-)
Be aware that this is still considered experimental.
The hashing algorithm used to shuffle data is most probably going to be changed (cf. issue #653), which will change the order of the data (and the set of examples returned when using the slicing API).
There is no official documentation for this API yet, but https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/tfrecords_reader_test.py#L160 and the following lines can help you figure things out.
Feedback is appreciated, thanks!
Closing this issue: the new subsplit API has been out for a while and is about to become the default everywhere.