Datasets: Shard subsplit API implementation for TensorFlow Datasets (TFDS).

Created on 3 Mar 2019 · 11 Comments · Source: tensorflow/datasets

Performance deteriorates dramatically when using tfds.Split.TRAIN.subsplit(tfds.percent[...]).

For example, when loading a sub-split of ImageNet defined as tfds.Split.TRAIN.subsplit(tfds.percent[:10]), data pipeline performance drops more than 10-fold. Specifically, performance drops from ~6000 images/sec to ~450 images/sec. This is impractical, and a workaround would be to shard the data immediately after it is created by TFRecordDataset.
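To make the slowdown concrete, here is a minimal pure-Python sketch. It assumes (the issue only implies this through the proposed workaround) that a percent-based sub-split still reads every record from every shard and discards those outside the requested range, whereas a shard-level selection opens only some of the files. The shard layout, the `id % 100` bucketing, and all function names are hypothetical and are not TFDS code.

```python
from typing import Dict, List, Tuple


def make_shards(num_shards: int, records_per_shard: int) -> Dict[str, List[int]]:
    """Builds a fake dataset: shard name -> list of record ids (hypothetical layout)."""
    return {
        f"shard-{i:05d}": list(range(i * records_per_shard, (i + 1) * records_per_shard))
        for i in range(num_shards)
    }


def read_percent_slice(shards: Dict[str, List[int]], percent_end: int) -> Tuple[List[int], int]:
    """Reads every record and keeps those whose bucket (id % 100) is below percent_end."""
    kept, records_read = [], 0
    for records in shards.values():
        for record in records:
            records_read += 1  # every record is read, even if it is dropped
            if record % 100 < percent_end:
                kept.append(record)
    return kept, records_read


def read_shard_slice(shards: Dict[str, List[int]], shard_slice: slice) -> Tuple[List[int], int]:
    """Reads only the shards selected by shard_slice; other files are never opened."""
    kept, records_read = [], 0
    for name in sorted(shards)[shard_slice]:
        kept.extend(shards[name])
        records_read += len(shards[name])
    return kept, records_read


fake_shards = make_shards(num_shards=10, records_per_shard=100)
percent_kept, percent_read = read_percent_slice(fake_shards, percent_end=10)
shard_kept, shard_read = read_shard_slice(fake_shards, slice(0, 1))
print(len(percent_kept), percent_read)  # 100 1000
print(len(shard_kept), shard_read)      # 100 100
```

Both approaches return 10% of the data, but in this toy model the percent-based path reads ten times as many records, which is the same order of magnitude as the drop reported above.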

The purpose of this feature enhancement would be to expose shard information to the end user.

For the API: something similar to split = subsplit(tfds.shards[:4]).
Include the number of shards in the dataset documentation.
The current sub-split API could be kept for users who want fine-grained control over splits, or for splits which only have one shard (e.g., test and validation).
Enforce a soft constraint on the number of shards (e.g., check that num_shards >= 5) when adding a new dataset.
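The proposal above could look roughly like the following sketch. `shards` here is a hypothetical stand-in for the proposed `tfds.shards` (which does not exist in the issue as working code), and `check_num_shards` is an illustrative name for the soft constraint. The check returns warnings instead of raising, on the assumption that "soft" means non-fatal.

```python
from typing import List

MIN_SHARDS = 5  # threshold taken from the proposal above


class _SliceCapture:
    """Returns the slice it is indexed with, e.g. shards[:4] -> slice(None, 4, None)."""

    def __getitem__(self, item: slice) -> slice:
        if not isinstance(item, slice):
            raise TypeError("index with a slice, e.g. shards[:4]")
        return item


# Hypothetical stand-in for the proposed tfds.shards object.
shards = _SliceCapture()


def check_num_shards(num_shards: int, min_shards: int = MIN_SHARDS) -> List[str]:
    """Soft constraint: returns warning messages rather than raising an error."""
    if num_shards < min_shards:
        return [f"dataset has {num_shards} shard(s); at least {min_shards} recommended"]
    return []


first_four = shards[:4]
print(first_four)           # slice(None, 4, None)
print(check_num_shards(1))  # one warning
print(check_num_shards(8))  # []
```

Capturing a plain `slice` keeps the user-facing syntax identical to Python list slicing, so the selection can later be applied directly to a list of shard files.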

enhancement

dynamicwebpaige

Most helpful comment

Note that I have a big change coming up on this soon, to make things simpler and faster.
Once this change is in, there are still many params which can be tweaked, in case you want to experiment with those. In the meantime, I would advise you to wait for those changes to be in (I'll update this issue).

pierrot0 on 20 May 2019

👍2

All 11 comments

Hey @dynamicwebpaige, can I work on this issue?

Shashankjain12 on 3 Mar 2019

Hi @Shashankjain12, thanks so much for your interest! I believe @Conchylicultor is currently working on this, but maybe not - let's wait for him to respond.

rsepassi on 4 Mar 2019

@Conchylicultor Are you working on this issue? If not, I am interested in working on it.
Can you please assign it to me and help me get started on solving it? Thanks :)

ParthS007 on 22 Mar 2019

@ParthS007 Thanks for taking care of this. Note that this is not a trivial change and requires a good understanding of how subsplits work. As a first step, I would suggest familiarising yourself with the current subsplit system:

The current SplitBase class explains at a high level how the current system works: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/splits.py#L56

The sub-splits created by users are converted into instructions in dataset_builder:
https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L637

The instructions are then resolved inside dataset_utils.build_dataset(): https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_utils.py#L81

The idea here would be to add a new tfds.shards option which could be used in a subsplit: tfds.Split.TRAIN.subsplit(tfds.shards[:4]). The subsplit would then be resolved into new instructions such that only a subset of the shards is read, probably by modifying this function: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L651
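As a rough illustration of the resolution step described above, the sketch below turns a shard slice into per-file read instructions so that unselected shards are never opened. It is not the code in dataset_builder.py: the function names, the dictionary format of an "instruction", and the `prefix-00000-of-00008` file naming are all assumptions made for the example.

```python
from typing import Dict, List


def shard_filenames(prefix: str, num_shards: int) -> List[str]:
    """Lists shard files using an assumed 'prefix-00000-of-00008' naming scheme."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def resolve_shard_subsplit(filenames: List[str], shard_slice: slice) -> List[Dict[str, str]]:
    """Resolves a shard slice into read instructions, one per selected file.

    Shards outside the slice produce no instruction, so they are never read.
    """
    return [{"filepath": name} for name in filenames[shard_slice]]


files = shard_filenames("imagenet2012-train.tfrecord", num_shards=8)
instructions = resolve_shard_subsplit(files, slice(None, 4))  # i.e. shards[:4]
for instruction in instructions:
    print(instruction["filepath"])
```

With 8 shards and a `[:4]` slice, this prints the first four file names only. The list of selected files could then be handed to a file-based reader instead of filtering records after they have been read.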

Just be warned that the implementation may be quite difficult to understand. Do not feel obliged to do this if it turns out to be too difficult.

Conchylicultor on 22 Mar 2019

@Conchylicultor I am looking into it and going through the code. I will comment here if I have any queries. Thanks for the help :)

ParthS007 on 24 Mar 2019

@pierrot0 Is #650 the beginning of the big change you mentioned? I'd love to test this experiment on imagenet:4.0.0; are there docs/examples of how to use this new functionality for sharding a dataset across multiple machines? Thanks for your work on this!

ludwigschubert on 19 Jun 2019

If you're using the nightly version, you could try tfds.load('imagenet:4.0.0') and it should generate/select the latest version. The changes are mostly under the hood.

The new sub-split API uses strings instead:
tfds.load('imagenet:4.0.0', split='train[40%:]')
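For readers unfamiliar with the string form, the following sketch shows one way a spec such as 'train[40%:]' can be mapped to an absolute range of example indices. It is an illustrative parser, not the one TFDS uses: the regular expression, the function names, and the choice to round percentages down are all assumptions.

```python
import re
from typing import Optional, Tuple

# Matches "name" or "name[start:stop]", where each bound is optional and may end in "%".
_SPEC = re.compile(r"^(?P<name>\w+)(\[(?P<start>-?\d+%?)?:(?P<stop>-?\d+%?)?\])?$")


def _to_index(bound: Optional[str], num_examples: int, default: int) -> int:
    """Converts one bound ('40%', '100', '-5', or None) into an absolute index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        value = int(bound[:-1]) * num_examples // 100  # rounds down (assumption)
    else:
        value = int(bound)
    if value < 0:
        value += num_examples  # negative bounds count from the end, as in Python slices
    return max(0, min(num_examples, value))


def resolve_split(spec: str, num_examples: int) -> Tuple[str, int, int]:
    """Returns (split name, start index, stop index) for a split spec string."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognised split spec: {spec!r}")
    start = _to_index(match.group("start"), num_examples, 0)
    stop = _to_index(match.group("stop"), num_examples, num_examples)
    return match.group("name"), start, stop


print(resolve_split("train[40%:]", 1000))  # ('train', 400, 1000)
print(resolve_split("train[:10%]", 1000))  # ('train', 0, 100)
print(resolve_split("train", 1000))        # ('train', 0, 1000)
```

Because the spec resolves to a contiguous index range, a reader can in principle skip whole shards that fall outside the range instead of filtering every record.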

We're going to add docs as it gets deployed on more datasets.

Conchylicultor on 19 Jun 2019

👍1

@Conchylicultor thank you! I'll give this a try on nightly + looking forward to the release. :-)

ludwigschubert on 19 Jun 2019

Be aware that this is still considered experimental.
The hashing algorithm used to shuffle data is most probably going to be changed (cf. issue #653), which will change the order of the data (and the set of examples returned when using the slicing API).

There is no official documentation for this API yet, but https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/tfrecords_reader_test.py#L160 and the lines that follow can help you figure things out.

Feedback is appreciated, thanks!

pierrot0 on 21 Jun 2019

Closing this issue: the new subsplit API has been out for a while and is about to become the default everywhere.

pierrot0 on 5 Dec 2019
