Datasets: tfds.Split.ALL does not work

Created on 7 Feb 2020 · 7 Comments · Source: tensorflow/datasets

Short description

splits.py refers to a special split, tfds.Split.ALL, that's supposed to contain all splits merged together. Currently, this does not work, raising the error:

Requested split "all" does not exist.

Environment information

  • tensorflow-datasets version: 2.0.0

Reproduction instructions

import tensorflow_datasets as tfds

tfds.load(name='omniglot', split='all')

All 7 comments

From my understanding of tfds.Split.ALL, other parts of the code say that it's a special keyword that cannot be used as a key, so split='all' cannot be used the way your reproduction instructions use it. If you call just tfds.load(name='omniglot'), split takes its default value and all splits are returned. Hope that helps.

Hey @mrahtz, here is an example with the mnist dataset that uses the data from all of its splits. If you want a single dataset that contains every split, it may help you.

import tensorflow_datasets as tfds

# Merge the train and test splits with the '+' split syntax.
ds = tfds.load("mnist", split="train+test")
print(len(ds))

Output: 70000

For context, this was a regression when we switched to our new reading pipeline.

The question is: should we restore this feature to allow tfds.load(..., split='all')?
Our internal statistics suggest that this feature has almost never been used, so it hasn't been prioritised.
The alternative is to delete ALL entirely.

Please +1 this issue if you're interested, so we can evaluate the demand.

(Personally, I don't need this feature - I only reported it because I thought it might be a regression. I'd be happy with this issue being closed.)

Thanks for the update. I removed Split.ALL from the API and the doc. So hopefully users will stop being confused about this.

If anyone is looking for this in the future, you can instead concatenate each split in one line of code:

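import tensorflow_datasets as tfds

# With no split argument, tfds.load returns a dict of splits ('my_dataset' is a placeholder).
datasets = tfds.load('my_dataset')
ds_train = datasets['train']
ds_test = datasets['test']
ds_valid = datasets['validation']

# Chain the splits one after another into a single dataset.
ds = ds_train.concatenate(ds_test).concatenate(ds_valid)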

SOURCE: https://stackoverflow.com/questions/56546672/how-can-i-merge-two-or-more-tensorflow-datasets
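
If the split names are not known in advance, the same concatenation can be written generically. This is only a sketch: concatenate_splits is a made-up helper name, and it assumes every value in the mapping exposes a concatenate method the way tf.data.Dataset does.

```python
from functools import reduce


def concatenate_splits(datasets):
    """Chain every dataset in a {split_name: dataset} mapping, in key order.

    Hypothetical helper: it only relies on each value having a
    `concatenate(other)` method, as tf.data.Dataset does.
    """
    parts = list(datasets.values())
    if not parts:
        raise ValueError("no splits to concatenate")
    # Fold the list left to right: ((first + second) + third) + ...
    return reduce(lambda merged, ds: merged.concatenate(ds), parts)
```

With TFDS installed, something like ds = concatenate_splits(tfds.load('my_dataset')) should behave like the three-split version above, since tfds.load without a split argument returns a dict keyed by split name. The splits are still read one after another.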

@ibarrond
Note that

ds = tfds.load(..., split='train+test+validation')

might be more performant. The generic form would be:

builder = tfds.builder('my_dataset')
ds = builder.as_dataset('+'.join(builder.info.splits.keys()))

Also, in your code above, each split is read sequentially (all of train, then all of test, ...), while users might want to shuffle between splits (possible with shuffle_files=True).
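
Putting the generic form and shuffle_files together, a rough sketch could look like the following. The helper names merged_split_spec and load_all_splits are made up for illustration; tfds.builder, download_and_prepare, info.splits and as_dataset are the existing TFDS calls. The import is done lazily only so that the string helper can be used without TFDS installed.

```python
def merged_split_spec(split_names):
    """Join split names into a '+' split expression, e.g. 'train+test'."""
    names = list(split_names)
    if not names:
        raise ValueError("at least one split name is required")
    return "+".join(names)


def load_all_splits(name, shuffle_files=False):
    """Load every split of a TFDS dataset as one tf.data.Dataset (hypothetical helper)."""
    # Imported here so merged_split_spec works without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(name)
    builder.download_and_prepare()  # generates the data on first use
    spec = merged_split_spec(builder.info.splits.keys())
    # shuffle_files=True shuffles the order in which input files are read.
    return builder.as_dataset(split=spec, shuffle_files=shuffle_files)
```

For example, load_all_splits('mnist', shuffle_files=True) would request the split 'train+test'. How much mixing between splits this gives depends on how many files the dataset is stored in, so a ds.shuffle(...) call may still be wanted on top.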
