Hi,
As it stands, I don't think PyTorch Ignite is compatible with the IterableDataset class (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#IterableDataset), since this dataset doesn't have the __len__ attribute. To run Ignite with the usual train/evaluate handlers, the Engine would need to be modified to accept a user-specified epoch size. Is this in the works? Otherwise I can probably work up a PR for it this week.
Thanks!
Hi @PCerles ,
Yes, Engine requires a sized iterable object; usually a DataLoader is provided, which has __len__.
If you would like to work directly with an IterableDataset, you can do it like this:
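from ignite.engine import Engine

ds = MyIterableDataset()

# make infinite iterable
def cycle(ds):
    while True:
        for i in ds:
            yield i

ds_iter = cycle(ds)

def update_fn(engine, i):
    # `i` is only a counter; the real data comes from the iterable dataset
    data = next(ds_iter)
    # do whatever you intended

trainer = Engine(update_fn)
# a sized dummy iterable that only defines the epoch length
iters_per_epoch = list(range(1000))
trainer.run(iters_per_epoch, max_epochs=5)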
HTH
Thanks, this works.
Are there plans for a non-hack solution?
Note that this hack also works:
trainer.run(map(lambda x: x, train_data_loader), max_epochs=1, epoch_length=2)
@samedii the recently refactored engine can now work with iterators, and epoch_length controls the epoch size if __len__ is not available.
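To make the idea concrete, here is a minimal pure-Python sketch of how an epoch length can stand in for __len__ when the data source is a plain iterator. The helper run_epochs and the generator squares are hypothetical names made up for illustration; this is not Ignite's actual implementation, only the general pattern.

```python
from itertools import islice


def squares():
    """A length-less data source: a generator has no __len__."""
    n = 0
    while True:
        yield n * n
        n += 1


def run_epochs(data, max_epochs, epoch_length):
    """Hypothetical helper: take `epoch_length` items per epoch from one shared iterator."""
    iterator = iter(data)
    history = []
    for _ in range(max_epochs):
        items = list(islice(iterator, epoch_length))
        if len(items) < epoch_length:
            break  # the source ran dry before the epoch completed
        history.append(items)
    return history


epochs = run_epochs(squares(), max_epochs=2, epoch_length=3)
print(epochs)  # [[0, 1, 4], [9, 16, 25]]
```

The iterator is created once and shared across epochs, so the second epoch continues where the first one stopped instead of restarting the data source.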
@vfdev-5 Great! I started on a PR but then saw that you had already started making changes. I see that epoch_length is not used in _from_iteration. I think using it there would solve the remaining issues.
I started with a check isinstance(self.state.dataloader.dataset, torch.utils.data.IterableDataset) earlier but it failed later because it's still trying to look for the __len__ attribute in those functions.
@samedii yes, there are two places where the code implicitly assumes that the dataset inside the dataloader is map-style:
Could you tell me the use case where such a structure is helpful, I mean a DataLoader on an IterableDataset? Maybe it would be good to check more such cases and provide a bit more support.
@vfdev-5 Sure! Our use case is when we have multiple datasets that we want to merge in some manner.
This can be because we want to create balanced batches of different classes or data sources.
E.g. an advanced example:
It is very nice then to be able to define a sampling strategy for data source A, B and semi-supervised separately and then merge them later.
This is one of the things done in tensorflow that are quite nice.
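As a rough illustration of merging separately defined sources into balanced batches, here is a small pure-Python sketch. The function balanced_batches and the toy sources are hypothetical; in practice each source could be its own IterableDataset with its own sampling strategy, and this sketch only shows the merging step.

```python
from itertools import cycle, islice


def balanced_batches(sources, counts):
    """Yield batches holding counts[i] items from sources[i]; each source is cycled endlessly."""
    iterators = [cycle(src) for src in sources]
    while True:
        batch = []
        for it, n in zip(iterators, counts):
            batch.extend(islice(it, n))  # take exactly n items from this source
        yield batch


source_a = ["a0", "a1", "a2"]
source_b = ["b0", "b1"]
# Two items from A and one item from B in every batch.
batches = list(islice(balanced_batches([source_a, source_b], counts=[2, 1]), 3))
print(batches)
# [['a0', 'a1', 'b0'], ['a2', 'a0', 'b1'], ['a1', 'a2', 'b0']]
```

Because the merged stream is infinite and has no __len__, it is exactly the kind of input that needs an explicit epoch length.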
Alternative hack :)
delattr(torch.utils.data.DataLoader, '__len__')
trainer.run(train_data_loader, max_epochs=2, epoch_length=2)
Edit: Never mind, this hack only works with my fork
@samedii thanks for the details! Yes, I see. It's true that a specific sampler in the data loader won't be simple to set up when working with multiple sources... (It may be tricky, but a naive solution could also be to concatenate all sources into a single dataset and then define weights per class and per source index...)
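A minimal sketch of that naive approach, assuming the sources have already been concatenated and every sample carries a source label: weight each sample by the inverse size of its source, then draw indices with those weights. The helper per_sample_weights is a hypothetical name; random.choices stands in for a weighted sampler here (in PyTorch, torch.utils.data.WeightedRandomSampler plays that role).

```python
import random
from collections import Counter


def per_sample_weights(source_labels):
    """Weight each sample by 1 / (size of its source), so sources are drawn equally often in expectation."""
    counts = Counter(source_labels)
    return [1.0 / counts[label] for label in source_labels]


# Concatenated dataset: 8 samples from source A followed by 2 from source B.
source_labels = ["A"] * 8 + ["B"] * 2
weights = per_sample_weights(source_labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
indices = rng.choices(range(len(source_labels)), weights=weights, k=10000)
drawn = Counter(source_labels[i] for i in indices)
print(drawn["A"] / 10000, drawn["B"] / 10000)  # both close to 0.5
```

This balances sources only in expectation over many draws; it does not guarantee a fixed composition inside every batch, which is one reason merging separate iterable sources can be more convenient.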
This is one of the things done in tensorflow that are quite nice.
Can you provide a link to where TensorFlow does this? Thanks
Edit: The end result is just a generator of batches. Simple example:
my_dataset = (
    tf.data.Dataset.from_tensors(something)
    .repeat()
    .shuffle(buffer_size)  # shuffle() requires a buffer size
    .prefetch(buffer_size)
    .batch(batch_size)
)
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch
Here are some links and examples of how they might be used. I might merge a normal dataset with a mixup dataset with the functions below:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#prefetch
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle
def shuffle_dataset(ds, buffer_size):
    return (
        ds
        .repeat()
        .shuffle(
            buffer_size,
            reshuffle_each_iteration=True,
            # seed=seed
        )
    )
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map
from functools import reduce

def merge_datasets(datasets, ns):
    # take n items from each dataset, then flatten them into one stream
    return (
        tf.data.Dataset.zip(tuple(ds.batch(n) for ds, n in zip(datasets, ns) if n >= 1))
        .flat_map(lambda *batches: reduce(tf.data.Dataset.concatenate, [
            tf.data.Dataset.from_tensors(batch).unbatch()
            for batch in batches
        ]))
    )
def mixup_items(items, weight_func):
    weights = weight_func()
    weights /= tf.reduce_sum(weights)
    return tuple([
        tf.einsum('i...,i->...', tf.stack(variable, axis=0), weights)
        for variable in zip(*items)
    ])

def mixup_datasets(datasets, weight_func=None):
    if weight_func is None:
        weight_func = get_mixup_weights
    return (
        tf.data.Dataset.zip(datasets)
        .map(lambda *items: mixup_items(items, weight_func))
    )
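For readers less familiar with the einsum in mixup_items, here is a framework-free sketch of the same computation on plain lists: normalize the weights, then take the weighted sum of each field across the items. The function mixup and the toy data are hypothetical and exist only to show the arithmetic, not to replace the TensorFlow code.

```python
def mixup(items, weights):
    """Blend items field by field using normalized weights.

    `items` is a list of tuples such as (features, label); every field is a list of floats.
    """
    total = sum(weights)
    normalized = [w / total for w in weights]
    return tuple(
        [sum(w * value for w, value in zip(normalized, column)) for column in zip(*field)]
        for field in zip(*items)  # group the same field from every item
    )


items = [
    ([1.0, 0.0], [1.0, 0.0]),  # (features, one-hot label) from dataset 1
    ([0.0, 1.0], [0.0, 1.0]),  # (features, one-hot label) from dataset 2
]
mixed_features, mixed_label = mixup(items, weights=[3.0, 1.0])
print(mixed_features, mixed_label)  # [0.75, 0.25] [0.75, 0.25]
```

Features and labels are blended with the same weights, which is what keeps the mixed label consistent with the mixed input.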
@samedii thanks