Ignite: Compatibility with IterableDataset

Created on 10 Sep 2019 · 11 Comments · Source: pytorch/ignite

Hi,
As it stands, I don't think PyTorch Ignite is compatible with the IterableDataset class (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#IterableDataset), since this dataset doesn't define __len__. To run Ignite with the usual train/evaluate handles, the Engine would need to be modified to accept a user-specified epoch size. Is this in the works? Otherwise, I can probably work up a PR for it this week.
Thanks!

question

PCerles

All 11 comments

Hi @PCerles ,

Yes, Engine requires a sized iterable object; usually a DataLoader is provided, which has __len__.

If you would like to work directly with an IterableDataset, you can do it like this:

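from ignite.engine import Engine

ds = MyIterableDataset()

# Make the dataset an infinite iterable by restarting it when exhausted
def cycle(iterable):
    while True:
        for i in iterable:
            yield i

ds_iter = cycle(ds)

def update_fn(engine, i):
    # `i` is only an iteration counter; the real data comes from ds_iter
    data = next(ds_iter)
    # do whatever you intended

trainer = Engine(update_fn)

# A sized dummy iterable that defines the epoch length (1000 iterations)
iters_per_epoch = list(range(1000))
trainer.run(iters_per_epoch, max_epochs=5)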
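The cycling part of the snippet above can be checked without PyTorch or Ignite. The following is a minimal sketch in which `CountingStream` is a hypothetical stand-in for an IterableDataset (iterable, but with no `__len__`); it is not part of either library.

```python
class CountingStream:
    """Hypothetical stand-in for an IterableDataset: iterable, no __len__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # A fresh iterator is created on every pass over the data
        return iter(range(self.n))


def cycle(iterable):
    # Restart the iterable each time it is exhausted
    while True:
        for item in iterable:
            yield item


stream = cycle(CountingStream(3))
first_seven = [next(stream) for _ in range(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```

Unlike `itertools.cycle`, which replays a saved copy of the first pass, this loop calls `iter()` on the dataset again on every pass, so a dataset that shuffles or streams in `__iter__` would produce fresh data each time.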

HTH

vfdev-5 on 10 Sep 2019

Thanks, this works.

PCerles on 10 Sep 2019

Are there plans for a non-hack solution?

samedii on 22 Jan 2020

Note that this hack also works:

trainer.run(map(lambda x: x, train_data_loader), max_epochs=1, epoch_length=2)
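Presumably this works because wrapping the loader in `map` yields a plain iterator without `__len__`, so the epoch size has to come from `epoch_length`. A small sketch of that effect, where the list is a hypothetical stand-in for `train_data_loader`:

```python
# Hypothetical stand-in for a sized data loader
train_data = ["batch0", "batch1", "batch2"]

# Wrapping in map() produces a lazy iterator that no longer exposes __len__
wrapped = map(lambda x: x, train_data)

sized_before = hasattr(train_data, "__len__")
sized_after = hasattr(wrapped, "__len__")
items = list(wrapped)  # the items themselves are unchanged
```

Here `sized_before` is true and `sized_after` is false, while `items` still holds the same three batches.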

samedii on 22 Jan 2020

@samedii the recently refactored Engine can now work with iterators, and epoch_length controls the epoch size when __len__ is not available.
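To illustrate the idea of an explicit epoch length over unsized data, here is a framework-free sketch. `run_epochs` and `Stream` are hypothetical names and this is not Ignite's implementation; restarting the data mid-epoch when it runs out is an assumption made for the example.

```python
from itertools import islice


class Stream:
    """Hypothetical re-iterable data source without __len__."""

    def __iter__(self):
        return iter([0, 1, 2, 3, 4])


def run_epochs(data, max_epochs, epoch_length):
    """Split an unsized iterable into epochs of exactly epoch_length items."""
    iterator = iter(data)
    epochs = []
    for _ in range(max_epochs):
        batches = list(islice(iterator, epoch_length))
        if len(batches) < epoch_length:
            # Data exhausted mid-epoch: restart it and fill up the epoch
            # (assumes one pass holds at least epoch_length items)
            iterator = iter(data)
            batches.extend(islice(iterator, epoch_length - len(batches)))
        epochs.append(batches)
    return epochs


epochs = run_epochs(Stream(), max_epochs=3, epoch_length=2)
print(epochs)  # [[0, 1], [2, 3], [4, 0]]
```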

vfdev-5 on 22 Jan 2020

@vfdev-5 Great! I started on a PR but then saw that you had already started making changes. I see that epoch_length is not used in _from_iteration. I think using it there would solve the remaining issues.

I started with a check isinstance(self.state.dataloader.dataset, torch.utils.data.IterableDataset) earlier but it failed later because it's still trying to look for the __len__ attribute in those functions.

samedii on 22 Jan 2020

@samedii yes, there are two places where the code implicitly assumes that the dataset inside the dataloader is map-style:

when we replace batch_sampler with an index-reproducible batch sampler
when we set up the data from the starting iteration

Could you tell me the use case where such a structure is helpful, I mean a DataLoader on an IterableDataset? Maybe it would be good to check more such cases and provide a bit more support.

vfdev-5 on 22 Jan 2020

@vfdev-5 Sure! Our use case is when we have multiple datasets that we want to merge in some manner.

This can be because we want to create balanced batches of different classes or data sources.

E.g. an advanced example:

One sample from each class
Samples might be split third-third-third between data source A, data source B and semi-supervised

It is very nice then to be able to define a sampling strategy for data source A, B and semi-supervised separately and then merge them later.
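The merging described above can be sketched as a plain generator, which could serve as the body of an IterableDataset's `__iter__`. All names here (`merged_batches`, the toy sources) are hypothetical, and the sketch assumes every source is non-empty and can be iterated repeatedly.

```python
from itertools import islice


def cycle(iterable):
    # Restart a source whenever it is exhausted (source must be non-empty)
    while True:
        for item in iterable:
            yield item


def merged_batches(sources, counts):
    """Yield batches containing counts[i] samples from sources[i]."""
    streams = [cycle(source) for source in sources]
    while True:
        batch = []
        for stream, n in zip(streams, counts):
            # islice consumes exactly n items from the shared stream
            batch.extend(islice(stream, n))
        yield batch


source_a = ["a0", "a1", "a2"]
source_b = ["b0", "b1"]
semi_supervised = ["s0"]

batches = merged_batches([source_a, source_b, semi_supervised], counts=[1, 1, 1])
first_two = list(islice(batches, 2))
print(first_two)  # [['a0', 'b0', 's0'], ['a1', 'b1', 's0']]
```

Each source keeps its own iteration order, so a separate sampling strategy per source can live inside that source's own iterator.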

This is one of the things done in tensorflow that are quite nice.

Alternative hack :)

delattr(torch.utils.data.DataLoader, '__len__')
trainer.run(train_data_loader, max_epochs=2, epoch_length=2)

Edit: Never mind, this hack only works with my fork

samedii on 22 Jan 2020

@samedii thanks for the details! Yes, I see. It's true that a specific sampler in the data loader won't be simple to set up when working with multiple sources... (It may be tricky; a naive solution could also be to concatenate all sources into a single dataset and then define weights per class and per source index...)

"This is one of the things done in tensorflow that are quite nice."

Can you provide a link to where TensorFlow does this? Thanks
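The naive alternative mentioned here can be sketched with the standard library alone. The data and weights below are made up for illustration; in PyTorch the same idea would roughly correspond to combining `torch.utils.data.ConcatDataset` with `torch.utils.data.WeightedRandomSampler`.

```python
import random

# Two hypothetical sources of very different sizes
source_a = [("a", i) for i in range(90)]
source_b = [("b", i) for i in range(10)]
combined = source_a + source_b

# Weight each sample by 1 / (size of its source) so that both sources
# are drawn with equal total probability despite the size imbalance
weights = [1 / len(source_a)] * len(source_a) + [1 / len(source_b)] * len(source_b)

rng = random.Random(0)  # seeded for reproducibility
indices = rng.choices(range(len(combined)), weights=weights, k=1000)
share_b = sum(combined[i][0] == "b" for i in indices) / len(indices)
```

With these weights, `share_b` should land near 0.5 even though source B makes up only 10% of the combined data. Per-class weights could be multiplied in the same way.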

vfdev-5 on 22 Jan 2020

Edit: The end result is just a generator of batches. Simple example:

my_dataset = (
    tf.data.Dataset.from_tensors(something)
    .repeat()
    .shuffle(buffer_size)  # shuffle() requires a buffer size
    .prefetch(buffer_size)
    .batch(batch_size)
)

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch

Here are some links and examples of how they might be used. I might merge a normal dataset with a mixup dataset with the functions below:

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#prefetch
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle

def shuffle_dataset(ds, buffer_size):
    return (
        ds
        .repeat()
        .shuffle(
            buffer_size,
            reshuffle_each_iteration=True,
            # seed=seed
        )
    )

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map

from functools import reduce

def merge_datasets(datasets, ns):
    return (
        tf.data.Dataset.zip(tuple(ds.batch(n) for ds, n in zip(datasets, ns) if n >= 1))
        .flat_map(lambda *batches: reduce(tf.data.Dataset.concatenate, [
            tf.data.Dataset.from_tensors(batch).unbatch()
            for batch in batches
        ]))
    )

def mixup_items(items, weight_func):
    weights = weight_func()
    weights /= tf.reduce_sum(weights)

    return tuple([
        tf.einsum('i...,i->...', tf.stack(variable, axis=0), weights)
        for variable in zip(*items)
    ])


def mixup_datasets(datasets, weight_func=None):
    if weight_func is None:
        weight_func = get_mixup_weights
    return (
        tf.data.Dataset.zip(datasets)
        .map(lambda *items: mixup_items(items, weight_func))
    )
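For readers unfamiliar with the einsum in mixup_items: `'i...,i->...'` is a weighted sum over the stacked items. A framework-free sketch of the same computation on plain lists, with a hypothetical `mixup` helper and toy data:

```python
def mixup(items, weights):
    """Blend items field by field using normalized weights."""
    total = sum(weights)
    norm = [w / total for w in weights]
    # zip(*items) groups matching fields, e.g. all inputs, then all labels
    return tuple(
        [sum(w * v for w, v in zip(norm, values)) for values in zip(*field)]
        for field in zip(*items)
    )


# Two (input, one-hot label) pairs
item_1 = ([1.0, 2.0], [1.0, 0.0])
item_2 = ([3.0, 4.0], [0.0, 1.0])

mixed = mixup([item_1, item_2], weights=[1, 3])
print(mixed)  # ([2.5, 3.5], [0.25, 0.75])
```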

samedii on 22 Jan 2020

@samedii thanks

vfdev-5 on 22 Jan 2020
