Ignite: This is breaking my IterableDataset use case: is it a bug or intended?

Created on 2 Jun 2020 · 15 Comments · Source: pytorch/ignite

Looking at the following line of code:
https://github.com/pytorch/ignite/blob/523798c13a9465f4c95565cb54ae3c5aa0786ac6/ignite/engine/engine.py#L742
It breaks out of the loop without resetting the dataloader (that is, without calling self.set_data(self.state.dataloader)). This breaks all my use cases, where my own dataloader implementation raises StopIteration only once. Is this a bug or intended?

All 15 comments

@snie2012 yes, it is intended because what happens (as I understand it) is the following:

except StopIteration:
    # Define self.state.epoch_length if it is not yet set
    if self.state.epoch_length is None:
        # Define epoch length and stop the epoch
        self.state.epoch_length = iter_counter
        break

The data iterator is exhausted and StopIteration is raised. We then see that epoch_length is not defined => we define it => the epoch is done => we need to quit the single-epoch loop.
Calling iter(dataloader) again at this point does not make much sense if the dataloader is exhausted. On the next epoch, the engine calls next(dataloader_iter) => dataloader_iter is still empty => the DATALOADER_STOP_ITERATION event is fired => the user should reset the dataloader on this event => the engine retries next(dataloader_iter).

The way to use all this with finite iterators of unknown length is described here: https://pytorch.org/ignite/master/faq.html#finite-iterator-with-unknown-length
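
The control flow described above can be mimicked in plain Python. This is only a simplified sketch, not Ignite's actual implementation: `run_epochs`, `reset_data` and `one_shot` are hypothetical names, and the reset hook merely stands in for a handler attached to DATALOADER_STOP_ITERATION.

```python
def run_epochs(data, max_epochs, reset_data=None):
    """Toy epoch loop over data whose length is not known in advance.

    Returns a list of (epoch, iteration, batch) tuples.
    """
    seen = []
    data_iter = iter(data)
    epoch_length = None  # stays unknown until the iterator is exhausted once
    iteration = 0
    for epoch in range(1, max_epochs + 1):
        iter_counter = 0
        already_reset = False
        while epoch_length is None or iter_counter < epoch_length:
            try:
                batch = next(data_iter)
            except StopIteration:
                if epoch_length is None:
                    # First exhaustion: record the epoch length and end the epoch.
                    epoch_length = iter_counter
                    break
                if already_reset:
                    # Still empty right after a reset: nothing left to run on.
                    return seen
                # Let the caller supply fresh data (hypothetical hook), then retry.
                if reset_data is not None:
                    data = reset_data()
                data_iter = iter(data)
                already_reset = True
                continue
            already_reset = False
            iter_counter += 1
            iteration += 1
            seen.append((epoch, iteration, batch))
    return seen


def one_shot():
    # A generator can be consumed only once; iter() on it does not rewind it.
    return (i for i in range(3))


restartable = run_epochs(range(3), max_epochs=2)
exhausted = run_epochs(one_shot(), max_epochs=2)
with_reset = run_epochs(one_shot(), max_epochs=2, reset_data=one_shot)
```

In this sketch, a re-iterable object such as range restarts on its own when iter() is called again, while a one-shot generator leaves the second epoch without data unless the reset hook supplies a fresh one.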

Please let me know whether this helps or whether your use cases are different.

@vfdev-5 Thanks for the quick response! I am a bit confused. My understanding is that when dataloader is exhausted, calling iter(dataloader) will do the resetting. Am I missing something?

> @vfdev-5 Thanks for the quick response! I am a bit confused. My understanding is that when dataloader is exhausted, calling iter(dataloader) will do the resetting. Am I missing something?

Yes, you are right about iter(dataloader) if dataloader is a torch DataLoader. I missed the fact that you have an IterableDataset.

from torch.utils.data import DataLoader, IterableDataset
from ignite.engine import Engine

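class MyIterableDataset(IterableDataset):
    def __init__(self):
        # Zero-argument super() actually runs the parent initializer
        super().__init__()
        self.start = 0
        self.end = 7  # unknown for user

    def __iter__(self):
        # Return a fresh iterator on every call so each epoch restarts from self.start
        return iter(range(self.start, self.end))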

num_workers = 0
ds = MyIterableDataset()
data_loader = DataLoader(ds, num_workers=num_workers)

def foo(e, b):
    print("{}-{}: {}".format(e.state.epoch, e.state.iteration, b))

engine = Engine(foo)
engine.run(data_loader, epoch_length=None, max_epochs=5)

Here is the output, which is what I would expect:

1-1: tensor([0])
1-2: tensor([1])
1-3: tensor([2])
1-4: tensor([3])
1-5: tensor([4])
1-6: tensor([5])
1-7: tensor([6])
2-8: tensor([0])
2-9: tensor([1])
2-10: tensor([2])
...
5-32: tensor([3])
5-33: tensor([4])
5-34: tensor([5])
5-35: tensor([6])

Could you please detail what is not working for your use case? Thanks!

@vfdev-5 Never mind, I think I can fix this on my end.

A separate but related issue: DDP with unknown length IterableDataset fails. I tried multiple runs: the first epoch works fine, but in the second epoch there is simply no data, so the engine either silently quits or causes a metric computation error (no data to compute). I am not familiar with the DDP code in ignite, but this sounds very similar to the issue we are discussing here. Somehow the dataloader is not reset for the second epoch.

> @vfdev-5 Never mind, I think I can fix this on my end.

I'm still curious about how you use those IterableDatasets and where ignite fails (without DDP).

> A separate but related issue: DDP with unknown length IterableDataset fails.

@snie2012 thanks for pointing this out. Let's investigate it. Could you please provide a minimal snippet of your DataLoader? Is it different from the toy one I used above?
Thanks

> I'm still curious about how you use those IterableDatasets and where ignite fails (without DDP).

For this, the root cause is that my own implementation of IterableDataset only raises StopIteration once. The second time it's called, it throws an error.
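
One way to sidestep this kind of problem, sketched in plain Python under the assumption that the underlying stream can be reopened: build a fresh iterator on every `__iter__` call instead of handing out the same exhausted one. `ReIterable`, `make_iterator` and `read_samples` are hypothetical names used only for illustration; the same idea would apply inside an IterableDataset's `__iter__`.

```python
class ReIterable:
    """Wraps a zero-argument factory so each iter() call starts a fresh pass."""

    def __init__(self, make_iterator):
        self.make_iterator = make_iterator

    def __iter__(self):
        # Call the factory again rather than reusing a spent iterator.
        return iter(self.make_iterator())


def read_samples():
    # Stand-in for a stream whose length is not known in advance.
    yield from range(3)


# A bare generator can be consumed only once.
one_shot = read_samples()
first_pass = list(one_shot)
second_pass = list(one_shot)  # the generator is already exhausted here

# The wrapper restarts the stream on every pass.
restartable = ReIterable(read_samples)
passes = [list(restartable), list(restartable)]
```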

> @snie2012 thanks for pointing this out. Let's investigate it. Could you please provide a minimal snippet of your DataLoader? Is it different from the toy one I used above?

Yes, in this case the DataLoader is the same as in your example. Let me know, thanks!

Do you have any successful use cases with DDP and unknown length IterableDataset?

@snie2012 here is working code for DDP with the "gloo" backend on 4 processes, 1 node:


Code

import os

import time
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset

from ignite.engine import Engine


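class MyIterableDataset(IterableDataset):
    def __init__(self, offset):
        # Zero-argument super() actually runs the parent initializer
        super().__init__()
        self.start = offset
        self.end = offset + 7  # unknown for user

    def __iter__(self):
        # Return a fresh iterator on every call so each epoch restarts from self.start
        return iter(range(self.start, self.end))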

if __name__ == "__main__":
    import ignite
    print(ignite.__version__)

    device = "cpu"    
    local_rank = os.environ["LOCAL_RANK"]

    dist.init_process_group("gloo", init_method="env://")

    rank = dist.get_rank()
    ws = dist.get_world_size()

    time.sleep(rank * 0.01)
    print("Dist Info: ", rank, ws)
    dataset = MyIterableDataset(offset=rank * ws * 10)    
    data_loader = DataLoader(dataset, num_workers=0, batch_size=4)

    model = nn.Linear(10, 10).to(device)
    model = nn.parallel.DistributedDataParallel(model)

    opt = optim.SGD(model.parameters(), lr=0.01)

    def foo(e, b):

        opt.zero_grad()
        x = torch.rand(10).to(device)
        y_pred = model(x)
        loss = y_pred.sum()
        loss.backward()
        opt.step()

        # for printing purposes
        time.sleep(rank * 0.01)
        print("{}:: {}-{}: {}".format(rank, e.state.epoch, e.state.iteration, b))

    engine = Engine(foo)
    engine.run(data_loader, epoch_length=None, max_epochs=5)

    dist.destroy_process_group()    



Output:

$ python -m torch.distributed.launch --nproc_per_node=4 --use_env issue-1094-iterable-dataset-DDP.py 

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
0.4.0.dev20200603
0.4.0.dev20200603
0.4.0.dev20200603
0.4.0.dev20200603
Dist Info:  0 4
Dist Info:  1 4
Dist Info:  2 4
Dist Info:  3 4
0:: 1-1: tensor([0, 1, 2, 3])
1:: 1-1: tensor([40, 41, 42, 43])
2:: 1-1: tensor([80, 81, 82, 83])
3:: 1-1: tensor([120, 121, 122, 123])
0:: 1-2: tensor([4, 5, 6])
1:: 1-2: tensor([44, 45, 46])
2:: 1-2: tensor([84, 85, 86])
3:: 1-2: tensor([124, 125, 126])
0:: 2-3: tensor([0, 1, 2, 3])
1:: 2-3: tensor([40, 41, 42, 43])
2:: 2-3: tensor([80, 81, 82, 83])
3:: 2-3: tensor([120, 121, 122, 123])
0:: 2-4: tensor([4, 5, 6])
1:: 2-4: tensor([44, 45, 46])
2:: 2-4: tensor([84, 85, 86])
3:: 2-4: tensor([124, 125, 126])
0:: 3-5: tensor([0, 1, 2, 3])
1:: 3-5: tensor([40, 41, 42, 43])
2:: 3-5: tensor([80, 81, 82, 83])
3:: 3-5: tensor([120, 121, 122, 123])
0:: 3-6: tensor([4, 5, 6])
1:: 3-6: tensor([44, 45, 46])
2:: 3-6: tensor([84, 85, 86])
3:: 3-6: tensor([124, 125, 126])
0:: 4-7: tensor([0, 1, 2, 3])
1:: 4-7: tensor([40, 41, 42, 43])
2:: 4-7: tensor([80, 81, 82, 83])
3:: 4-7: tensor([120, 121, 122, 123])
0:: 4-8: tensor([4, 5, 6])
1:: 4-8: tensor([44, 45, 46])
2:: 4-8: tensor([84, 85, 86])
3:: 4-8: tensor([124, 125, 126])
0:: 5-9: tensor([0, 1, 2, 3])
1:: 5-9: tensor([40, 41, 42, 43])
2:: 5-9: tensor([80, 81, 82, 83])
3:: 5-9: tensor([120, 121, 122, 123])
0:: 5-10: tensor([4, 5, 6])
1:: 5-10: tensor([44, 45, 46])
2:: 5-10: tensor([84, 85, 86])
3:: 5-10: tensor([124, 125, 126])


@vfdev-5 Thanks for the example! Is this the canonical setup for distributed training with ignite? I'd like to learn more about how to set up DDP in ignite. Are there any docs or code examples on this?

@snie2012 in this example, I just reproduced one of the PyTorch tutorials on distributed computation: a) create the process group, b) set up the distributed dataflow, c) destroy the group at the end. However, the part with MyIterableDataset is just my own guess at how to dispatch data over processes. Honestly, I do not know the canonical way to set up IterableDataset, DistributedSampler and DataLoader together...
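
As an alternative to the per-rank offset used above, a stream of unknown length can be split by letting each process keep every world_size-th item. This is a minimal plain-Python sketch of that idea, not a canonical Ignite or PyTorch recipe; `shard` is a hypothetical helper, and `rank` and `world_size` are assumed to come from the process group.

```python
from itertools import islice


def shard(stream, rank, world_size):
    """Yield every world_size-th item of stream, starting at index rank."""
    return islice(stream, rank, None, world_size)


world_size = 4
# Simulate what each of the 4 processes would see from a 10-item stream.
shards = [list(shard(range(10), rank, world_size)) for rank in range(world_size)]
```

Note that ranks can end up with different numbers of items, which may need extra care when every process is expected to take the same number of steps.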

At the moment, Ignite does not provide any helpers for DDP; it just integrates into the existing PyTorch scheme. However, we are working on simplifying this by supporting GPUs and TPUs with the same code and minimal changes, for example: https://pytorch.org/ignite/master/distributed.html

Once we are done with development on that, the code could look like this: https://github.com/pytorch/ignite/blob/f64095ef2985d50d6a043caf719331cf4712904a/examples/contrib/new-cifar10/main.py but again, it is not a framework, just library helper tools that you are free to use, or not to use if things can be done better :)

@vfdev-5 Thanks for the clarification. In my case, I am manually calling mp.spawn(one_ddp_fn) to spawn the processes. IterableDataset is constructed inside one_ddp_fn. My understanding is that this is pretty much the same as your code example. Not sure what went wrong.

@snie2012 if you wish to share the code somewhere, we can debug it :)

I would like to, but there is some proprietary code I cannot share. I'll look into it on my end. Also, let me know if you find something! Thanks a lot for helping.

Hi @snie2012, were you able to solve the problem you had with IterableDataset and DDP?

@vfdev-5 I haven't had time to work on this. Feel free to close the issue for now.
