Pytorch-lightning: Extend docs with multiple dataloader with common cases

Created on 8 Mar 2020 · 18 comments · Source: PyTorchLightning/pytorch-lightning

I notice that one can evaluate the model on a list of validation/test data loaders. Is it also possible to extract data from multiple train dataloaders in the training step in the current version? This feature might be useful in tasks like transfer learning or semi-supervised learning, which usually maintain multiple datasets during training (e.g., source and target datasets in transfer learning, labeled and unlabeled datasets in semi-supervised learning).

It would be nice if one could obtain a list of batch data as follows:

```python
def training_step(self, batch_list, batch_nb_list):
    # batch_list = [batch_1, batch_2]
    x_1, y_1 = batch_list[0]
    x_2, y_2 = batch_list[1]
    loss = self.compute_some_loss(x_1, x_2, y_1, y_2)
    tensorboard_logs = {'train_loss': loss}
    return {'loss': loss, 'log': tensorboard_logs}

def train_dataloader(self):
    return [data_loader_1, data_loader_2]
```
Labels: enhancement, good first issue, question

Most helpful comment

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?

All 18 comments

Hi! thanks for your contribution!, great first issue!

Good point, having support also for multiple training dataloaders would be great, mind sending a PR?
just be aware that there is another open PR on dataloaders... #1104
cc: @PyTorchLightning/core-contributors

I'm interested in this task, but I have some questions.

  1. Do we assume the data loaders are of the same length? What should we do if one of them runs out of data?
  2. How long would an epoch be? The length of the shortest data loader?
  3. Would a more sensible design be:

```python
def training_step(self, batch, batch_idx: int, dataloader_idx: int):
    if dataloader_idx == 0:
        # supervised loss, for example
        ...
    elif dataloader_idx == 1:
        # unsupervised loss
        ...
```

Thanks for all the replies.

To @Dref360,

  1. I think that allowing the data loaders to have different lengths is more flexible, and each data loader can have its own batch size. In my opinion, a loader can simply reload its dataset after running out of data, so it doesn't depend on the other data loaders (see the sketch after this list).

  2. My previous experience is to use the length of the longest data loader (so the shorter loaders repeat within an epoch). But this needs more discussion.
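
For illustration, here is a minimal plain-PyTorch sketch of that reload-on-exhaustion idea (the loader names and dataset sizes are made up for the example): the epoch follows the longer loader, and the shorter one restarts whenever it runs out.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def cycle(loader):
    # restart the loader whenever it is exhausted, so it re-shuffles on every pass
    # (unlike itertools.cycle, which would cache the first pass in memory)
    while True:
        for batch in loader:
            yield batch


# hypothetical small (labeled) and large (unlabeled) datasets, just for the example
labeled_loader = DataLoader(TensorDataset(torch.arange(10)), batch_size=2, shuffle=True)
unlabeled_loader = DataLoader(TensorDataset(torch.arange(100)), batch_size=8, shuffle=True)

# one "epoch" is driven by the longer loader; the shorter one repeats as needed
for (unlabeled_x,), (labeled_x,) in zip(unlabeled_loader, cycle(labeled_loader)):
    print(labeled_x, unlabeled_x)
```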

I found a related discussion here. The first reply provided a solution for multiple datasets using torch.utils.data.Dataset. However, it assumes that the lengths of the data loaders are the same and that the index correspondence between datasets is fixed.

Therefore, I modified the provided code to be more flexible, as follows:

```python
import random

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, datasets):
        self.datasets = datasets

        self.map_indexes = [[] for _ in self.datasets]

        self.min_length = min(len(d) for d in self.datasets)
        self.max_length = max(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[m[i]] for d, m in zip(self.datasets, self.map_indexes))

    def construct_map_index(self):
        def update_indices(original_indexes, target_len, max_len):
            # map max_len to target_len (large to small)
            # return: a list that maps range(max_len) to valid indexes in the dataset

            original_indexes = original_indexes[max_len:]  # remove used indexes
            fill_num = max_len - len(original_indexes)
            batch = fill_num // target_len

            if fill_num % target_len != 0:
                # so that fill_num + len(original_indexes) is at least max_len
                batch += 1

            additional_indexes = list(range(target_len)) * batch
            random.shuffle(additional_indexes)

            original_indexes += additional_indexes

            assert len(original_indexes) >= max_len, "the length of mapping indexes is too small"

            return original_indexes

        self.map_indexes = [update_indices(m, len(d), self.max_length)
                            for m, d in zip(self.map_indexes, self.datasets)]

    def __len__(self):
        # will be called every epoch
        self.construct_map_index()
        return self.max_length
```

In this case, the index range of the CustomDataset is set to the length of the largest dataset. Therefore, some indexes might not be valid for some datasets. construct_map_index builds lists that map these excess indexes to available indexes, and the mapping is refreshed whenever self.__len__() is called.

Construct a single train loader using CustomDataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset_1 = TensorDataset(torch.arange(2))
dataset_2 = TensorDataset(torch.arange(3, 8))

dataset = CustomDataset([dataset_1, dataset_2])

dataloader = DataLoader(dataset, batch_size=3, shuffle=True)

for epoch in range(3):
    for batch in dataloader:
        print(batch)
```

Outputs

```
[[tensor([1, 1, 1])], [tensor([4, 7, 6])]]
[[tensor([0, 0])], [tensor([3, 5])]]

[[tensor([0, 0, 1])], [tensor([7, 3, 4])]]
[[tensor([1, 0])], [tensor([5, 6])]]

[[tensor([0, 0, 1])], [tensor([5, 7, 4])]]
[[tensor([1, 1])], [tensor([6, 3])]]
```

The primary deficiency of this code is that all datasets end up with the same batch size, and it might be a bit hard for users to read. I hope this is helpful for developing the feature!

@williamFalcon @tullie pls ^^

  1. in this case a custom dataloader that has two datasets in it is probably the best thing.

  2. if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

  3. in the case of your own dataloader, you can just cycle through the smallest dataset multiple times while cycling the large one (a minimal sketch of this follows below).
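
For reference, a bare-bones sketch of points 1 and 3 (the class name PairedDataset is invented for this sketch; it wraps indexes of the smaller dataset modulo its length, without the per-epoch reshuffling that the CustomDataset above adds):

```python
from torch.utils.data import Dataset


class PairedDataset(Dataset):
    """Pair every item of the larger dataset with a (wrapped-around) item of the smaller one."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b

    def __len__(self):
        # one epoch covers the larger dataset once; the smaller one is cycled
        return max(len(self.dataset_a), len(self.dataset_b))

    def __getitem__(self, index):
        return (self.dataset_a[index % len(self.dataset_a)],
                self.dataset_b[index % len(self.dataset_b)])
```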


Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

why don't we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? I'm just feeling a bit puzzled...

why don't we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

I totally agree with the idea about the new doc for data loaders.

Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? I'm just feeling a bit puzzled...

Is it because we would like to extract data from multiple datasets simultaneously in the training phase, while we usually loop over the datasets sequentially in the validation/testing phase (as in the evaluation step)?

exactly. i could be wrong, but in training we usually want to use both batches at once. in val/test we use them sequentially

if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

In semi-supervised learning, domain adaptation, consistency training, etc., it is typical to use samples from different loaders in the same training step to compute various cross-losses. Thus, alternating behaviour of the training step does not bring much usability improvement.
I understand that it is possible to shift the issue one step back and implement a custom Dataset and/or Sampler for such cases, but in my experience having multiple dataloaders is just more explicit and convenient.

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?
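
As a rough illustration of that comment (purely hypothetical, not the actual Lightning API; the CombinedLoader name and the mode argument are invented for this sketch), a wrapper could either truncate to the shortest loader or keep looping the shorter ones:

```python
class CombinedLoader:
    # hypothetical wrapper: mode="min" truncates to the shortest loader,
    # mode="max" keeps cycling the shorter loaders until the longest one finishes
    def __init__(self, loaders, mode="min"):
        self.loaders = loaders
        self.mode = mode

    def __len__(self):
        lengths = [len(loader) for loader in self.loaders]
        return min(lengths) if self.mode == "min" else max(lengths)

    def __iter__(self):
        if self.mode == "min":
            # zip stops as soon as the shortest loader is exhausted
            yield from zip(*self.loaders)
        else:
            iterators = [iter(loader) for loader in self.loaders]
            for _ in range(len(self)):
                batches = []
                for i, it in enumerate(iterators):
                    try:
                        batches.append(next(it))
                    except StopIteration:
                        # restart an exhausted (shorter) loader
                        iterators[i] = iter(self.loaders[i])
                        batches.append(next(iterators[i]))
                yield tuple(batches)
```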

A quick fix to get different batch sizes on the labeled and unlabeled dataloaders during training might be:
```python
def prepare_data(self):
    ...
    self.train_unlabeled_dataloader = torch.utils.data.DataLoader(train_unlabeled_dataset, ...)
    self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
    ...

def training_step(self, batch, batch_idx):
    inputs_x, targets = batch
    try:
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    except StopIteration:
        self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    unlabeled_x = unlabeled_x.type_as(inputs_x)
    ...
```
But as @soupault said, it will be much more convenient to have multiple train dataloaders.

In our active learning library baal, we are currently trying to come up with a solution to the same problem. In our case, one of the DataLoaders will be massively larger than the other. As a consequence, we added some optional features:

  • We put a probability of selecting data loader A vs B.
  • We set a maximum number of steps; otherwise, we stop when the smallest iterator is exhausted. This assumes that both loaders are using random selection.

Those two features are optional; if they are not provided, we simply alternate between the two loaders.

We provide an implementation in this gist: https://gist.github.com/Dref360/2524e524244569ed47428f19c487f264

I would appreciate your feedback! Thank you!
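
For context, a rough sketch of the sampling scheme described above (this is not the gist's actual implementation; the function name and arguments are invented for illustration):

```python
import random


def alternating_batches(loader_a, loader_b, p_a=None, max_steps=None):
    # if p_a is given, pick loader A with probability p_a at each step;
    # otherwise strictly alternate. Stops after max_steps, or as soon as
    # the selected iterator is exhausted.
    iter_a, iter_b = iter(loader_a), iter(loader_b)
    step = 0
    while max_steps is None or step < max_steps:
        use_a = random.random() < p_a if p_a is not None else step % 2 == 0
        try:
            yield next(iter_a) if use_a else next(iter_b)
        except StopIteration:
            return
        step += 1
```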

I see that https://github.com/PyTorchLightning/pytorch-lightning/pull/1416 has been merged. Should we close this as well?

If we want to make this a new feature, I think we have 3 cases to support

  1. Sequentially
  2. Alternate (same behavior as test_dataloader)
  3. Simultaneous (draw from all dataloaders for each batch)

Could we expose those three cases as iterator classes and let the user pick one?

```python
def train_dataloader(self):
    return SimultaneousIterator([dataloader1, dataloader2])
```

Or we add an argument:

```python
trainer = Trainer(train_multiple_dataloader_type='alternate')
```

I would be happy to work on this as soon as we reach a decision :)

This was added here. Closing.

