Pytorch-lightning: Extend docs with multiple dataloader with common cases

Created on 8 Mar 2020 · 18 comments · Source: PyTorchLightning/pytorch-lightning

I notice that one can evaluate the model on a list of validation/test data loaders. Is it also possible to extract data from multiple train dataloaders in the training step in the current version? This feature might be useful in tasks like transfer learning or semi-supervised learning, which usually maintain multiple datasets during training (e.g., source and target datasets in transfer learning, labeled and unlabeled datasets in semi-supervised learning).

It would be nice if one could obtain a list of batch data as follows:

```python
def training_step(self, batch_list, batch_nb_list):
    # batch_list = [batch_1, batch_2]
    x_1, y_1 = batch_list[0]
    x_2, y_2 = batch_list[1]
    loss = self.compute_some_loss(x_1, x_2, y_1, y_2)
    tensorboard_logs = {'train_loss': loss}
    return {'loss': loss, 'log': tensorboard_logs}

def train_dataloader(self):
    return [data_loader_1, data_loader_2]
```
Labels: enhancement, good first issue, question

Most helpful comment

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?

All 18 comments

Hi! thanks for your contribution!, great first issue!

Good point, having support also for multiple training dataloaders would be great, mind sending a PR?
just be aware that there is another open PR on dataloaders... #1104
cc: @PyTorchLightning/core-contributors

I'm interested in this task, but I have some questions.

  1. Do we assume the data loaders are of the same length? What should we do if one of them runs out of data?
  2. How long would an epoch be? The length of the shortest data loader?
  3. Would a more sensible design be:

```python
def training_step(self, batch, batch_idx: int, dataloader_idx: int):
    if dataloader_idx == 0:
        # supervised loss, for example
        ...
    elif dataloader_idx == 1:
        # unsupervised loss
        ...
```

Thanks for all the replies.

To @Dref360,

  1. I think that allowing the data loaders to have different lengths is more flexible, and each data loader can have its own batch size. In my opinion, a loader can simply reload its dataset after running out of data, so it doesn't depend on the other data loaders (see the sketch after this list).

  2. My previous experience is to use the length of the longest data loader (so the shorter loaders repeat within an epoch). But this needs more discussion.
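
For illustration, here is a minimal plain-PyTorch sketch of that reload-on-exhaustion idea (the loader names and dataset sizes are made up for the example): the epoch follows the longer loader, and the shorter one restarts whenever it runs out.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def cycle(loader):
    # restart the loader whenever it is exhausted, so it re-shuffles on every pass
    # (unlike itertools.cycle, which would cache the first pass in memory)
    while True:
        for batch in loader:
            yield batch


# hypothetical small (labeled) and large (unlabeled) datasets, just for the example
labeled_loader = DataLoader(TensorDataset(torch.arange(10)), batch_size=2, shuffle=True)
unlabeled_loader = DataLoader(TensorDataset(torch.arange(100)), batch_size=8, shuffle=True)

# one "epoch" is driven by the longer loader; the shorter one repeats as needed
for (unlabeled_x,), (labeled_x,) in zip(unlabeled_loader, cycle(labeled_loader)):
    print(labeled_x, unlabeled_x)
```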

I found a related discussion here. The first reply provided a solution for multiple datasets using torch.utils.data.Dataset. However, it assumes that the lengths of the data loaders are the same and that the index correspondence between datasets is fixed.

Therefore, I modified the provided code to be more flexible, as follows:

```python
import random

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, datasets):
        self.datasets = datasets

        self.map_indexes = [[] for _ in self.datasets]

        self.min_length = min(len(d) for d in self.datasets)
        self.max_length = max(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[m[i]] for d, m in zip(self.datasets, self.map_indexes))

    def construct_map_index(self):
        def update_indices(original_indexes, target_len, max_len):
            # map max_len to target_len (large to small)
            # return: a list that maps range(max_len) to valid indexes in the dataset

            original_indexes = original_indexes[max_len:]  # remove used indexes
            fill_num = max_len - len(original_indexes)
            batch = fill_num // target_len

            if fill_num % target_len != 0:
                # so that fill_num + len(original_indexes) is at least max_len
                batch += 1

            additional_indexes = list(range(target_len)) * batch
            random.shuffle(additional_indexes)

            original_indexes += additional_indexes

            assert len(original_indexes) >= max_len, "the length of mapping indexes is too small"

            return original_indexes

        self.map_indexes = [update_indices(m, len(d), self.max_length)
                            for m, d in zip(self.map_indexes, self.datasets)]

    def __len__(self):
        # will be called every epoch
        self.construct_map_index()
        return self.max_length
```

In this case, the index range of the CustomDataset is set to the length of the largest dataset. Therefore, some indexes might not be valid for some datasets. construct_map_index builds lists that map these excess indexes to available indexes, and the mapping is refreshed whenever self.__len__() is called.

Construct a single train loader using CustomDataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset_1 = TensorDataset(torch.arange(2))
dataset_2 = TensorDataset(torch.arange(3, 8))

dataset = CustomDataset([dataset_1, dataset_2])

dataloader = DataLoader(dataset, batch_size=3, shuffle=True)

for epoch in range(3):
    for batch in dataloader:
        print(batch)
```

Outputs

```
[[tensor([1, 1, 1])], [tensor([4, 7, 6])]]
[[tensor([0, 0])], [tensor([3, 5])]]

[[tensor([0, 0, 1])], [tensor([7, 3, 4])]]
[[tensor([1, 0])], [tensor([5, 6])]]

[[tensor([0, 0, 1])], [tensor([5, 7, 4])]]
[[tensor([1, 1])], [tensor([6, 3])]]
```

The primary deficiency of this code is that all datasets end up with the same batch size, and it might be a bit hard for users to read. I hope this is helpful for developing the feature!

@williamFalcon @tullie pls ^^

  1. in this case a custom dataloader that has two datasets in it is probably the best thing.

  2. if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

  3. in the case of your own dataloader, you can just cycle through the smallest dataset multiple times while cycling the large one (a minimal sketch of this follows below).
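
For reference, a bare-bones sketch of points 1 and 3 (the class name PairedDataset is invented for this sketch; it wraps indexes of the smaller dataset modulo its length, without the per-epoch reshuffling that the CustomDataset above adds):

```python
from torch.utils.data import Dataset


class PairedDataset(Dataset):
    """Pair every item of the larger dataset with a (wrapped-around) item of the smaller one."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b

    def __len__(self):
        # one epoch covers the larger dataset once; the smaller one is cycled
        return max(len(self.dataset_a), len(self.dataset_b))

    def __getitem__(self, index):
        return (self.dataset_a[index % len(self.dataset_a)],
                self.dataset_b[index % len(self.dataset_b)])
```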


Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

why don't we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? I'm just feeling a bit puzzled...

why don't we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

I totally agree with the idea about the new doc for data loaders.

Agreed that in this case the custom dataloader with two datasets seems best. Pytorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? I'm just feeling a bit puzzled...

Is it because we would like to extract data from multiple datasets simultaneously in the training phase, while we usually loop over the datasets sequentially in the validation/testing phase (as in the evaluation step)?

exactly. i could be wrong, but in training we usually want to use both batches at once. in val/test we use them sequentially

if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

In semi-supervised learning, domain adaptation, consistency training, etc., it is typical to use samples from different loaders in the same training step to compute various cross-losses. Thus, alternating behaviour of the training step does not bring much usability improvement.
I understand that it is possible to shift the issue one step back and implement a custom Dataset and/or Sampler for such cases, but in my experience having multiple dataloaders is just more explicit and convenient.

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?
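
As a rough illustration of that comment (purely hypothetical, not the actual Lightning API; the CombinedLoader name and the mode argument are invented for this sketch), a wrapper could either truncate to the shortest loader or keep looping the shorter ones:

```python
class CombinedLoader:
    # hypothetical wrapper: mode="min" truncates to the shortest loader,
    # mode="max" keeps cycling the shorter loaders until the longest one finishes
    def __init__(self, loaders, mode="min"):
        self.loaders = loaders
        self.mode = mode

    def __len__(self):
        lengths = [len(loader) for loader in self.loaders]
        return min(lengths) if self.mode == "min" else max(lengths)

    def __iter__(self):
        if self.mode == "min":
            # zip stops as soon as the shortest loader is exhausted
            yield from zip(*self.loaders)
        else:
            iterators = [iter(loader) for loader in self.loaders]
            for _ in range(len(self)):
                batches = []
                for i, it in enumerate(iterators):
                    try:
                        batches.append(next(it))
                    except StopIteration:
                        # restart an exhausted (shorter) loader
                        iterators[i] = iter(self.loaders[i])
                        batches.append(next(iterators[i]))
                yield tuple(batches)
```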

A quick fix to get different batch sizes on the labeled and unlabeled dataloaders during training might be:
```python
def prepare_data(self):
    ...
    self.train_unlabeled_dataloader = torch.utils.data.DataLoader(train_unlabeled_dataset, ...)
    self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
    ...

def training_step(self, batch, batch_idx):
    inputs_x, targets = batch
    try:
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    except StopIteration:
        self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    unlabeled_x = unlabeled_x.type_as(inputs_x)
    ...
```
But as @soupault said, it will be much more convenient to have multiple train dataloaders.

In our active learning library baal, we are currently trying to come up with a solution to the same problem. In our case, one of the DataLoaders will be massively larger than the other. As a consequence, we added some optional features:

  • We put a probability of selecting data loader A vs B.
  • We set a maximum number of steps; otherwise, we stop when the smallest iterator is exhausted. This assumes that both loaders are using random selection.

Those two features are optional; if they are not provided, we simply alternate between the two loaders.

We provide an implementation in this gist: https://gist.github.com/Dref360/2524e524244569ed47428f19c487f264

I would appreciate your feedback! Thank you!
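
For context, a rough sketch of the sampling scheme described above (this is not the gist's actual implementation; the function name and arguments are invented for illustration):

```python
import random


def alternating_batches(loader_a, loader_b, p_a=None, max_steps=None):
    # if p_a is given, pick loader A with probability p_a at each step;
    # otherwise strictly alternate. Stops after max_steps, or as soon as
    # the selected iterator is exhausted.
    iter_a, iter_b = iter(loader_a), iter(loader_b)
    step = 0
    while max_steps is None or step < max_steps:
        use_a = random.random() < p_a if p_a is not None else step % 2 == 0
        try:
            yield next(iter_a) if use_a else next(iter_b)
        except StopIteration:
            return
        step += 1
```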

I see that https://github.com/PyTorchLightning/pytorch-lightning/pull/1416 has been merged. Should we close this as well?

If we want to make this a new feature, I think we have 3 cases to support

  1. Sequentially
  2. Alternate (same behavior as test_dataloader)
  3. Simultaneous (draw from all dataloaders for each batch)

Could we expose those three cases as iterator classes and let the user pick one?

```python
def train_dataloader(self):
    return SimultaneousIterator([dataloader1, dataloader2])
```

Or we add an argument:

```python
trainer = Trainer(train_multiple_dataloader_type='alternate')
```

I would be happy to work on this as soon as we reach a decision :)

This was added here. Closing.

