Pytorch-lightning: refactor len(datasets) call.

Created on 26 Feb 2020 · 4 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Let's minimize len(dataset) calls and make them as late in training as we can (i.e. ideally right before a training loop starts). This opens up a cleaner path to supporting iterable datasets.
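To illustrate the idea, here is a minimal sketch of deferring the length lookup until it is actually needed and then caching it. CountingDataset and LazyLength are hypothetical names made up for this example; they are not part of pytorch-lightning, and the real refactor may look quite different.

```python
class CountingDataset:
    """Toy map-style dataset that records how often len() is requested."""

    def __init__(self, items):
        self.items = list(items)
        self.len_calls = 0

    def __len__(self):
        self.len_calls += 1
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


class LazyLength:
    """Hypothetical helper: defers len(dataset) until first use, then caches it."""

    def __init__(self, dataset):
        self._dataset = dataset
        self._cached = None

    def get(self):
        if self._cached is None:
            # The only place where len() touches the dataset.
            self._cached = len(self._dataset)
        return self._cached


dataset = CountingDataset(range(10))

# Setup phase: wrapping the dataset does not ask for its length.
dataset_size = LazyLength(dataset)
calls_after_setup = dataset.len_calls

# Right before the training loop: the length is resolved once and reused.
total = dataset_size.get()
total_again = dataset_size.get()
```

With this shape, any setup code that runs before the loop never touches the dataset's length, and repeated lookups do not call len() on the dataset again.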

Motivation

Getting the length prematurely accesses the dataset at the wrong time, which often causes the data to be loaded twice.

This is a blocker for #948.

enhancement help wanted

williamFalcon

Most helpful comment

Ok, have a look at #955 - should fix a few things and make it easy to add support for iterable datasets everywhere

ethanwharris on 26 Feb 2020

All 4 comments

@williamFalcon I'm happy to take a look at this if needed, just let me know :)

ethanwharris on 26 Feb 2020

Perfect!

williamFalcon on 26 Feb 2020

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L149

In this function, auto_add_sampler() is always called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L92

And inside, even though the comment says

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L93

what it actually does is create a new PyTorch DataLoader. I think this logic is flawed:

1. The code doesn't agree with the comment, which is confusing.
2. The data loader should be a very abstract thing that just returns the next batch. It might also know the size of the dataset. The current implementation makes assumptions about what a data loader is, which I think is unnecessary. For example, any access to loader.batch_size or loader.dataset should be avoided in the default setting, when all we need is to keep iterating over the dataloader. I agree that these may be necessary in more advanced settings.

What I suggest is that, in the default setting, the only such call we make is len(loader), to determine the size where possible.
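A rough sketch of that suggestion, treating the loader as nothing more than an iterable that may or may not define a length. try_len, StreamLoader and run_epoch are hypothetical names for illustration only, and plain Python objects stand in for real data loaders. The assumption here is that a loader without a known length raises TypeError (or NotImplementedError) on len().

```python
from typing import Iterable, Optional, Tuple


def try_len(loader: Iterable) -> Optional[int]:
    """Return len(loader) if the loader defines one, otherwise None."""
    try:
        return len(loader)  # the only attribute-like access on the loader
    except (TypeError, NotImplementedError):
        return None


class StreamLoader:
    """Toy iterable-style loader: yields batches but has no __len__."""

    def __iter__(self):
        for start in range(0, 6, 2):
            yield [start, start + 1]


def run_epoch(loader: Iterable) -> Tuple[Optional[int], int]:
    """Iterate over the loader without touching batch_size or dataset."""
    num_batches = try_len(loader)  # None means "size unknown"
    seen = 0
    for _batch in loader:
        seen += 1
    return num_batches, seen


sized_loader = [[0, 1], [2, 3], [4, 5]]  # a list of batches has a length
sized_result = run_epoch(sized_loader)
stream_result = run_epoch(StreamLoader())
```

The loop itself is identical for both loaders; only features that genuinely need the size (a progress bar total, for instance) would have to handle the None case.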