Pytorch-lightning: refactor len(datasets) call.

Created on 26 Feb 2020 · 4 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Let's minimize len(dataset) calls and make them as late in training as we can (i.e., ideally right before any training loop). This opens up a cleaner path to supporting iterable datasets.

Motivation

Getting the length prematurely calls into the datasets at the wrong time, which often causes double loads.

This is a blocker to #948
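
To illustrate the idea of deferring the length lookup, here is a minimal sketch in plain Python. `LazyLoaderInfo` and `loader_fn` are hypothetical names made up for this example and are not part of pytorch-lightning; a plain list stands in for a sized data loader. The point is only that nothing is built or measured until the length is actually needed, and that it is then computed once.

```python
from functools import cached_property


class LazyLoaderInfo:
    """Hypothetical helper: builds the loader and reads its length only on demand."""

    def __init__(self, loader_fn):
        self._loader_fn = loader_fn  # zero-argument callable that returns a loader
        self.build_count = 0         # how many times loader_fn has been invoked

    @cached_property
    def loader(self):
        # Built on first access only; later accesses reuse the cached object.
        self.build_count += 1
        return self._loader_fn()

    @cached_property
    def num_batches(self):
        # len() is evaluated once, on first access, and cached afterwards.
        return len(self.loader)


# A list of batches stands in for a sized data loader (an assumption for this sketch).
info = LazyLoaderInfo(lambda: [[1, 2], [3, 4], [5]])
built_before = info.build_count  # nothing has been loaded at construction time
n = info.num_batches             # first access builds the loader and reads its length
n_again = info.num_batches       # served from the cache, no second build
```

In a real trainer the first access would sit right before the training loop, so the data is not loaded once for the length and again for iteration.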

enhancement help wanted

All 4 comments

@williamFalcon I'm happy to take a look at this if needed, just let me know :)

Perfect!

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L149

In this function, auto_add_sampler() is always called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L92

And inside, even though the comment says

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L93

what it actually does is create a new PyTorch DataLoader. I think this logic is flawed.

  1. The code doesn't agree with the comment, which is confusing.
  2. The data loader should be a very abstract thing that just returns the next batch. It might also know the size of the dataset. The current implementation makes an assumption about what a data loader is, which I think is unnecessary. For example, any call to loader.batch_size or loader.dataset should be avoided in the default setting, when all we need is to keep iterating over the dataloader. I agree that these may be necessary in more advanced settings, though.

What I suggest is that, in the default setting, we only call len(loader) to determine the size, and only where that is possible.
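
A rough sketch of what that could look like, in plain Python. `run_epoch` and `step_fn` are hypothetical names for illustration, not pytorch-lightning functions. The loop treats the loader as an opaque iterable: it never touches loader.dataset or loader.batch_size, and it falls back to "unknown size" when len() is not supported. A list and a generator stand in for sized and unsized loaders here; as far as I know, a PyTorch DataLoader over an iterable-style dataset without `__len__` also raises TypeError on len(), but treat that as an assumption.

```python
def run_epoch(loader, step_fn):
    """Iterate a loader using only iteration and, if available, len(loader)."""
    try:
        total = len(loader)  # sized loaders report their number of batches
    except TypeError:
        total = None         # unsized loaders: just iterate until exhausted

    seen = 0
    for batch in loader:
        step_fn(batch)
        seen += 1
    return seen, total


collected = []

# Sized stand-in: a list of batches.
sized_result = run_epoch([[1, 2], [3, 4], [5]], collected.append)

# Unsized stand-in: a generator of batches, which has no len().
unsized_result = run_epoch(([i] for i in range(4)), collected.append)
```

The same loop body serves both cases; anything that needs the total (progress bars, validation intervals expressed as a fraction of an epoch) has to cope with `total` being None.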

Ok, have a look at #955 - should fix a few things and make it easy to add support for iterable datasets everywhere
