PyTorch Lightning: Prefetch in LightningDataModule PR

Created on 22 Nov 2020 · 11 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

I am thinking of filing a PR for a prefetching feature in the dataloader module, which is a very useful tool for optimizing the data pipeline.

Motivation

As a TensorFlow user, I find the tf.data prefetch feature a useful and easy optimization.

Pitch

Alternatives

Additional context

Option 1

from pytorch_lightning import LightningDataModule
dm = MNISTDataModule(..., prefetch=True)

Option 2

class MNISTDataModule(LightningDataModule):
    .
    .
    .
    @property
    def default_transforms(self):
        if not TORCHVISION_AVAILABLE:
            return None
        if self.normalize:
            mnist_transforms = transform_lib.Compose(
                [transform_lib.ToTensor(), transform_lib.Normalize(mean=(0.5,), std=(0.5,))]
            )
        else:
            mnist_transforms = transform_lib.ToTensor()

        return mnist_transforms

    def optimize(self):
        # proposed API (part of this pitch, not existing Lightning code):
        # hint the data pipeline to prefetch automatically and cache 10 batches
        optimizations = [self.prefetch('AUTO'), self.cache(10)]
        return optimizations

Labels: data / DataModule, enhancement, help wanted


All 11 comments

Hi, could you add more information on how this differs from the way PyTorch DataLoaders work, and more motivation for why such a feature should live in Lightning rather than in PyTorch or a separate library?
Also, please add a prior-work section if you can; I am sure there are libraries out there that do what you need. I propose we first make sure existing methods work well with Lightning, and if they don't, we can see how to integrate them better :)

@awaelchli Hi,

  1. You can check tf.data and look at prefetch there; this feature reduces input-pipeline and graph bottlenecks.
  2. To my knowledge the PyTorch DataLoader does not have prefetch support.
    Below is the link to the discussion "prefetch in pytorch", where
    one of the Facebook AI Research developers answered:
    "there isn't a prefetch option, but you can write a custom Dataset that just loads the entire data on GPU and returns samples from in-memory. In that case you can just use 0 workers in your DataLoader"
    (a minimal sketch of that idea is shown below) :)
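A minimal sketch of that all-on-GPU approach (illustrative only; it assumes a CUDA device is available and that the whole dataset fits in GPU memory, and the class name is made up):

import torch
from torch.utils.data import Dataset, DataLoader

class InMemoryGPUDataset(Dataset):
    # load the entire dataset onto the GPU once; __getitem__ is then just indexing
    def __init__(self, data, targets, device="cuda"):
        self.data = data.to(device)
        self.targets = targets.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# num_workers=0 because the samples already live on the GPU,
# so there is no CPU-side loading to parallelize
dataset = InMemoryGPUDataset(torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=0)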

Wouldn't it cause memory issues if the whole dataset is loaded into memory at once and the data is huge?

which is a very useful tool for optimizing the data pipeline

What kind of optimization, specifically?

one of the Facebook AI Research developers answered:

btw, he is a co-creator of PyTorch :smile:

@rohitgr7 Co-creator, wow!!

  1. Prefetching basically means prefetching 'n' samples from the data pipeline.
    This 'n' is either user-defined or picked automatically based on compute/memory resources.
    Prefetching is done in two ways:
    prefetch(auto) -> the framework picks n for us automatically.
    prefetch(1) -> (recommended) prefetch one sample from the training/val pipeline while the graph computes the current step.

  2. What kind of optimization?
    tensorflow.org says:
    "Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data."

[Figures: "Prefetch" vs. "Naive" input-pipeline timing diagrams]
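For context, that pattern looks roughly like this with the tf.data API (a sketch of the TensorFlow side only, unrelated to Lightning's API):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 28, 28]))
dataset = (
    dataset
    .map(lambda x: x / 255.0, num_parallel_calls=tf.data.experimental.AUTOTUNE)  # preprocessing
    .batch(32)
    .prefetch(1)  # keep one batch ready while the model runs the current step
)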

Thanks :)

@rohitgr7 so, in a nutshell, it doesn't load the entire data at once :)

Prefetching overlaps the preprocessing and model execution of a training step

This is already happening with PyTorch dataloaders. Setting num_workers=x will fork/spawn x processes that load data in parallel into a queue. See here, the section called "Single- and Multi-process Data Loading". I thought you were talking about device transfers?

~Btw, above you point to the wrong figures even though the titles are showing which one is which.~

prefetch(1) -> (recommended) prefetch one sample from the training/val pipeline while the graph computes the current step.

The closest I could find is DataLoader(num_workers=1, prefetch_factor=1); that's pretty much the same, right? src: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader
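A minimal sketch of that setting (note that prefetch_factor is per worker and requires num_workers > 0; the MNIST dataset here is just for illustration):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())
# each worker keeps prefetch_factor batches loaded ahead of time, so the
# preprocessing of upcoming batches overlaps with the current training step
loader = DataLoader(dataset, batch_size=64, num_workers=1, prefetch_factor=1)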

As you can see, I put links to where I got my information from. Please do that as well, so we know where your information and figures come from and can read up on them, thanks. I am not familiar with TF.

DataLoader(num_workers=1, prefetch_factor=1)

TIL

@awaelchli @rohitgr7 Hi,
as @awaelchli pinpointed, there can be two kinds of prefetch: one at the CPU level and another at the GPU level (a GPU-programming optimization).

  1. CPU prefetch: PyTorch already does this (confirmed by re-reading the PyTorch discussions and codebase).
    src: https://discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548
  2. GPU prefetch (device transfers, so we don't spend expensive host-to-device (GPU) transfer time): PyTorch is lacking this, and I don't know whether this feature should be added here; a rough sketch of the idea is shown below.
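A rough sketch of what such GPU-side prefetching could look like (not an existing Lightning or PyTorch feature; loosely following the side-stream prefetcher pattern from NVIDIA's example code, and assuming each batch is a tuple of tensors):

import torch

class CUDAPrefetcher:
    # copy the next batch to the GPU on a side stream while the current batch computes
    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies overlap with compute on the default stream
            # (works best when the DataLoader uses pin_memory=True)
            self.next_batch = [t.to(self.device, non_blocking=True) for t in batch]

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # make the default stream wait until the side-stream copy has finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()
        return batch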

Thanks, we can close this thread if we are not planning to add GPU prefetch support (which doesn't make sense here)!
Nice discussion :)

It is possible to overlap data transfers and model compute with the non_blocking=True option (see https://pytorch.org/docs/stable/notes/cuda.html?highlight=non_blocking, section "Pinned Memory Buffers"). Lightning does this already, but it's not equivalent to a queue.
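A minimal sketch of that option outside Lightning, assuming dataset and model are already defined:

from torch.utils.data import DataLoader

# pin_memory=True places batches in page-locked host memory, which is required
# for the non_blocking copies below to actually run asynchronously
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

for x, y in loader:
    # the host-to-device copy is queued on the CUDA stream and can overlap
    # with CPU-side work such as fetching the next batch
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    out = model(x)  # Lightning performs this transfer for you inside its loop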

Since the bottleneck is often the CPU-side prefetching and processing of data, the transfers to the GPU can often be neglected. Memory pinning and the non_blocking option I just mentioned provide enough flexibility, at least in my experience.
My guess is that this is the reason PyTorch doesn't have any special GPU prefetching logic.

That being said, we are of course always open to new features that remove bottlenecks and get the most out of the hardware :)
If you (or someone else) can come up with a concrete idea and present a proof of concept with benchmarks, so that we can see the benefit of this GPU prefetching on a real example, then I would be more than happy to see and test it myself!

It might also be worth looking into NVIDIA DALI:
https://developer.nvidia.com/DALI

@awaelchli
Will look into it and try to remove the bottlenecks.
Thanks!

