PyTorch Lightning: Prefetch in LightningDataModule PR

Created on 22 Nov 2020 · 11 Comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

I am thinking of filing a PR for a prefetching feature in the dataloader module, which is a very useful tool for optimizing the data pipeline.

Motivation

As a TensorFlow user, I find the tf.data prefetch feature a useful and easy optimization.

Pitch

Alternatives

Additional context

Option 1

from pytorch_lightning import LightningDataModule
dm = MNISTDataModule(..., prefetch=True)

Option 2

class MNISTDataModule(LightningDataModule):
    .
    .
    .
    @property
    def default_transforms(self):
        if not TORCHVISION_AVAILABLE:
            return None
        if self.normalize:
            mnist_transforms = transform_lib.Compose(
                [transform_lib.ToTensor(), transform_lib.Normalize(mean=(0.5,), std=(0.5,))]
            )
        else:
            mnist_transforms = transform_lib.ToTensor()

        return mnist_transforms

    def optimize(self):
        # proposed API (part of this pitch, not existing Lightning code):
        # hint the data pipeline to prefetch automatically and cache 10 batches
        optimizations = [self.prefetch('AUTO'), self.cache(10)]
        return optimizations

Labels: data / DataModule, enhancement, help wanted


All 11 comments

Hi, could you add more information on how this differs from the way PyTorch DataLoaders work, and more motivation for why such a feature should live in Lightning rather than in PyTorch or a separate library?
Also, please add a prior-work section if you can; I am sure there are libraries out there that do what you need. I propose we first make sure existing methods work well with Lightning, and if they don't, we can see how to integrate them better :)

@awaelchli Hi,

  1. You can check tf.data and look at prefetch there; this feature reduces input-pipeline and graph bottlenecks.
  2. To my knowledge the PyTorch DataLoader does not have prefetch support.
    Below is the link to the discussion "prefetch in pytorch", where
    one of the Facebook AI Research developers answered:
    "there isn't a prefetch option, but you can write a custom Dataset that just loads the entire data on GPU and returns samples from in-memory. In that case you can just use 0 workers in your DataLoader"
    (a minimal sketch of that idea is shown below) :)
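A minimal sketch of that all-on-GPU approach (illustrative only; it assumes a CUDA device is available and that the whole dataset fits in GPU memory, and the class name is made up):

import torch
from torch.utils.data import Dataset, DataLoader

class InMemoryGPUDataset(Dataset):
    # load the entire dataset onto the GPU once; __getitem__ is then just indexing
    def __init__(self, data, targets, device="cuda"):
        self.data = data.to(device)
        self.targets = targets.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# num_workers=0 because the samples already live on the GPU,
# so there is no CPU-side loading to parallelize
dataset = InMemoryGPUDataset(torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=0)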

Wouldn't it cause memory issues if the whole dataset is loaded into memory at once and the data is huge?

which is a very useful tool for optimizing the data pipeline

What kind of optimization, specifically?

one of the Facebook AI Research developers answered:

btw, he is a co-creator of PyTorch :smile:

@rohitgr7 Co-creator, wow!!

  1. Prefetching basically means prefetching 'n' samples from the data pipeline.
    This 'n' is either user-defined or picked automatically based on compute/memory resources.
    Prefetching is done in two ways:
    prefetch(auto) -> the framework picks n for us automatically.
    prefetch(1) -> (recommended) prefetch one sample from the training/val pipeline while the graph computes the current step.

  2. What kind of optimization?
    tensorflow.org says:
    "Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data."

[Figures: "Prefetch" vs. "Naive" input-pipeline timing diagrams]
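For context, that pattern looks roughly like this with the tf.data API (a sketch of the TensorFlow side only, unrelated to Lightning's API):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 28, 28]))
dataset = (
    dataset
    .map(lambda x: x / 255.0, num_parallel_calls=tf.data.experimental.AUTOTUNE)  # preprocessing
    .batch(32)
    .prefetch(1)  # keep one batch ready while the model runs the current step
)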

Thanks :)

@rohitgr7 so, in a nutshell, it doesn't load the entire data at once :)

Prefetching overlaps the preprocessing and model execution of a training step

This is already happening with PyTorch dataloaders. Setting num_workers=x will fork/spawn x processes that load data in parallel into a queue. See here, the section called "Single- and Multi-process Data Loading". I thought you were talking about device transfers?

~Btw, above you point to the wrong figures even though the titles are showing which one is which.~

prefetch(1) -> (recommended) prefetch one sample from the training/val pipeline while the graph computes the current step.

The closest I could find is DataLoader(num_workers=1, prefetch_factor=1); that's pretty much the same, right? src: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader
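A minimal sketch of that setting (note that prefetch_factor is per worker and requires num_workers > 0; the MNIST dataset here is just for illustration):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())
# each worker keeps prefetch_factor batches loaded ahead of time, so the
# preprocessing of upcoming batches overlaps with the current training step
loader = DataLoader(dataset, batch_size=64, num_workers=1, prefetch_factor=1)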

As you can see, I put links to where I got my information from. Please do that as well, so we know where your information and figures come from and can read up on them, thanks. I am not familiar with TF.

DataLoader(num_workers=1, prefetch_factor=1)

TIL

@awaelchli @rohitgr7 Hi,
as @awaelchli pinpointed, there can be two kinds of prefetch: one at the CPU level and another at the GPU level (a GPU-programming optimization).

  1. CPU prefetch: PyTorch already does this (confirmed by re-reading the PyTorch discussions and codebase).
    src: https://discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548
  2. GPU prefetch (device transfers, so we don't spend expensive host-to-device (GPU) transfer time): PyTorch is lacking this, and I don't know whether this feature should be added here; a rough sketch of the idea is shown below.
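A rough sketch of what such GPU-side prefetching could look like (not an existing Lightning or PyTorch feature; loosely following the side-stream prefetcher pattern from NVIDIA's example code, and assuming each batch is a tuple of tensors):

import torch

class CUDAPrefetcher:
    # copy the next batch to the GPU on a side stream while the current batch computes
    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies overlap with compute on the default stream
            # (works best when the DataLoader uses pin_memory=True)
            self.next_batch = [t.to(self.device, non_blocking=True) for t in batch]

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # make the default stream wait until the side-stream copy has finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()
        return batch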

Thanks, we can close this thread if we are not planning to add GPU prefetch support (which doesn't make sense here)!
Nice discussion :)

It is possible to overlap data transfers and model compute with the non_blocking=True option (see https://pytorch.org/docs/stable/notes/cuda.html?highlight=non_blocking, section "Pinned Memory Buffers"). Lightning does this already, but it's not equivalent to a queue.
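A minimal sketch of that option outside Lightning, assuming dataset and model are already defined:

from torch.utils.data import DataLoader

# pin_memory=True places batches in page-locked host memory, which is required
# for the non_blocking copies below to actually run asynchronously
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

for x, y in loader:
    # the host-to-device copy is queued on the CUDA stream and can overlap
    # with CPU-side work such as fetching the next batch
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    out = model(x)  # Lightning performs this transfer for you inside its loop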

Since the bottleneck is often the CPU-side prefetching and processing of data, the transfers to the GPU can often be neglected. Memory pinning and the non_blocking option I just mentioned provide enough flexibility, at least in my experience.
My guess is that this is the reason PyTorch doesn't have any special GPU prefetching logic.

That being said, we are of course always open to new features that remove bottlenecks and get the most out of the hardware :)
If you (or someone else) can come up with a concrete idea and present a proof of concept with benchmarks, so that we can see the benefit of this GPU prefetching on a real example, then I would be more than happy to see and test it myself!

It might also be worth looking into NVIDIA DALI:
https://developer.nvidia.com/DALI

@awaelchli
Will look into it and try to remove the bottlenecks.
Thanks!

