Pytorch-lightning: Implement Asynchronous GPU transfer and Training with Multithreading

Created on 7 Apr 2020 · 10 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Asynchronous GPU transfer can be achieved by using pinned memory together with multithreading.
Minimal example code:
https://github.com/HenryJia/Lighter/blob/master/lighter/train/loaders.py
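
A rough sketch of the idea (a simplified approximation of the linked loaders.py, not the actual code; the class and argument names here are made up): a background thread copies each batch to the GPU on a side CUDA stream while the main thread trains on the previous batch.

import queue
import threading

import torch


class AsynchronousLoader:
    # Wraps a DataLoader and moves each batch to `device` in a background
    # thread on a side CUDA stream, so the main thread can train on the
    # previous batch while the next one is being transferred.
    # Assumes each batch is a tuple/list of tensors.
    def __init__(self, dataloader, device, queue_size=2):
        self.dataloader = dataloader
        self.device = device
        self.queue = queue.Queue(maxsize=queue_size)
        self.stream = torch.cuda.Stream(device)  # side stream for the copies

    def _worker(self):
        for batch in self.dataloader:
            with torch.cuda.stream(self.stream):
                # pin_memory=True on the wrapped DataLoader makes these copies truly asynchronous
                batch = [t.to(self.device, non_blocking=True) for t in batch]
            # make sure the copy has finished before handing the batch over
            self.stream.synchronize()
            self.queue.put(batch)
        self.queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        threading.Thread(target=self._worker, daemon=True).start()
        while True:
            batch = self.queue.get()
            if batch is None:
                return
            yield batch

    def __len__(self):
        return len(self.dataloader)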

Motivation

Parallelising GPU transfer and training will cut down the time the GPU spends stuck waiting for data from the CPU:
https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

Pitch

Everyone likes faster training and maximal GPU utilisation.

Alternatives

Not Applicable

Additional context

None

enhancement help wanted won't fix

Most helpful comment

If you set non_blocking=True and pin_memory=True it allows asynchronous GPU transfer, but the transfer will still block if you need the data immediately: non_blocking only means that .to() returns before the copy is done. That's why you still need a queue and threading.

All 10 comments

Hi! Thanks for your contribution, great first issue!

@PyTorchLightning/core-contributors do we want to bring this in?

Is it the same as using the argument non_blocking=True in the .to(device) method? It is used in PyTorch's ImageNet example: https://github.com/pytorch/examples/tree/master/imagenet

E.g.:

for batch_idx, (inputs, targets) in enumerate(train_loader):
    inputs, targets = inputs.to(device, non_blocking=True), targets.to(device, non_blocking=True)

Edit: this is the particular line that would need to change:
https://github.com/PyTorchLightning/pytorch-lightning/blob/fdb61cb854f6e624c4a0670f125a8e3ebaaf1571/pytorch_lightning/trainer/distrib_parts.py#L439

From PyTorch's forum, it does indeed seem to refer to the same functionality (https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/4?u=sebastienwood)

If you set non_blocking=True and pin_memory=True it allows asynchronous GPU transfer, but the transfer will still block if you need the data immediately: non_blocking only means that .to() returns before the copy is done. That's why you still need a queue and threading.
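
To illustrate that point, here's a sketch (the model and loader are made-up stand-ins, and it needs a CUDA device): the .to() calls return immediately, but the forward pass that consumes the data is ordered after the copies, so the loop still alternates between transferring and training.

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # pinned host memory
model = torch.nn.Linear(128, 10).to(device)

for inputs, targets in loader:
    # .to() returns immediately; the copies run on the GPU's copy engine in the background
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ...but the forward pass needs the data now, so it is queued after the copies
    # anyway -- the CPU still alternates "transfer, train, transfer, train"
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)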

I'm not sure we want this, to be honest. This may introduce some race conditions. And if the GPU is busy, outsourcing the transfer to another CUDA stream will not speed things up either (and usually it's the case that the CPU waits for the GPU). I also remember the advice from the PyTorch developers to avoid using CUDA streams manually when it's not essential.

Can you maybe provide a small benchmark script so that we can see the potential benefit?

A side note from reading your code: does a threading Queue also involve pickling? If it does, you might want to switch to torch.multiprocessing's queue, since it bypasses that via shared memory. On the other hand, I really want to avoid extra queues, as they usually slow things down.

This is on a similar note to DALI, right?
https://towardsdatascience.com/nvidia-dali-speeding-up-pytorch-876c80182440
We already had initiatives to add it: #791 #789 #513 #1316

@justusschock
No, it doesn't introduce any race conditions; I think you misunderstood what the code does.
With PyTorch's bog-standard dataloader you always have to wait on the CPU, as you end up with
.to() -> train(...) -> .to() -> train(...) ->...

This happens even if you use non_blocking=True, because you need the data immediately, so it will block anyway.

My code simply sticks the .to() in a separate thread and stream so that the GPU isn't left waiting for data.
There are two threads now: one just does
.to() -> queue.put() -> .to() -> queue.put() -> ...
and the main training thread does
queue.get() -> train(...) -> queue.get() -> train(...) -> ...

The only case in which the main thread would be bottlenecked by the loading thread is if the batches are so large that we're bottlenecked by the normal dataloader or the PCIe bus anyway. The queue put and get take no time, since only a handle referencing the batch (which now already lives on the GPU) is stored in the queue.

A threading Queue does not involve pickling; it's simply a thread-safe wrapper around a deque (it uses a deque to actually store everything). We're using threads, so there's no need to pickle and pass objects around; they all live in the same shared memory space on the CPU anyway.
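
A quick way to convince yourself of that (an illustrative check, not from the linked code):

import queue

import torch

q = queue.Queue()
t = torch.randn(4, 4)
q.put(t)
assert q.get() is t  # the very same object comes back -- nothing was pickled or copied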

On small datasets (size-wise, not sample-wise) like MNIST, it's slightly faster. My asynchronous loader takes 151 seconds for 100 epochs of training and validation, versus PyTorch's bog-standard dataloader, which takes 158 seconds. Code: https://gist.github.com/HenryJia/930916775c11bc5c6debb87c046965e5

On larger datasets there's more of an effect. This code, which generates 256 MB batches of random numbers to load and multiply, takes 17.8 s on my machine to complete with the AsynchronousLoader and 22.1 s without it: https://gist.github.com/HenryJia/17e3a647cc2da1dd0ceeb6365bdfeaac
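
For reference, a sketch of the kind of benchmark those gists implement (not the gists themselves; sizes and names here are arbitrary), comparing a plain DataLoader loop against the AsynchronousLoader-style wrapper sketched earlier:

import time

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
data = torch.randn(64, 4, 1024, 1024)  # 64 batches of ~16 MB each; scale up to taste
loader = DataLoader(TensorDataset(data), batch_size=1, pin_memory=True)

def run(batches):
    start = time.perf_counter()
    for (x,) in batches:
        x = x.to(device, non_blocking=True)
        y = x * 2  # trivial GPU work standing in for a training step
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return time.perf_counter() - start

print("plain DataLoader:   ", run(loader))
# print("AsynchronousLoader: ", run(AsynchronousLoader(loader, device)))  # wrapper sketched above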

I am aware of DALI, and I believe it does similar things and has a bit more functionality, although I personally think it's a little bulky. I wrote mine to be as minimalistic as I could and to work exactly like a PyTorch dataloader in terms of interface.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Closing this issue because I've already added it to bolts. Sorry, I forgot about it for a bit.
