Pytorch-lightning: Implement Asynchronous GPU transfer and Training with Multithreading

Created on 7 Apr 2020 · 10 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Asynchronous GPU transfer can be achieved by using pinned memory together with multithreading.
Minimal example code:
https://github.com/HenryJia/Lighter/blob/master/lighter/train/loaders.py
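
A rough sketch of the idea (a simplified approximation of the linked loaders.py, not the actual code; the class and argument names here are made up): a background thread copies each batch to the GPU on a side CUDA stream while the main thread trains on the previous batch.

import queue
import threading

import torch


class AsynchronousLoader:
    # Wraps a DataLoader and moves each batch to `device` in a background
    # thread on a side CUDA stream, so the main thread can train on the
    # previous batch while the next one is being transferred.
    # Assumes each batch is a tuple/list of tensors.
    def __init__(self, dataloader, device, queue_size=2):
        self.dataloader = dataloader
        self.device = device
        self.queue = queue.Queue(maxsize=queue_size)
        self.stream = torch.cuda.Stream(device)  # side stream for the copies

    def _worker(self):
        for batch in self.dataloader:
            with torch.cuda.stream(self.stream):
                # pin_memory=True on the wrapped DataLoader makes these copies truly asynchronous
                batch = [t.to(self.device, non_blocking=True) for t in batch]
            # make sure the copy has finished before handing the batch over
            self.stream.synchronize()
            self.queue.put(batch)
        self.queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        threading.Thread(target=self._worker, daemon=True).start()
        while True:
            batch = self.queue.get()
            if batch is None:
                return
            yield batch

    def __len__(self):
        return len(self.dataloader)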

Motivation

Parallelising GPU transfer and training will cut down the time the GPU spends stuck waiting for data from the CPU:
https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

Pitch

Everyone likes faster training and maximal GPU utilisation.

Alternatives

Not Applicable

Additional context

None

enhancement help wanted won't fix

Most helpful comment

If you set non_blocking=True and pin_memory=True it allows asynchronous GPU transfer, but the transfer will still block if you need the data immediately: non_blocking only means that .to() returns before the copy is done. That's why you still need a queue and threading.

All 10 comments

Hi! Thanks for your contribution, great first issue!

@PyTorchLightning/core-contributors do we want to bring this in?

Is it the same as using the argument non_blocking=True in the .to(device) method? It is used in PyTorch's ImageNet example: https://github.com/pytorch/examples/tree/master/imagenet

E.g.:

for batch_idx, (inputs, targets) in enumerate(train_loader):
    inputs, targets = inputs.to(device, non_blocking=True), targets.to(device, non_blocking=True)

Edit: this is the particular line that would need to change:
https://github.com/PyTorchLightning/pytorch-lightning/blob/fdb61cb854f6e624c4a0670f125a8e3ebaaf1571/pytorch_lightning/trainer/distrib_parts.py#L439

From PyTorch's forum, it does indeed seem to refer to the same functionality (https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/4?u=sebastienwood)

If you set non_blocking=True and pin_memory=True it allows asynchronous GPU transfer, but the transfer will still block if you need the data immediately: non_blocking only means that .to() returns before the copy is done. That's why you still need a queue and threading.
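
To illustrate that point, here's a sketch (the model and loader are made-up stand-ins, and it needs a CUDA device): the .to() calls return immediately, but the forward pass that consumes the data is ordered after the copies, so the loop still alternates between transferring and training.

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # pinned host memory
model = torch.nn.Linear(128, 10).to(device)

for inputs, targets in loader:
    # .to() returns immediately; the copies run on the GPU's copy engine in the background
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ...but the forward pass needs the data now, so it is queued after the copies
    # anyway -- the CPU still alternates "transfer, train, transfer, train"
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)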

I'm not sure we want this, to be honest. This may introduce some race conditions. And if the GPU is busy, outsourcing the transfer to another CUDA stream will not speed things up either (and usually it's the case that the CPU waits for the GPU). I also remember the advice from the PyTorch developers to avoid using CUDA streams manually when it's not essential.

Can you maybe provide a small benchmark script so that we can see the potential benefit?

A side note from reading your code: does a threading Queue also involve pickling? If it does, you might want to switch to torch.multiprocessing's queue, since it bypasses that via shared memory. On the other hand, I really want to avoid extra queues, as they usually slow things down.

This is on a similar note to DALI, right?
https://towardsdatascience.com/nvidia-dali-speeding-up-pytorch-876c80182440
We already had initiatives to add it: #791 #789 #513 #1316

@justusschock
No, it doesn't introduce any race conditions; I think you misunderstood what the code does.
With PyTorch's bog-standard dataloader you always have to wait on the CPU, as you end up with
.to() -> train(...) -> .to() -> train(...) ->...

This happens even if you use non_blocking=True, because you need the data immediately, so it will block anyway.

My code simply sticks the .to() in a separate thread and stream so that the GPU isn't left waiting for data.
There are two threads now: one just does
.to() -> queue.put() -> .to() -> queue.put() -> ...
and the main training thread does
queue.get() -> train(...) -> queue.get() -> train(...) -> ...

The only case in which the main thread would be bottlenecked by the loading thread is if the batches are so large that we're bottlenecked by the normal dataloader or the PCIe bus anyway. The queue put and get take no time, since only a handle referencing the batch (which now already lives on the GPU) is stored in the queue.

A threading Queue does not involve pickling; it's simply a thread-safe wrapper around a deque (it uses a deque to actually store everything). We're using threads, so there's no need to pickle and pass objects around; they all live in the same shared memory space on the CPU anyway.
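
A quick way to convince yourself of that (an illustrative check, not from the linked code):

import queue

import torch

q = queue.Queue()
t = torch.randn(4, 4)
q.put(t)
assert q.get() is t  # the very same object comes back -- nothing was pickled or copied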

On small datasets (size-wise, not sample-wise) like MNIST, it's slightly faster. My asynchronous loader takes 151 seconds for 100 epochs of training and validation, versus PyTorch's bog-standard dataloader, which takes 158 seconds. Code: https://gist.github.com/HenryJia/930916775c11bc5c6debb87c046965e5

On larger datasets there's more of an effect. This code, which generates 256 MB batches of random numbers to load and multiply, takes 17.8 s on my machine to complete with the AsynchronousLoader and 22.1 s without it: https://gist.github.com/HenryJia/17e3a647cc2da1dd0ceeb6365bdfeaac
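
For reference, a sketch of the kind of benchmark those gists implement (not the gists themselves; sizes and names here are arbitrary), comparing a plain DataLoader loop against the AsynchronousLoader-style wrapper sketched earlier:

import time

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
data = torch.randn(64, 4, 1024, 1024)  # 64 batches of ~16 MB each; scale up to taste
loader = DataLoader(TensorDataset(data), batch_size=1, pin_memory=True)

def run(batches):
    start = time.perf_counter()
    for (x,) in batches:
        x = x.to(device, non_blocking=True)
        y = x * 2  # trivial GPU work standing in for a training step
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return time.perf_counter() - start

print("plain DataLoader:   ", run(loader))
# print("AsynchronousLoader: ", run(AsynchronousLoader(loader, device)))  # wrapper sketched above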

I am aware of DALI, and I believe it does similar things and has a bit more functionality, although I personally think it's a little bulky. I wrote mine to be as minimalistic as I could and to work exactly like a PyTorch dataloader in terms of interface.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Closing this issue because I've already added it to bolts. Sorry, I forgot about it for a bit.
