I am running GPU training, but it is not much faster than CPU training. My batch_size is 1024. I see that the model's datasets are on the CPU, and so are the model parameters. Why?
How can I preload them all onto the GPU for faster training? They will fit in GPU memory.
I see 18 it/sec on CPU and 30 it/sec on GPU
for i, (x, y, d) in enumerate(model.train_dataloader()):
    print(x.device)
    print(d.device)
    break
cpu
cpu
for p in model.parameters():
    # p.requires_grad: bool
    # p.data: Tensor
    print(p.device)
cpu
...
I have tried applying .to(device) on the dataset tensors.
Unfortunately, it is not obvious to me from the PyTorch Lightning docs how to debug whether something is on the GPU or not.
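For reference, this is the kind of thing I mean by applying .to(device); the shapes here are made up for illustration:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 10)
x = x.to(device)   # .to() returns a copy on the target device; it does not modify x in place
print(x.device)    # cuda:0 when a GPU is available, otherwise cpu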
can you put a colab showing it does NOT use GPUs?
we run tests on GPUs on every PR, all our examples use GPUs and we all work on GPUs daily...
it's very likely something about your code, or maybe you haven't installed the GPU build of PyTorch.
If you can replicate this on colab we can look at this.
In the meantime, please follow the guide to clean up your code (https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#preparing-your-code)
So I have tried this on Colab, with and without GPUs. I have also tried it on an AWS GPU instance, with and without GPUs. Both give similar results: only a 2x increase, and everything shows the model params and data on the CPU.
@williamFalcon here is a minimal example.
https://colab.research.google.com/drive/1RA1Ny2wyVzzPIyvWdjUm41Pm4SUR31_C?usp=sharing
You can see that both CPU and GPU get the same speed. And the model parameters and dataset are on the CPU.
I read the instructions here https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#preparing-your-code
and have done everything except for "Init tensors using type_as". I didn't see that in the single GPU doc, and couldn't figure out how to do it or find any more documentation about it. If that is what is necessary I am happy to do it.
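For what it's worth, my rough understanding of the type_as pattern is the sketch below (shapes made up); please correct me if I have it wrong:

import torch

# x stands in for a batch that Lightning has already moved to the right device.
x = torch.randn(8, 3, device="cuda" if torch.cuda.is_available() else "cpu")

# A tensor created inside the module defaults to the CPU; .type_as(x) casts it
# to x's dtype and moves it to x's device, so it follows the batch automatically.
mask = torch.zeros(8, 3).type_as(x)
print(mask.device)  # cuda:0 when a GPU is available, otherwise cpu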
Besides this specific example, one thing I am missing from the PL docs is just some simple instructions on how to verify if GPU training is actually happening as desired. Is my diagnostic code:
for i, (x, y, d) in enumerate(model_gpu.datatrain):
    print(x.device)
    print(y.device)
    print(d.device)
    break

for p in model.parameters():
    print(p.device)
the correct way to do it? Or something else?
What is probably happening (I couldn't access your minimal example, permission issues) is that your example is too simple and bottlenecked by I/O.
The GPU really speeds things up when it doesn't have to wait for HDD reads.
PL automatically uses the GPU if you give gpus=1 to the Trainer. One way to verify is to run nvidia-smi in a terminal (which is harder in Google Colab), or to print .device (as you showed), which should say "cuda".
There are also ways to monitor GPU usage with the nvidia-ml-py3 library, but I think that is a bit overkill.
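If it helps, here is a minimal, self-contained sketch (model, sizes, and names are made up) that prints the devices from inside training_step, after PL has already moved everything:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class DeviceCheckModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        if batch_idx == 0:
            # Inside the step, PL has already moved the batch and the model;
            # both should print cuda:0 when gpus=1 is passed to the Trainer.
            print("batch:", x.device, "params:", next(self.parameters()).device)
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        x, y = torch.randn(1024, 10), torch.randn(1024, 1)
        return DataLoader(TensorDataset(x, y), batch_size=256)

# trainer = pl.Trainer(gpus=1, max_epochs=1)
# trainer.fit(DeviceCheckModel())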
@dscarmo strange that you cannot see my colab. Anyway, here is that notebook exported as a gist:
https://gist.github.com/turian/f3ea99f6f495f4e6bd45c8a89d23552c
As you can see, I do both CPU and GPU training. GPU training is working, as evidenced by GPU available: True, used: True. But the speed is the same, and it looks like the model params and data are on the CPU device. So if they are being transferred to the GPU on every batch, yeah, that would be slow.
What I would expect to happen is that there would be a loading time at the beginning onto the GPU and then training would be very fast.
@dscarmo
Yeah, so I am also trying to run it on an AWS GPU instance. It doesn't seem to utilize the GPU very much.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    72W / 149W |    332MiB / 11441MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7005      C   python3                                      321MiB |
+-----------------------------------------------------------------------------+
But it is used somewhat.
The fact that if I run
for i, (x, y, d) in enumerate(model_gpu.datatrain):
    print(x.device)
    print(y.device)
    print(d.device)
    break

for p in model_gpu.parameters():
    print(p.device)
prints
cpu
cpu
cpu
...
suggests to me that the model parameters and the data are not on the GPU, and that it keeps transferring during training.
Interesting, if I run the .device code after trainer_gpu.fit(model_gpu)
I get:
cpu
cpu
cpu
cuda:0
cuda:0
cuda:0
cuda:0
which suggests that the parameters are now on the GPU, but the datasets are still on the CPU.
i'm not sure what the goal here is haha. if it's to verify that it runs on GPUs then use nvidia-smi. if it's to measure speed up vs CPU then use a large dataset.
for a small dataset you won't see a difference
Calm down, there is some confusion going on here. Try to edit your comments instead of posting new ones. I will try to write something up, but you should study how deep learning training works, look up exactly what a batch and an epoch are, and why the DataLoader is necessary.
The DataLoader is a class that interfaces with the Dataset, which implements reading some kind of data from "the CPU" (the hard drive) in an optimized way. This relegates data processing (such as transforms) to the CPU, while the GPU trains on the previous batch.
It's possible, but not common, to have everything inside the GPU; your dataset would have to be pretty small for that to work.
Besides, you don't want to do ALL operations on the GPU, like data manipulation before the batch forward. That's why dataloaders are implemented the way they are. If you do it properly, your GPU will be pegged at 100% usage while the dataloader workers fetch the next batch in the background, as sketched below. So yes, depending on WHERE you check .device, it might be on the CPU.
Your data and model should, however, be on cuda:0 in the training step, for example.
In your gist example, you are checking the parameters before the trainer has started. They are still initialized on the CPU. PL will move the model and the dataloader output to the GPU at the start of training. That's why the model is on cuda:0 after training.
The small gain in speed is probably because your model is very small (1k parameters). GPUs are optimized for parallel work over large batches of data and/or large models. At this small size, the speed is probably bottlenecked by other operations such as I/O.
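To make the worker idea concrete, this is roughly the shape of it (the toy in-memory dataset stands in for one that reads from disk or applies transforms in __getitem__):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100_000, 10), torch.randn(100_000, 1))

loader = DataLoader(
    dataset,
    batch_size=1024,
    shuffle=True,
    num_workers=4,    # worker processes prepare the next batches on the CPU
    pin_memory=True,  # page-locked host memory speeds up the CPU-to-GPU copy
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # In a hand-written loop you would move the batch here; with Lightning,
    # the Trainer does this per-batch transfer for you.
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    break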
@dscarmo Thank you for being patient and helping me. I apologize for not editing my comments, I'll do it in the future. I am familiar with deep learning; if you Google me you will see I am a co-author on the original Theano paper :) What I am not familiar with is pytorch lightning and more recent tooling. But I must say that you are shipping a really beautiful product and I wish we had these tools ten years ago.
So a few details:
When I ask inside the forward() method where the datasets are, it says CPU. So the issue I am trying to resolve is why my data is not on cuda:0 in the training stage.
If your dataset fits entirely on the GPU, use batch_size = dataset_size and the dataloader will make one big batch containing the whole dataset for the model to forward over, which means an epoch is done in only one batch (see the sketch below).
As I said, the dataset going from CPU to GPU is by DataLoader design, to use CPU power between batches. Increase num_workers on the dataloader to get parallel loading.
The data not being on CUDA in the step is weird. I just tested as a sanity check, and the batch definitely gets to the step as a CUDA tensor.
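Something like this (toy numbers) is what I mean by the whole dataset as a single batch:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(50_000, 10), torch.randn(50_000, 1))

# One batch == the whole dataset: an epoch is a single forward/backward pass,
# with only one CPU-to-GPU transfer per epoch.
loader = DataLoader(dataset, batch_size=len(dataset), shuffle=False)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([50000, 10])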
Okay, so my understanding of the behavior of PyTorch Lightning now (I don't think this is documented) is that each batch will be loaded onto the GPU from the CPU, and then the training step will run on the GPU.
If your dataset fits entirely on the GPU, use batch_size = dataset_size and the dataloader will make one big batch containing the whole dataset for the model to forward over, which means an epoch is done in only one batch.
It seems that it still reloads the data onto the GPU every epoch. I have tested this, and there is a pause between each epoch, which I suspect is the CPU-to-GPU data load. Then the epoch itself happens instantaneously.
The data not being on CUDA in the step is weird. I just tested as a sanity check, and the batch definitely gets to the step as a CUDA tensor.
Given my understanding, what I meant is that the original dataset is not on the GPU during the step. If I look at the batch data, they are on the GPU during the step.
My question remains: if my dataset fits in GPU memory, how can I avoid copying it from CPU to GPU every epoch?
I see what you want now. In this case, try saving your whole dataset's data as GPU tensors inside the dataset object when you create it (in the dataset's __init__). Return them as GPU tensors from the get-go and you will never move them.
Looking at your example, do:
self.dist = torch.Tensor(dist).float().cuda()
self.X1 = torch.Tensor(X1).float().cuda()
self.X2 = torch.Tensor(X2).float().cuda()
Note that from then on, any operation over these tensors has to be done with CUDA tensors.
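Put together, a dataset along these lines would look roughly like the sketch below (field names borrowed from your example, everything else made up):

import torch
from torch.utils.data import Dataset

class PreloadedGPUDataset(Dataset):
    """Keeps the whole dataset on the GPU, so nothing is copied per batch or per epoch."""

    def __init__(self, X1, X2, dist, device="cuda"):
        # One-time move at construction; this requires the data to fit in GPU memory.
        self.X1 = torch.as_tensor(X1).float().to(device)
        self.X2 = torch.as_tensor(X2).float().to(device)
        self.dist = torch.as_tensor(dist).float().to(device)

    def __len__(self):
        return len(self.dist)

    def __getitem__(self, idx):
        # These are already CUDA tensors, so indexing and collating stay on the GPU.
        return self.X1[idx], self.X2[idx], self.dist[idx]

With a dataset like this, keep num_workers=0 on the DataLoader, since CUDA tensors cannot be handed back from worker subprocesses.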
@dscarmo thank you I understand.
I have made a simplified gist: https://gist.github.com/turian/caf869ae30932384c7c4b7d201125493
One is the siamese network, one is a binary regressor.
In both, I time the training without .cuda() and with .cuda()
siamese is 20.3s without, 18.1s with.
binary is 15.6s without, 9.7s with.
The binary data is 10x the size, so it makes sense that binary gains more speedup through loading in advance.
Perhaps I should submit a pull request to improve the documentation?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.