I am running GPU training, but it is not much faster than CPU training. My batch_size is 1024. I see that the model's datasets are on the CPU, and so are the model parameters. Why?
How can I preload them all onto the GPU for faster training? They will fit in GPU memory.
I see 18 it/sec on CPU and 30 it/sec on GPU
for i, (x, y, d) in enumerate(model.train_dataloader()):
    print(x.device)
    print(d.device)
    break
cpu
cpu
for p in model.parameters():
    # p.requires_grad: bool
    # p.data: Tensor
    print(p.device)
cpu
...
I have tried applying .to(device) on the dataset tensors.
Unfortunately, it is not obvious to me from the PyTorch Lightning docs how to debug whether something is on the GPU or not.
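For reference, this is the kind of thing I mean by applying .to(device); the shapes here are made up for illustration:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 10)
x = x.to(device)   # .to() returns a copy on the target device; it does not modify x in place
print(x.device)    # cuda:0 when a GPU is available, otherwise cpu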
can you put a colab showing it does NOT use GPUs?
we run tests on GPUs on every PR, all our examples use GPUs and we all work on GPUs daily...
it's very likely something about your code, or maybe you haven't installed the GPU build of PyTorch.
If you can replicate this on colab we can look at this.
In the meantime, please follow the guide to clean up your code (https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#preparing-your-code)
So I have tried this on Colab, with and without GPUs. I have also tried it on an AWS GPU instance, with and without GPUs. Both give similar results: only a 2x increase, and everything shows the model params and data on the CPU.
@williamFalcon here is a minimal example.
https://colab.research.google.com/drive/1RA1Ny2wyVzzPIyvWdjUm41Pm4SUR31_C?usp=sharing
You can see that both CPU and GPU get the same speed. And the model parameters and dataset are on the CPU.
I read the instructions here https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#preparing-your-code
and have done everything except for "Init tensors using type_as". I didn't see that in the single GPU doc, and couldn't figure out how to do it or find any more documentation about it. If that is what is necessary I am happy to do it.
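For what it's worth, my rough understanding of the type_as pattern is the sketch below (shapes made up); please correct me if I have it wrong:

import torch

# x stands in for a batch that Lightning has already moved to the right device.
x = torch.randn(8, 3, device="cuda" if torch.cuda.is_available() else "cpu")

# A tensor created inside the module defaults to the CPU; .type_as(x) casts it
# to x's dtype and moves it to x's device, so it follows the batch automatically.
mask = torch.zeros(8, 3).type_as(x)
print(mask.device)  # cuda:0 when a GPU is available, otherwise cpu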
Besides this specific example, one thing I am missing from the PL docs is just some simple instructions on how to verify if GPU training is actually happening as desired. Is my diagnostic code:
for i, (x, y, d) in enumerate(model_gpu.datatrain):
    print(x.device)
    print(y.device)
    print(d.device)
    break

for p in model.parameters():
    print(p.device)
the correct way to do it? Or something else?
What is probably happening (I couldn't access your minimal example, permission issues) is that your example is too simple and bottlenecked by I/O.
The GPU really speeds things up when it doesn't have to wait for HDD reads.
PL automatically uses the GPU if you give gpus=1 to the Trainer. One way to verify is to run nvidia-smi in a terminal (which is harder in Google Colab), or to print .device (as you showed), which should say "cuda".
There are also ways to monitor GPU usage with the nvidia-ml-py3 library, but I think that is a bit overkill.
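If it helps, here is a minimal, self-contained sketch (model, sizes, and names are made up) that prints the devices from inside training_step, after PL has already moved everything:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class DeviceCheckModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        if batch_idx == 0:
            # Inside the step, PL has already moved the batch and the model;
            # both should print cuda:0 when gpus=1 is passed to the Trainer.
            print("batch:", x.device, "params:", next(self.parameters()).device)
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        x, y = torch.randn(1024, 10), torch.randn(1024, 1)
        return DataLoader(TensorDataset(x, y), batch_size=256)

# trainer = pl.Trainer(gpus=1, max_epochs=1)
# trainer.fit(DeviceCheckModel())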
@dscarmo strange that you cannot see my colab. Anyway, here is that notebook exported as a gist:
https://gist.github.com/turian/f3ea99f6f495f4e6bd45c8a89d23552c
As you can see, I do both CPU and GPU training. GPU training is working, as evidenced by GPU available: True, used: True. But the speed is the same, and it looks like the model params and data are on the CPU device. So if they are being transferred to the GPU on every batch, yeah, that would be slow.
What I would expect to happen is that there would be a loading time at the beginning onto the GPU and then training would be very fast.
@dscarmo
Yeah, so I am also trying to run it on an AWS GPU instance. It doesn't seem to utilize the GPU very much.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    72W / 149W |    332MiB / 11441MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7005      C   python3                                      321MiB |
+-----------------------------------------------------------------------------+
But it is used somewhat.
The fact that if I run
for i, (x, y, d) in enumerate(model_gpu.datatrain):
    print(x.device)
    print(y.device)
    print(d.device)
    break

for p in model_gpu.parameters():
    print(p.device)
prints
cpu
cpu
cpu
...
suggests to me that the model parameters and the data are not on the GPU, and that it keeps transferring during training.
Interesting, if I run the .device code after trainer_gpu.fit(model_gpu)
I get:
cpu
cpu
cpu
cuda:0
cuda:0
cuda:0
cuda:0
which suggests that the parameters are now on the GPU, but the datasets are still on the CPU.
i'm not sure what the goal here is haha. if it's to verify that it runs on GPUs then use nvidia-smi. if it's to measure speed up vs CPU then use a large dataset.
for a small dataset you won't see a difference
Calm down, there is some confusion going on here. Try to edit your comments instead of posting new ones. I will try to write something up, but you should study how deep learning training works, look up exactly what a batch and an epoch are, and why the DataLoader is necessary.
The DataLoader is a class that interfaces with the Dataset, which implements reading some kind of data from "the CPU" (the hard drive) in an optimized way. This relegates data processing (such as transforms) to the CPU, while the GPU trains on the previous batch.
It's possible, but not common, to have everything inside the GPU; your dataset would have to be pretty small for that to work.
Besides, you don't want to do ALL operations on the GPU, like data manipulation before the batch forward. That's why dataloaders are implemented the way they are. If you do it properly, your GPU will be pegged at 100% usage while the dataloader workers fetch the next batch in the background, as sketched below. So yes, depending on WHERE you check .device, it might be on the CPU.
Your data and model should, however, be on cuda:0 in the training step, for example.
In your gist example, you are checking the parameters before the trainer has started. They are still initialized on the CPU. PL will move the model and the dataloader output to the GPU at the start of training. That's why the model is on cuda:0 after training.
The small gain in speed is probably because your model is very small (1k parameters). GPUs are optimized for parallel work over large batches of data and/or large models. At this small size, the speed is probably bottlenecked by other operations such as I/O.
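To make the worker idea concrete, this is roughly the shape of it (the toy in-memory dataset stands in for one that reads from disk or applies transforms in __getitem__):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100_000, 10), torch.randn(100_000, 1))

loader = DataLoader(
    dataset,
    batch_size=1024,
    shuffle=True,
    num_workers=4,    # worker processes prepare the next batches on the CPU
    pin_memory=True,  # page-locked host memory speeds up the CPU-to-GPU copy
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # In a hand-written loop you would move the batch here; with Lightning,
    # the Trainer does this per-batch transfer for you.
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    break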
@dscarmo Thank you for being patient and helping me. I apologize for not editing my comments, I'll do it in the future. I am familiar with deep learning; if you Google me you will see I am a co-author on the original Theano paper :) What I am not familiar with is pytorch lightning and more recent tooling. But I must say that you are shipping a really beautiful product and I wish we had these tools ten years ago.
So a few details:
When I ask inside the forward() method where the datasets are, it says CPU. So the issue I am trying to resolve is why my data is not on cuda:0 in the training stage.
If your dataset fits entirely on the GPU, use batch_size = dataset_size and the dataloader will make one big batch containing the whole dataset for the model to forward over, which means an epoch is done in only one batch (see the sketch below).
As I said, the dataset going from CPU to GPU is by DataLoader design, to use CPU power between batches. Increase num_workers on the dataloader to get parallel loading.
The data not being on CUDA in the step is weird. I just tested as a sanity check, and the batch definitely gets to the step as a CUDA tensor.
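Something like this (toy numbers) is what I mean by the whole dataset as a single batch:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(50_000, 10), torch.randn(50_000, 1))

# One batch == the whole dataset: an epoch is a single forward/backward pass,
# with only one CPU-to-GPU transfer per epoch.
loader = DataLoader(dataset, batch_size=len(dataset), shuffle=False)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([50000, 10])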
Okay, so my understanding of the behavior of PyTorch Lightning now (I don't think this is documented) is that each batch will be loaded onto the GPU from the CPU, and then the training step will run on the GPU.
If your dataset fits entirely on the GPU, use batch_size = dataset_size and the dataloader will make one big batch containing the whole dataset for the model to forward over, which means an epoch is done in only one batch.
It seems that it still reloads the data onto the GPU every epoch. I have tested this, and there is a pause between each epoch, which I suspect is the CPU-to-GPU data load. Then the epoch itself happens instantaneously.
The data not being on CUDA in the step is weird. I just tested as a sanity check, and the batch definitely gets to the step as a CUDA tensor.
Given my understanding, what I meant is that the original dataset is not on the GPU during the step. If I look at the batch data, they are on the GPU during the step.
My question remains: if my dataset fits in GPU memory, how can I avoid copying it from CPU to GPU every epoch?
I see what you want now. In this case, try saving your whole dataset's data as GPU tensors inside the dataset object when you create it (in the dataset's __init__). Return them as GPU tensors from the get-go and you will never move them.
Looking at your example, do:
self.dist = torch.Tensor(dist).float().cuda()
self.X1 = torch.Tensor(X1).float().cuda()
self.X2 = torch.Tensor(X2).float().cuda()
Note that from then on, any operation over these tensors has to be done with CUDA tensors.
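Put together, a dataset along these lines would look roughly like the sketch below (field names borrowed from your example, everything else made up):

import torch
from torch.utils.data import Dataset

class PreloadedGPUDataset(Dataset):
    """Keeps the whole dataset on the GPU, so nothing is copied per batch or per epoch."""

    def __init__(self, X1, X2, dist, device="cuda"):
        # One-time move at construction; this requires the data to fit in GPU memory.
        self.X1 = torch.as_tensor(X1).float().to(device)
        self.X2 = torch.as_tensor(X2).float().to(device)
        self.dist = torch.as_tensor(dist).float().to(device)

    def __len__(self):
        return len(self.dist)

    def __getitem__(self, idx):
        # These are already CUDA tensors, so indexing and collating stay on the GPU.
        return self.X1[idx], self.X2[idx], self.dist[idx]

With a dataset like this, keep num_workers=0 on the DataLoader, since CUDA tensors cannot be handed back from worker subprocesses.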
@dscarmo thank you I understand.
I have made a simplified gist: https://gist.github.com/turian/caf869ae30932384c7c4b7d201125493
One is the siamese network, one is a binary regressor.
In both, I time the training without .cuda() and with .cuda()
siamese is 20.3s without, 18.1s with.
binary is 15.6s without, 9.7s with.
The binary data is 10x the size, so it makes sense that binary gains more speedup through loading in advance.
Perhaps I should submit a pull request to improve the documentation?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.