Pytorch-lightning: TPU error: RAM full, page stopped responding, and slower than GPU on Google Colab

Created on 7 Apr 2020  ·  16 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Open lightning_mnist_tpu.ipynb
  2. Run the code

Expected behavior

The code runs normally and trains faster on TPU than on GPU.

Error

  1. The web page stops responding soon after the trainer starts, on several devices (PC, phone, and the Puffin browser), with RAM reaching 100% on the PC. This happens with both GPU and TPU runtimes.
  2. Iteration speed on TPU is ~30 it/s, while iteration speed on GPU is >90 it/s.

Additional context

Running the demo notebook Lightning-demo.ipynb on TPU (with prepare_data added) fixed the first error, but iteration speed is still slower on TPU.
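
For context, here is a minimal sketch of what "with prepare_data added" can look like in a LightningModule, assuming the MNIST setup of the demo notebook; the notebook's actual code is not reproduced in this thread, so the names below are illustrative.

# Illustrative sketch only; not the demo notebook's real code.
# `prepare_data` is the LightningModule hook referred to above: it runs once
# before training, so the dataset download does not race across TPU processes.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def prepare_data(self):
        # download once, up front
        datasets.MNIST('.', train=True, download=True,
                       transform=transforms.ToTensor())

    def train_dataloader(self):
        dataset = datasets.MNIST('.', train=True, download=False,
                                 transform=transforms.ToTensor())
        return DataLoader(dataset, batch_size=64, num_workers=2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': F.cross_entropy(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)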

Labels: TPU, bug / fix, help wanted, information needed

All 16 comments

Hi! Thanks for your contribution, great first issue!

@OliverCWY could you share a link to the notebook?

Sorry, I did not make it clear that I was using the official TPU demo notebook: https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3

Well, for me it is failing with a TQDM error:

Exception in device=TPU:0: 'tqdm_notebook' object has no attribute 'leave'
  File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
    if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 505, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 850, in run_pretrain_routine
    self.val_progress_bar.close()
  File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
    if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:1: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:7: 'tqdm_notebook' object has no attribute 'leave'

@OliverCWY could you share your error?

@Borda There are no error messages for me, apart from a lot of warnings. The web page stops responding even when I set warnings.filterwarnings("ignore").
One possible reason is that the tqdm progress bar redraws on every update without freeing memory, but the problem only exists in the TPU demo notebook. When I copy the code into the demo notebook (https://colab.research.google.com/drive/1IqfISTenqy50Fq8DafCmm8KfUf9JssJF), everything is fine except for the iteration speed.

If you restart the runtime, the tqdm error should go away.

I tried to profile an efficientnet_es model on CIFAR-10. It takes ~500+ seconds for forward propagation but only ~47 seconds for backprop. It also takes over 15 minutes to run one epoch, which doesn't seem right. This was measured over 6 epochs.

I used lightning on colab for other models and they all had this problem.

For GPU as well

Colab pages render slowly when the progress bar refreshes on every step.
Set the tqdm refresh rate to 10 or something greater than 1 (see the sketch below).
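
A minimal sketch of that suggestion, assuming the 0.7.x-era Trainer argument progress_bar_refresh_rate (the thread does not show the actual Trainer call):

# Sketch under the assumption of the 0.7.x-era API; in later versions the
# refresh rate moved to the progress-bar callback.
import pytorch_lightning as pl

trainer = pl.Trainer(
    num_tpu_cores=8,               # 0.7.x-era TPU flag
    progress_bar_refresh_rate=10,  # redraw the tqdm bar every 10 steps, not every step
)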

Can you share the colabs?
We have speed benchmarks in CI; Lightning is a few seconds slower than pure PyTorch because of the loggers and the tqdm bar, but not by much (i.e. if you added TensorBoard to your code, it would be just as slow).

This is likely because you're not putting something on the GPU, or something like that.
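
One way to check how much of the gap comes from the loggers and the progress bar is to time an epoch with both disabled. This is a hedged sketch using 0.7.x-era argument names; MNISTModel is the illustrative module sketched earlier, not code from this thread.

# logger=False disables the TensorBoard logger; progress_bar_refresh_rate=0
# disables the tqdm bar (0.7.x-era arguments).
import time
import pytorch_lightning as pl

model = MNISTModel()  # illustrative module from the earlier sketch
trainer = pl.Trainer(max_epochs=1, logger=False, progress_bar_refresh_rate=0)

start = time.time()
trainer.fit(model)
print(f"epoch time without logger/tqdm: {time.time() - start:.1f}s")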

Just tested on colab... it works fine
https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=kr8cql-aaKnC

The memory used by the iframes in Google Colab reaches 600+ MB after the 29th epoch and continues to increase, so setting the refresh frequency probably does not actually address the problem.

Also, I am actually referring to the speed across different devices with pytorch-lightning: running on TPU is significantly slower than running on GPU or even on CPU.

Using a single layer of nn.Linear:
[screenshots: TPU run (TPU_0), CPU run (CPU_0), GPU P100 run (P100_0)]
With more layers:
[screenshots: TPU run (TPU_1), CPU run (CPU_1), GPU P100 run (P100_1)]
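
The exact benchmark code behind these screenshots is not included in the thread. Below is a hypothetical reconstruction of the comparison, reusing the illustrative MNISTModel above and 0.7.x-era Trainer flags; each line would be run in the matching Colab runtime type.

# Hypothetical reconstruction; swap the accelerator flag to compare devices.
import pytorch_lightning as pl

pl.Trainer(max_epochs=1, num_tpu_cores=8).fit(MNISTModel())  # TPU runtime, 8 cores
pl.Trainer(max_epochs=1, gpus=1).fit(MNISTModel())           # GPU runtime (P100)
pl.Trainer(max_epochs=1).fit(MNISTModel())                   # CPU runtime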

Speed is fixed in 0.7.3.
The RAM issue is a Colab issue, not a PL issue. Crash the RAM using cell 1 or upgrade to Colab Pro.

@williamFalcon Sorry, but I don't think the problem is solved.
I just tested on Colab:

[screenshots: installed version, CPU run (CPU_0), TPU run (TPU_0)]

By "RAM full" I am referring to the memory used by my browser running Colab, so it is not a problem with the backend.

Thank you for your patience.

I am afraid this does not work for me either, so I also don't think the problem is solved.

I have tried all the versions given in the notebook.

Additionally, I have tried version 20200516, which is used in the official Colab TPU MNIST example notebook that does not use pytorch-lightning (see NB2 below for a reference).

A summary of the results:

"1.5" : wont run at all
"20200325" hangs in the final epoch (with 10 epochs in the 10th, with 3 epochs in the 3rd)
"nightly" crashes with : Exception: process 0 terminated with signal SIGABRT

"20200516" hangs after one epoch

I have tried this several times over the last few days. With the exception of the nightly build, these results have always been the same.
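
For reference, the TPU Colab notebooks of that era typically pinned the torch_xla wheels via the pytorch/xla env-setup script; the cell below illustrates what switching between the versions listed above usually looked like. It is not the exact cell from the notebook.

# Illustrative Colab cell, not the notebook's exact cell.
VERSION = "20200325"  # also tried: "1.5", "nightly", "20200516"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION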

NB1:
Locally I am on a Mac, not sure whether this makes a difference.

My terminal gives this:

uname -a
Darwin osx-lhind6957 18.7.0 Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64 x86_64

NB2:
The links to that official Colab TPU MNIST example notebook, which does not use pytorch-lightning, are here:
https://cloud.google.com/tpu/docs/colabs?hl=de

https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/mnist-training.ipynb?authuser=1#scrollTo=sPJVqAKyml5W

(The official notebook, which does not use pytorch-lightning, has no problem and runs through with 20200516.)
