Colabtools: ResourceExhaustedError when using TPU

Created on 4 Aug 2020 · 22 Comments · Source: googlecolab/colabtools

I have a few notebooks on Colab Pro which use TPU and worked perfectly a day ago, but now everything crashes

ResourceExhaustedError: 9 root error(s) found.
  (0) Resource exhausted: {{function_node __inference_train_function_453721}} Compilation failure: Ran out of memory in memory space hbm. Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G.

Is there a changelog where I can see what changed in Colab, or is this something to do with the TPU infrastructure?

I can make it work by reducing the batch size, but it has to be roughly halved, which makes models train at least 2x slower.
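
As a rough sanity check on the numbers in the error message, the sketch below parses the "Used X of Y hbm" line and estimates how much the memory footprint would have to shrink. It assumes, as a simplification, that HBM use scales roughly linearly with batch size; in practice weights and optimizer state are a fixed cost, so this is only a ballpark. The helper name `parse_hbm_usage` is made up for illustration.

```python
import re


def parse_hbm_usage(message):
    """Return (used_gb, capacity_gb) parsed from a TPU HBM out-of-memory message."""
    match = re.search(r"Used ([\d.]+)G of ([\d.]+)G hbm", message)
    if match is None:
        raise ValueError("no HBM usage found in message")
    return float(match.group(1)), float(match.group(2))


msg = ("Compilation failure: Ran out of memory in memory space hbm. "
       "Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G.")

used_gb, capacity_gb = parse_hbm_usage(msg)
overshoot_gb = round(used_gb - capacity_gb, 2)   # should match "Exceeded ... by"
shrink_factor = used_gb / capacity_gb            # how far over capacity we are

print(overshoot_gb, round(shrink_factor, 2))
```

A factor a little above 2 is consistent with having to roughly halve the batch size.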

I've noticed that TensorFlow started to show strange warnings:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0340s vs `on_train_batch_end` time: 0.4056s). Check your callbacks.

gena

👍6

All 22 comments

Is it possible to create a self-contained repro notebook?

craigcitro on 4 Aug 2020

I noticed Colab updated to TensorFlow 2.3.0 yesterday, and I have also been having issues with notebooks that ran perfectly on TPUs the day before, while still on TensorFlow 2.2.0, and that still run perfectly on Kaggle.

I also noticed the new warnings, including:

WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0115s vs `on_train_batch_end` time: 0.2912s). Check your callbacks.

It pops up after the first few batches. I was able to reproduce it just by running this Colab tutorial notebook; it appears during the first epoch.

I have been having issues with my own notebooks "freezing" (i.e. not beginning to train, no OOM error either) with images larger than 380 (which trained perfectly before the tf 2.3.0 update). However, I have not been able to reproduce this with the flowers tutorial notebook above; it runs smoothly even with the size 512 flower dataset.

I'll keep trying to create a self-contained repro with the flowers notebook. I'll get back here if I am able to, or if I find something new.

reyvaz on 4 Aug 2020

@craigcitro: It's an on-going Kaggle competition, I can share a notebook privately (chat or mail [email protected])

gena on 4 Aug 2020

Please fix it as soon as possible. I am in the ongoing Kaggle competition as well, and the deadline is very close. I really need Colab to run experiments! Thanks!

linyuanthocr on 5 Aug 2020

I am having the same problem, really need a fix soon.

nguyenphuhien13 on 5 Aug 2020

If you have repro steps, I'd encourage you to raise an issue directly in the TF repo: we're just bundling TF 2.3.

In particular, is this the same issue? https://github.com/tensorflow/tensorflow/issues/42043

craigcitro on 5 Aug 2020

I think I was able to reproduce the issue in this repro notebook using the flowers dataset.

@craigcitro I believe it is related to tensorflow/tensorflow#42043 so I'm going to take a closer look there too. I have that same issue with pretty much all models that do not OOM like the one I'm sharing here.

reyvaz on 5 Aug 2020

I am having the same problem. Seems the tpu now only has 7.48G hbm.

PuckWong on 5 Aug 2020

For anyone looking to help get this fixed: making comments on the upstream issue is most helpful.

For anyone looking to use TF 2.2 with a TPU for now, this should get you unblocked:

!pip install tensorflow~=2.2.0 tensorflow_gcs_config~=2.2.0
import os
import requests
import tensorflow as tf

# Ask the Colab TPU worker to switch to the TF version that was just installed.
tpu_host = os.environ["COLAB_TPU_ADDR"].split(":")[0]
resp = requests.post("http://{}:8475/requestversion/{}".format(tpu_host, tf.__version__))
if resp.status_code != 200:
  print("Failed to switch the TPU to TF {}".format(tf.__version__))
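
A slightly more defensive variant of the snippet above, as a sketch only: it builds the same `requestversion` URL from `COLAB_TPU_ADDR` and raises on a non-200 response instead of printing. The function names and the injected `post` argument are assumptions added for illustration (they make the logic testable without a TPU); the endpoint itself is the one from the workaround.

```python
import os


def tpu_version_url(tpu_addr, version):
    """Build the version-switch URL used by the workaround."""
    host = tpu_addr.split(":")[0]
    return "http://{}:8475/requestversion/{}".format(host, version)


def switch_tpu_version(version, post, tpu_addr=None):
    """POST the version request and raise if the TPU worker rejects it.

    `post` is any callable with the calling convention of `requests.post`.
    """
    if tpu_addr is None:
        tpu_addr = os.environ["COLAB_TPU_ADDR"]
    resp = post(tpu_version_url(tpu_addr, version))
    if resp.status_code != 200:
        raise RuntimeError(
            "Failed to switch the TPU to TF {} (HTTP {})".format(
                version, resp.status_code))
    return resp
```

In a Colab cell this would be called as `switch_tpu_version(tf.__version__, requests.post)`. A 200 response only means the request was accepted; it does not by itself prove the model will fit in memory.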

craigcitro on 5 Aug 2020

👍6

Switching to TF 2.2 works.

PuckWong on 5 Aug 2020

It works, thanks a lot!

linyuanthocr on 5 Aug 2020

Thanks, @craigcitro!! This will work for now before it's fixed in 2.3+

gena on 5 Aug 2020

It crashes again starting yesterday, and the workaround does not work anymore - @craigcitro, help!!! 😭

Here is a gist with the TF2.2 workaround applied using the notebook posted by @reyvaz:
https://colab.research.google.com/gist/gena/ccb3e4f43dd9980b7e275c5a2b145075/tf_tpu_issue_submission.ipynb

gena on 8 Aug 2020

The workaround stopped working. Please help.

chucksylar on 8 Aug 2020

I am facing the same issue (see reyvaz's notebook above). Please let us use TF 2.2 until this TPU bug is fixed in TF 2.3.

graf10a on 8 Aug 2020

@graf10a, just to clarify: @reyvaz's notebook tries to use TF 2.3 (the current default) and fails; the notebook in my last comment tries to use the workaround posted by @craigcitro and currently fails as well. We need at least some working Colab; the current version basically can't be used for models that worked a few days ago.

gena on 8 Aug 2020

@gena Sorry, I just ran the notebook at the link you provided and got the same error. This is what I meant. Actually, I made some changes in this notebook and managed to make it work:

https://colab.research.google.com/gist/graf10a/642c143fd80887ac8a69819250f5141b/tf_tpu_issue_submission_fixed.ipynb

Also, I ran one of my own old notebooks with the TF 2.2 workaround and it seems to be working just fine (EfficientNet B5 with 640x640 images). So, I am not sure what the issue is here (but I might be missing something).

graf10a on 8 Aug 2020

👍1

@graf10a, thanks for letting me know. Interesting: I'm actually able to train B0 with this workaround, but only without moving to larger resolutions and batch sizes. Maybe my old notebooks are somehow stuck with a TPU not switching context ... but it returns 200, strange.

gena on 8 Aug 2020

@gena Did you notice that I made some changes in the TPU initialization part of the notebook? I don't think this part was working properly in your version of the notebook. Also, I have changed the import of the EfficientNet model. Maybe this was what made the difference?
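
The thread does not say what exactly was changed in the notebook. For reference only, a commonly used TPU initialization sequence for TF 2.2/2.3 on Colab looks roughly like the sketch below; the function name `init_tpu_strategy` is hypothetical, and the code needs a Colab TPU runtime to actually run.

```python
import os


def init_tpu_strategy():
    """Connect to the Colab TPU and return a TPU distribution strategy."""
    import tensorflow as tf  # imported lazily; only useful on a TPU runtime

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)

    # TF 2.3 exposes tf.distribute.TPUStrategy; TF 2.2 only has the
    # experimental name, so fall back to it when needed.
    strategy_cls = getattr(tf.distribute, "TPUStrategy", None)
    if strategy_cls is None:
        strategy_cls = tf.distribute.experimental.TPUStrategy
    return strategy_cls(resolver)
```

The model would then be built and compiled inside `with strategy.scope():`, where `strategy` is the returned object.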

graf10a on 8 Aug 2020

Yes, this is also how I initialized it.

I guess we should stay with small batch sizes for now. I'm also using the same EFN, but I'm somehow getting OOM at 256x256 with batch size 16, though sometimes it runs, which is strange. I guess I need to dig for some TPU inspection tools (if there are any) to see what is going on.

Update: found an error in my code multiplying by replicas two times, ooh! Sorry for the false alarm.
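
A minimal illustration of this kind of bug, assuming 8 replicas (in real code the count would come from `strategy.num_replicas_in_sync`) and a per-replica batch of 16. The numbers and names are illustrative, not taken from the notebook.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Global batch size is the per-replica batch times the replica count."""
    return per_replica_batch * num_replicas


num_replicas = 8        # assumed; use strategy.num_replicas_in_sync in practice
per_replica_batch = 16  # illustrative value

correct = global_batch_size(per_replica_batch, num_replicas)
# The bug: scaling an already-global batch size by the replica count again.
doubled = global_batch_size(correct, num_replicas)

print(correct, doubled)  # 128 1024
```

Applying the scaling twice makes the effective batch 8x larger than intended, which is enough to trigger an HBM out-of-memory error on its own.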

gena on 8 Aug 2020

@gena Great! I am glad it is working for you now!

graf10a on 9 Aug 2020

👍1

Duplicate of https://github.com/tensorflow/tensorflow/issues/42043
Please do follow up on the TensorFlow issue. We'll need to pursue a fix for this in TensorFlow rather than Colab.