I have a few notebooks on Colab Pro that use the TPU and worked perfectly a day ago, but now everything crashes:
ResourceExhaustedError: 9 root error(s) found.
(0) Resource exhausted: {{function_node __inference_train_function_453721}} Compilation failure: Ran out of memory in memory space hbm. Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G.
Is there a changelog where I can see what changed in Colab, or is this something to do with the TPU infrastructure?
I can make it work by reducing the batch size, but it has to be roughly halved, which makes models train at least 2x slower.
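As a rough sanity check on why the batch size has to be about halved, the numbers in the error message can be pulled out and compared. This is only a sketch: the regular expression is written against the exact message quoted above, the helper name `parse_hbm_usage` is made up for illustration, and the idea that memory use scales roughly linearly with batch size is an assumption, not something the error confirms.

```python
import re

# Pattern written against the message quoted in this thread (assumption:
# other TF versions may word the error differently).
HBM_PATTERN = re.compile(r"Used (?P<used>[\d.]+)G of (?P<capacity>[\d.]+)G hbm")


def parse_hbm_usage(message):
    """Return (used_gb, capacity_gb, used/capacity) or None if no match."""
    match = HBM_PATTERN.search(message)
    if match is None:
        return None
    used = float(match.group("used"))
    capacity = float(match.group("capacity"))
    return used, capacity, used / capacity


msg = (
    "Compilation failure: Ran out of memory in memory space hbm. "
    "Used 16.79G of 7.48G hbm. Exceeded hbm capacity by 9.31G."
)
used, capacity, ratio = parse_hbm_usage(msg)
print("{:.2f}G used of {:.2f}G ({:.2f}x capacity)".format(used, capacity, ratio))
```

The compiled program asks for a bit more than twice the available HBM, which is consistent with needing to cut the batch size roughly in half.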
I've noticed that TensorFlow started to show strange warnings:
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0340s vs `on_train_batch_end` time: 0.4056s). Check your callbacks.
Is it possible to create a self-contained repro notebook?
I noticed they updated to tensorflow 2.3.0 yesterday, and I have also been having issues with notebooks that ran perfectly on TPUs the day before, while still on tensorflow 2.2.0, and that still run perfectly in Kaggle.
I also noticed the new warnings, including this one:
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0115s vs `on_train_batch_end` time: 0.2912s). Check your callbacks.
It pops up after the first few batches. I was able to reproduce it just by running this colab tutorial notebook; it appears during the 1st epoch.
I have been having issues with my own notebooks "freezing" (i.e. not beginning to train, no OOM error either) with images larger than 380 (which trained perfectly before the tf 2.3.0 update). However, I have not been able to reproduce this with the flowers tutorial notebook above; it runs smoothly even with the size 512 flower dataset.
I'll keep trying to create a self-contained repro with the flowers notebook. I'll get back here if I am able to, or if I find something new.
@craigcitro: It's an on-going Kaggle competition, I can share a notebook privately (chat or mail [email protected])
Please fix it as soon as possible. I am on the on-going Kaggle competition as well, the deadline is very close. Really need the colab to run experiments! Thanks!
I am having the same problem, really need a fix soon.
If you have repro steps, I'd encourage you to raise an issue directly in the TF repo: we're just bundling TF 2.3.
In particular, is this the same issue? https://github.com/tensorflow/tensorflow/issues/42043
I think I was able to reproduce the issue in this repro notebook using the flowers dataset.
@craigcitro I believe it is related to tensorflow/tensorflow#42043 so I'm going to take a closer look there too. I have that same issue with pretty much all models that do not OOM like the one I'm sharing here.
I am having the same problem. It seems the TPU now only has 7.48G of HBM.
For anyone looking to help get this fixed: making comments on the upstream issue is most helpful.
For anyone looking to use TF 2.2 with a TPU for now, this should get you unblocked:
!pip install tensorflow~=2.2.0 tensorflow_gcs_config~=2.2.0
import tensorflow as tf
import requests
import os
# Ask the Colab TPU worker to switch to the TF version installed above.
resp = requests.post("http://{}:8475/requestversion/{}".format(os.environ["COLAB_TPU_ADDR"].split(":")[0], tf.__version__))
if resp.status_code != 200:
    print("Failed to switch the TPU to TF {}".format(tf.__version__))
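For anyone who wants the version switch to fail loudly instead of printing a message, the snippet above could be wrapped roughly as follows. This is a sketch, not an official Colab API: the function names `version_switch_url` and `switch_tpu_version` are made up here, the timeout value is arbitrary, and the address `10.0.0.2:8470` in the demo is a placeholder standing in for whatever `COLAB_TPU_ADDR` holds in a real session. Only the port `8475` and the `/requestversion/` path come from the workaround itself.

```python
import os


def version_switch_url(tpu_addr, tf_version, port=8475):
    """Build the version-request URL used by the workaround above."""
    host = tpu_addr.split(":")[0]  # drop the gRPC port, keep the host
    return "http://{}:{}/requestversion/{}".format(host, port, tf_version)


def switch_tpu_version(tf_version, tpu_addr=None):
    """POST the version request; raise instead of printing on failure."""
    import requests  # imported lazily so the URL helper works without it

    if tpu_addr is None:
        tpu_addr = os.environ.get("COLAB_TPU_ADDR")
    if not tpu_addr:
        raise RuntimeError("COLAB_TPU_ADDR is not set; is a TPU runtime attached?")
    resp = requests.post(version_switch_url(tpu_addr, tf_version), timeout=30)
    if resp.status_code != 200:
        raise RuntimeError(
            "Failed to switch the TPU to TF {} (HTTP {})".format(
                tf_version, resp.status_code
            )
        )
    return resp


# Placeholder address, for illustration only.
print(version_switch_url("10.0.0.2:8470", "2.2.0"))
```

In a notebook, `switch_tpu_version(tf.__version__)` would be called right after the `pip install` and the TensorFlow import, before initializing the TPU.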
Switching to TF 2.2 works.
It works, thanks a lot!
Thanks, @craigcitro!! This will work for now before it's fixed in 2.3+
It has been crashing again since yesterday, and the workaround does not work anymore. @craigcitro, help!!! 😭
Here is a gist with the TF2.2 workaround applied using the notebook posted by @reyvaz:
https://colab.research.google.com/gist/gena/ccb3e4f43dd9980b7e275c5a2b145075/tf_tpu_issue_submission.ipynb
The workaround stopped working... please help...
I am facing the same issue (see reyvaz's notebook above). Please let us use TF 2.2 until this TPU bug is fixed in TF 2.3.
@graf10a, just to clarify: @reyvaz's notebook tries to use TF 2.3 (the current default) and fails; the notebook in my last comment uses the workaround posted by @craigcitro and currently fails as well. We need at least some working Colab; the current version basically can't be used for models that worked a few days ago.
@gena Sorry, I just ran the notebook at the link you provided and got the same error. This is what I meant. Actually, I made some changes in this notebook and managed to make it work:
Also, I ran one of my own old notebooks with the TF 2.2 workaround and it seems to be working just fine (EfficientNet B5 with 640x640 images). So, I am not sure what is the issue here (but I might be missing something).
@graf10a, thanks for letting me know. Interesting: I'm actually able to train B0 with this workaround, but not at larger resolutions and batch sizes. Maybe my old notebooks are somehow stuck with a TPU that is not switching context ... but it returns 200, strange.
@gena Did you notice that I made some changes in the TPU initialization part of the notebook? I don't think this part was working properly in your version of the notebook. Also, I have changed the import of the EfficientNet model. Maybe this was what made the difference?
Yes, this is also how I initialized it.
I guess we should stay with small batch sizes then. I'm also using the same EFN, but somehow I get OOM at 256x256 with batch size 16, and sometimes it runs, which is strange. I guess I need to look for some TPU inspection tools (if there are any) to see what is going on.
Update: found an error in my code that multiplied the batch size by the number of replicas twice, ooh! Sorry for the false alarm.
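For others checking their own notebooks, here is a minimal illustration of that kind of mistake. The actual code was not posted, so this is only a guess at its shape: the helper `global_batch_size`, the per-replica batch of 16, and the replica count of 8 are all assumptions for the example. In a real notebook the replica count would come from `strategy.num_replicas_in_sync`.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Global batch = per-replica batch times the number of replicas."""
    return per_replica_batch * num_replicas


# Assumption: 8 TPU cores; in TF this is strategy.num_replicas_in_sync.
NUM_REPLICAS = 8
PER_REPLICA_BATCH = 16

# Intended: scale the per-replica batch once.
correct = global_batch_size(PER_REPLICA_BATCH, NUM_REPLICAS)

# Bug: the already-global value is scaled a second time, so each core
# ends up with NUM_REPLICAS times more data than intended.
doubled = global_batch_size(correct, NUM_REPLICAS)

print(correct, doubled)  # 128 1024
```

A batch that is 8x larger than intended can easily explain an OOM at an otherwise modest image size, so it is worth printing the final batch size right before building the dataset.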
@gena Great! I am glad it is working for you now!
Duplicate of https://github.com/tensorflow/tensorflow/issues/42043
Please do follow up on the TensorFlow issue. We'll need to pursue a fix for this in TensorFlow rather than Colab.