Bug report for Colab: http://colab.research.google.com/.
In a Colab TPU session, initializing a TPUStrategy with the following code fails:
import tensorflow as tf
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    tpu = None
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()
print("REPLICAS: ", strategy.num_replicas_in_sync)
The following error is raised:
Running on TPU ['10.109.245.114:8470']
INFO:tensorflow:Initializing the TPU system: 10.109.245.114:8470
INFO:tensorflow:Initializing the TPU system: 10.109.245.114:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Clearing out eager caches
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-3-5c79288551ed> in <module>()
7 if tpu:
8 tf.config.experimental_connect_to_cluster(tpu)
----> 9 tf.tpu.experimental.initialize_tpu_system(tpu)
10 strategy = tf.distribute.experimental.TPUStrategy(tpu)
11 else:
3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/tpu/tpu_strategy_util.py in initialize_tpu_system(cluster_resolver)
101 context.context()._clear_caches() # pylint: disable=protected-access
102
--> 103 serialized_topology = output.numpy()
104
105 # TODO(b/134094971): Remove this when lazy tensor copy in multi-device
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py in numpy(self)
940 """
941 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 942 maybe_arr = self._numpy() # pylint: disable=protected-access
943 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
944
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py in _numpy(self)
908 return self._numpy_internal()
909 except core._NotOkStatusException as e:
--> 910 six.raise_from(core._status_to_exception(e.code, e.message), None)
911
912 @property
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
NotFoundError: '__inference__tpu_init_fn_4' is neither a type of a primitive operation nor a name of a function registered in binary running on n-2221c432-w-0. Make sure the operation or function is registered in the binary running in this process.
Expected output (TPU system initialized successfully):
Running on TPU ['10.26.51.18:8470']
INFO:tensorflow:Initializing the TPU system: 10.26.51.18:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Querying Tensorflow master (grpc://10.26.51.18:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 5688683537495184073)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 3749670591192472159)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 12726202377899630824)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 4112510768860420127)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 4273195466617788134)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 18003206366557860002)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 11510611825613067855)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 11320511437524126117)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 10244199656502490705)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 14173748399582017948)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 11903147722232858508)
REPLICAS: 8
The web browser you are using (Chrome, Firefox, Safari, etc.):
This is not browser-related, but for reference: Firefox 72.0.1 (64-bit) on Ubuntu.
Link to self-contained notebook that reproduces this issue
(click the Share button, then Get Shareable Link):
Here
There is also an issue about this in the TensorFlow repo.
Colab has tensorflow 2.1.0 pre-installed. In order to use it, run the magic %tensorflow_version 2.x. When you run this (before importing tensorflow), we not only select 2.x but also do some additional work to configure the cloud TPU to use 2.x. The error you're seeing results from the TPU using 1.15 and your runtime using 2.1.0.
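To make the mismatch described above concrete, here is a minimal sketch of a version comparison. The helper names `major_minor` and `same_release_line` are hypothetical, and treating a matching major.minor pair as "compatible" is an assumption made for illustration; the comment only states that a 2.1.0 runtime against a 1.15 TPU fails. The TPU-side version is passed in by hand because this thread does not show an API for querying it.

```python
def major_minor(version):
    """Return (major, minor) as ints from a version string like '2.1.0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_release_line(client_version, tpu_version):
    """Return True when both versions share the same major.minor pair.

    Assumption for illustration only: a matching major.minor pair is
    treated as compatible. The thread only confirms that a 2.1.0 runtime
    talking to a 1.15 TPU fails.
    """
    return major_minor(client_version) == major_minor(tpu_version)


# The combination described in the comment: runtime 2.1.0, TPU 1.15.
print(same_release_line("2.1.0", "1.15"))  # False
# What the setup should look like once both sides are on 2.1.
print(same_release_line("2.1.0", "2.1"))   # True
```

In a notebook, the client side of this check would be `tf.__version__`, read after running `%tensorflow_version 2.x` as the first statement and then importing tensorflow.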
I believe I have found a way to fix this issue. See https://github.com/huan/tensorflow-handbook-tpu/issues/1#issuecomment-606189444