I'm trying to initialize a TPU distribution strategy and I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
System information
Yes
Debian 10
binary (pip3)
2.2.0-dev20200501
Python 3.7.3
none (GCP TPU)
none (GCP TPU)
Code to reproduce the issue
tpu_strategy.py
# -*- coding: utf-8 -*-
import os
from official.utils.misc import distribution_utils
tpu_name = os.getenv('TPU_NAME')
strategy = distribution_utils.get_distribution_strategy(
    distribution_strategy="tpu",
    tpu_address=tpu_name)
strategy_scope = distribution_utils.get_strategy_scope(strategy)
How to run this code
Follow the guide to run a TPU VM (https://cloud.google.com/tpu/docs/quickstart).
Then when you have a shell session on it, execute the following commands instead of running the MNIST example:
$ pip3 install tf-models-nightly
$ TPU_NAME=tpu-quickstart python3 tpu_strategy.py
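Note that os.getenv('TPU_NAME') returns None when the variable is not set, so it can help to rule that out before initializing the strategy. A minimal sketch, where require_tpu_name is a hypothetical helper and not part of the script above:

```python
import os


def require_tpu_name(env=None):
    """Return the TPU name from TPU_NAME, or raise if it is missing or empty."""
    # `env` defaults to the real environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    tpu_name = env.get("TPU_NAME", "").strip()
    if not tpu_name:
        raise RuntimeError("TPU_NAME is not set; export it before running the script.")
    return tpu_name
```

With this in place, tpu_name = require_tpu_name() could replace the os.getenv call, so a missing variable fails early with a clear message. This only rules out an unset variable; it is not a fix for the error reported here.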
Other info / logs
Complete log + stacktrace
2020-05-02 06:30:09.091314: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-05-02 06:30:09.091362: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Get strategy tpu
on TPU ichimia
2020-05-02 06:30:10.718866: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-05-02 06:30:10.718920: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-02 06:30:10.718950: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ichimia): /proc/driver/nvidia/version does not exist
2020-05-02 06:30:10.886804: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-05-02 06:30:10.894306: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2020-05-02 06:30:10.894601: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a05920 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-02 06:30:10.894631: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-05-02 06:30:10.903492: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-05-02 06:30:10.903537: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40985}
2020-05-02 06:30:10.919176: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-05-02 06:30:10.919227: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40985}
2020-05-02 06:30:10.919829: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:40985
Traceback (most recent call last):
  File "src/tpu.py", line 15, in <module>
    tpu_address=tpu_name)
  File "/home/amaret93/.local/lib/python3.7/site-packages/official/utils/misc/distribution_utils.py", line 129, in get_distribution_strategy
    cluster_resolver = tpu_lib.tpu_initialize(tpu_address)
  File "/home/amaret93/.local/lib/python3.7/site-packages/official/utils/misc/tpu_lib.py", line 33, in tpu_initialize
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/tpu/tpu_strategy_util.py", line 103, in initialize_tpu_system
    serialized_topology = output.numpy()
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1110, in numpy
    maybe_arr = self._numpy() # pylint: disable=protected-access
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1078, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
2020-05-02 06:30:21.913994: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
FYI, I managed to get rid of the error by using TensorFlow 2.1 and the following code (inspired by this repo):
# -*- coding: utf-8 -*-
import os
import tensorflow as tf
tpu_name = os.getenv('TPU_NAME')
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=tpu_name)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)
It seems this is now resolved. I am closing it.
Please re-file the issue if you encounter additional problems.
The issue is still there in tf-nightly 2.3.0-dev20200625.
Same issue with tf-nightly==2.4.0.dev20200708 and tf-nightly==2.4.0.dev20200709.
I'm facing the same issue in tf-2.3 stable.
import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    # No TPU could be resolved; fall back to the default strategy below.
    tpu = None
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in TensorFlow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()
print("REPLICAS: ", strategy.num_replicas_in_sync)
The error:
InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
Could you please reopen the issue? Thanks.
Same issue with tf-nightly==2.4.0-dev20200730.
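Since the reports above span several builds, it may help to attach the exact installed versions to each report. A rough sketch, assuming Python 3.8+ (for importlib.metadata); installed_versions is a hypothetical helper, and the default package list is only a guess at what is relevant here:

```python
import platform
from importlib import metadata  # available in Python 3.8+


def installed_versions(packages=("tensorflow", "tf-nightly", "tf-models-nightly")):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

This only reads package metadata and does not import TensorFlow, so it should run even when TPU initialization itself fails.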