Models: TPU distribution strategy fails: NodeDef expected inputs 'string' do not match 0 inputs specified

Created on 2 May 2020 · 6 Comments · Source: tensorflow/models

I'm trying to initialize a TPU distribution strategy and I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 10
TensorFlow installed from (source or binary): binary (pip3)
TensorFlow version (use command below): 2.2.0-dev20200501
Python version: Python 3.7.3
CUDA/cuDNN version: none (GCP TPU)
GPU model and memory: none (GCP TPU)

Code to reproduce the issue

tpu_strategy.py

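# -*- coding: utf-8 -*-
import os

from official.utils.misc import distribution_utils

# Name of the Cloud TPU, taken from the environment (None if TPU_NAME is unset).
tpu_name = os.getenv('TPU_NAME')

# Fails inside tpu_lib.tpu_initialize() with the InvalidArgumentError shown above.
strategy = distribution_utils.get_distribution_strategy(
    distribution_strategy="tpu",
    tpu_address=tpu_name)

strategy_scope = distribution_utils.get_strategy_scope(strategy)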

How to run this code

Follow the guide to run a TPU VM: https://cloud.google.com/tpu/docs/quickstart

Then when you have a shell session on it, execute the following commands instead of running the MNIST example:

$ pip3 install tf-models-nightly
$ TPU_NAME=tpu-quickstart python3 tpu_strategy.py
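
The script reads the TPU name with os.getenv('TPU_NAME'), which silently returns None when the variable is missing. The following is a minimal sketch of a pre-flight check that could run before the strategy is created. The helper name require_tpu_name is hypothetical (it is not part of tensorflow/models), and it is not claimed to fix the _Send error; it only rules out an unset variable as a source of confusion.

```python
import os


def require_tpu_name(env=None):
    """Return the TPU name from TPU_NAME, or raise if it is unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    name = env.get("TPU_NAME", "").strip()
    if not name:
        raise RuntimeError(
            "TPU_NAME is not set; run the script as "
            "TPU_NAME=<your-tpu-name> python3 tpu_strategy.py"
        )
    return name


# Example with an explicit mapping instead of the real environment.
print(require_tpu_name({"TPU_NAME": "tpu-quickstart"}))
```

In tpu_strategy.py, the result of require_tpu_name() would replace the direct os.getenv('TPU_NAME') call.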

Other info / logs

Complete log + stacktrace

2020-05-02 06:30:09.091314: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-05-02 06:30:09.091362: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Get strategy tpu
on TPU ichimia
2020-05-02 06:30:10.718866: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-05-02 06:30:10.718920: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-02 06:30:10.718950: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ichimia): /proc/driver/nvidia/version does not exist
2020-05-02 06:30:10.886804: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-05-02 06:30:10.894306: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2020-05-02 06:30:10.894601: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a05920 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-02 06:30:10.894631: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-02 06:30:10.903492: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-05-02 06:30:10.903537: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40985}
2020-05-02 06:30:10.919176: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-05-02 06:30:10.919227: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:40985}
2020-05-02 06:30:10.919829: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:40985
Traceback (most recent call last):
  File "src/tpu.py", line 15, in <module>
    tpu_address=tpu_name)
  File "/home/amaret93/.local/lib/python3.7/site-packages/official/utils/misc/distribution_utils.py", line 129, in get_distribution_strategy
    cluster_resolver = tpu_lib.tpu_initialize(tpu_address)
  File "/home/amaret93/.local/lib/python3.7/site-packages/official/utils/misc/tpu_lib.py", line 33, in tpu_initialize
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/tpu/tpu_strategy_util.py", line 103, in initialize_tpu_system
    serialized_topology = output.numpy()
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1110, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/home/amaret93/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1078, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
2020-05-02 06:30:21.913994: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}

official bug

Aschen

👍3

All 6 comments

FYI, I managed to get rid of the error by using TensorFlow 2.1 and the following code (inspired by this repo):

# -*- coding: utf-8 -*-
import os

import tensorflow as tf

# Name of the Cloud TPU, taken from the environment.
tpu_name = os.getenv('TPU_NAME')

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=tpu_name)

# Initialize the TPU system directly instead of going through distribution_utils.
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)

strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)

Aschen on 2 May 2020
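
The workaround above uses tf.distribute.experimental.TPUStrategy, while another comment in this thread (on TF 2.3) uses tf.distribute.TPUStrategy. To the best of my knowledge the non-experimental name became a stable API in TensorFlow 2.3, with the experimental alias still available afterwards. The sketch below picks the symbol name from a version string such as tf.__version__. The helper name tpu_strategy_symbol is hypothetical, and choosing the right symbol is not claimed to resolve the _Send error.

```python
def tpu_strategy_symbol(tf_version):
    """Return the dotted name of the TPUStrategy class to use for a TF version.

    Only the major and minor components are inspected, so nightly strings
    such as '2.4.0-dev20200730' or '2.4.0.dev20200708' are accepted too.
    Assumption: tf.distribute.TPUStrategy is available from TF 2.3 onwards.
    """
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    if (major, minor) >= (2, 3):
        return "tf.distribute.TPUStrategy"
    return "tf.distribute.experimental.TPUStrategy"


# Versions mentioned in this thread.
for version in ("2.1.0", "2.2.0-dev20200501", "2.3.0", "2.4.0.dev20200708"):
    print(version, "->", tpu_strategy_symbol(version))
```

In a real script the argument would be tf.__version__, and the returned name tells you which of the two constructors from this thread to call.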

It seems this is now resolved. I am closing it.

Please re-file the issue if you encounter additional problems.

pengchongjin on 8 May 2020

The issue is still there in tf-nightly 2.3.0-dev20200625.

legacyai on 25 Jun 2020

Same issue with tf-nightly==2.4.0.dev20200708 and tf-nightly==2.4.0.dev20200709.

lai-agent-t on 10 Jul 2020

👍1

I'm facing the same issue in the TF 2.3 stable release.

import tensorflow as tf

try:
    # With no arguments, the resolver tries to detect the TPU from the environment.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in TensorFlow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

The error:

InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}

Could you please reopen the issue? Thanks.

bisakhmondal on 29 Jul 2020

👍2

Same issue with tf-nightly==2.4.0-dev20200730.