TensorBoard: Rendezvous of RPC terminated due to overflow encountered in long_scalars

Created on 4 Apr 2018 · 13 comments · Source: tensorflow/tensorboard (issue #1103)

TensorFlow/TensorBoard version: 1.7.0
OS Platform and version: Windows 10
Python version: 3.6

I use this snippet of code to set and increment the variable count_step:

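import tensorflow as tf

# In the model's constructor: a non-trainable int64 step counter
# and an op that increments it by one.
self.count_step = tf.Variable(0, trainable=False, dtype=tf.int64, name='count_step')
self.count_step_increase = tf.assign(self.count_step, self.count_step + 1)

# Method of the same class: run the increment op and return the new value.
def increase_count_step(self):
    return self.sess.run(self.count_step_increase)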

This works properly when I run my code without tf_debug. However, if I switch to

sess = tf_debug.TensorBoardDebugWrapperSession(sess, 'localhost:6064')

then, when increase_count_step is called, I first get this warning:

C:\Users\user\Anaconda3\envs\env\lib\site-packages\tensorflow\python\ops\nn_ops.py:2161: RuntimeWarning: overflow encountered in long_scalars
(output_count * filter_in_depth * filter_height * filter_width * 2))

That is strange, because although my model has some convolutional layers, none of them is involved when
increase_count_step is called.
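
A note on the warning: "long_scalars" refers to NumPy scalars of the C type `long`, which is 32 bits wide on Windows, so the product on the quoted line can exceed 2**31 - 1 and wrap around. The sketch below reproduces that wraparound with plain Python integers instead of NumPy. The layer dimensions and helper names are hypothetical, and the thread does not establish whether this warning is related to the later gRPC crash.

```python
# Simulate 32-bit signed integer arithmetic (like a C `long` on Windows)
# to show how the conv FLOP-count product can overflow.
INT32_BITS = 32


def wrap_int32(value: int) -> int:
    """Reduce an arbitrary Python int into the signed 32-bit range."""
    modulus = 1 << INT32_BITS
    half = 1 << (INT32_BITS - 1)
    return (value + half) % modulus - half


def conv_flops(output_count, filter_in_depth, filter_height, filter_width, wrap=False):
    """Compute output_count * filter_in_depth * filter_height * filter_width * 2.

    With wrap=True, every intermediate product is truncated to int32,
    mimicking fixed-width scalar multiplication.
    """
    result = output_count
    for factor in (filter_in_depth, filter_height, filter_width, 2):
        result *= factor
        if wrap:
            result = wrap_int32(result)
    return result


# Hypothetical layer: batch 64, 56x56 output, 128 output channels,
# 3x3 kernel over 64 input channels.
output_count = 64 * 56 * 56 * 128
exact = conv_flops(output_count, 64, 3, 3)
wrapped = conv_flops(output_count, 64, 3, 3, wrap=True)
print(exact)    # 29595009024
print(wrapped)  # -469762048
```

Because wrapping after each multiplication gives the same result as wrapping once at the end (arithmetic modulo 2**32), the truncated value equals wrap_int32(exact).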

After this, the gRPC wrapper crashes and I get this error:

  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\tensorflow\python\debug\wrappers\grpc_wrapper.py", line 225, in run
    self._sent_graph_version)
  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\tensorflow\python\debug\wrappers\grpc_wrapper.py", line 61, in publish_traceback
    send_source=True)
  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\tensorflow\python\debug\lib\source_remote.py", line 192, in send_graph_tracebacks
    graph=graph, send_source=send_source)
  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\tensorflow\python\debug\lib\source_remote.py", line 166, in _send_call_tracebacks
    stub.SendTracebacks(call_traceback)
  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\grpc\_channel.py", line 487, in __call__
    return _end_unary_response_blocking(state, call, False, deadline)
  File "C:\Users\user\Anaconda3\envs\env\lib\site-packages\grpc\_channel.py", line 437, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.RESOURCE_EXHAUSTED, Received message larger than max (10116890 vs. 4194304))>
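
The two numbers in the error are byte counts: the size of the message that was sent and the receiver's limit. 4194304 is 4 * 1024 * 1024, i.e. 4 MiB, which is gRPC's default maximum receive message size. A small stdlib helper (hypothetical, not part of TensorFlow or gRPC) that extracts both numbers from such an error and reports how far the limit was exceeded:

```python
import re

# Matches the detail text seen in this thread, e.g.
# "Received message larger than max (10116890 vs. 4194304)"
_LIMIT_RE = re.compile(r"Received message larger than max \((\d+) vs\. (\d+)\)")


def parse_size_error(text):
    """Return (received_bytes, limit_bytes), or None if the text does not match."""
    match = _LIMIT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe(text):
    """Summarize a size error in human-readable form."""
    parsed = parse_size_error(text)
    if parsed is None:
        return "no size error found"
    received, limit = parsed
    return (f"{received} bytes received, limit {limit} bytes "
            f"({received / limit:.2f}x the limit, {received - limit} bytes over)")


message = "Received message larger than max (10116890 vs. 4194304)"
print(describe(message))
# 10116890 bytes received, limit 4194304 bytes (2.41x the limit, 5922586 bytes over)
```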

debugger

JCMiles

Most helpful comment

By default, RPCs have a maximum payload size of 4194304 bytes (4 MiB). Your program raises an exception when the debugger logic (within TensorBoardDebugWrapperSession) tries to send the graph of the model to TensorBoard. Perhaps the graph is large.

For now, would it be OK for you to prevent the debugger from sending the graph (and Python tracebacks) to TensorBoard, by setting send_traceback_and_source_code to False?

from tensorflow.python import debug as tf_debug

sess = tf_debug.TensorBoardDebugWrapperSession(
    sess=sess,
    grpc_debug_server_addresses=['localhost:6064'],
    send_traceback_and_source_code=False)

You would not be able to use the debugger plugin's source code view, which maps tensor values to lines of Python code. However, the graph will still show in TensorBoard, because it is delivered by a path other than this RPC.

cc @caisq

chihuahua on 5 Apr 2018

All 13 comments

I am facing the same issue when trying to run the debugger with Keras. As suggested, I have tried the send_traceback_and_source_code=False flag:

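# Debugging: wrap the Keras session for the TensorBoard debugger.
# The imports assume standalone Keras with the TensorFlow backend.
from keras import backend as K
from keras.callbacks import TensorBoard
from tensorflow.python import debug as tf_debug

K.set_session(tf_debug.TensorBoardDebugWrapperSession(K.get_session(),
                                                      'localhost:6064',
                                                      send_traceback_and_source_code=False))

autoencoder.fit_generator(generator=get_train_generator(),
                          epochs=100,
                          steps_per_epoch=798,  # batch size 128
                          validation_data=get_val_generator(),
                          validation_steps=171,
                          workers=0,
                          use_multiprocessing=True,
                          callbacks=[TensorBoard(log_dir='./iris_ae/1')])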

Edit: It turns out I was calling K.set_session repeatedly without restarting the Jupyter kernel. Restarting the kernel fixed it.
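
Why re-running the cell matters, assuming K.get_session() returns whatever session was last passed to K.set_session: the second run wraps the existing debug wrapper inside another one. A minimal sketch of a guard against that nesting, using stand-in classes instead of TensorFlow/Keras (FakeSession, DebugWrapper, wrap_once and nesting_depth are all hypothetical names):

```python
class FakeSession:
    """Stand-in for a tf.Session (hypothetical)."""


class DebugWrapper:
    """Stand-in for a debug wrapper session; it only holds the wrapped session."""

    def __init__(self, sess):
        self.wrapped = sess


def wrap_once(sess, wrapper_cls=DebugWrapper):
    """Wrap `sess` unless it is already an instance of `wrapper_cls`.

    Calling this again (e.g. re-running a notebook cell) returns the
    existing wrapper instead of nesting a new one around it.
    """
    if isinstance(sess, wrapper_cls):
        return sess
    return wrapper_cls(sess)


def nesting_depth(sess, wrapper_cls=DebugWrapper):
    """Count how many wrappers surround the innermost session."""
    depth = 0
    while isinstance(sess, wrapper_cls):
        sess = sess.wrapped
        depth += 1
    return depth


guarded = FakeSession()
naive = FakeSession()
for _ in range(3):  # simulate running the same cell three times
    guarded = wrap_once(guarded)
    naive = DebugWrapper(naive)
print(nesting_depth(guarded), nesting_depth(naive))  # 1 3
```

With the real classes, the analogous check would presumably be isinstance(K.get_session(), tf_debug.TensorBoardDebugWrapperSession); that adaptation is untested here, and restarting the kernel remains the simplest fix.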

AgrawalAmey on 15 Apr 2018

I am also having the same issue with the MNIST debug example code.

Please run the following command to reproduce:

python -m tensorflow.python.debug.examples.debug_mnist --tensorboard_debug_address localhost:6064

agupta74 on 27 Apr 2018

It looks like I was able to resolve the issue after removing the proxy settings
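
The thread does not say which proxy settings were removed. Assuming they were environment variables such as http_proxy or https_proxy, which gRPC's C core (wrapped by grpcio) can use to route channels through an HTTP proxy, a connection to localhost:6064 might go through the proxy. A stdlib sketch (helper names are hypothetical) that launches the MNIST example from this thread with those variables stripped:

```python
import os
import subprocess
import sys

# Proxy variable names to drop, compared case-insensitively (an assumption
# about which settings matter; adjust for your environment).
PROXY_VARS = {"http_proxy", "https_proxy", "grpc_proxy"}


def env_without_proxies(env=None):
    """Return a copy of `env` (default: os.environ) without proxy variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k.lower() not in PROXY_VARS}


def run_debug_mnist():
    """Run the debug_mnist example with proxy variables removed; return its exit code."""
    cmd = [sys.executable, "-m", "tensorflow.python.debug.examples.debug_mnist",
           "--tensorboard_debug_address", "localhost:6064"]
    return subprocess.run(cmd, env=env_without_proxies()).returncode
```

Stripping the variables only for this child process leaves the rest of the system's proxy configuration untouched.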

agupta74 on 27 Apr 2018

Could you please be more specific? I'm still not able to run the TensorBoard
debugger.

JCMiles on 16 May 2018

@caisq Would it be possible to turn off gRPC size limits only when transmitting tf.GraphDef?

jart on 17 May 2018

@jart As discussed in #1202, the issue here is not GraphDef, which is already divided into small chunks prior to sending through gRPC. The issue here is source code and Python tracebacks, which are populated in a single protocol buffer and sent through gRPC. This fails when the size of the protocol buffer is larger than 4 MB, due to, for example, large source files or many of them. The current workaround is using send_traceback_and_source_code=False as mentioned above.

I'll work on a fix to this issue.
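
The distinction caisq draws is between a payload that is split before sending (GraphDef) and one sent as a single protocol buffer (source files and tracebacks). A generic sketch of the splitting idea on raw bytes, with hypothetical function names; it is not TensorFlow's implementation and does not reproduce the eventual fix:

```python
# Default gRPC max message size seen in the errors in this thread.
DEFAULT_MAX_BYTES = 4 * 1024 * 1024


def split_payload(data: bytes, max_bytes: int = DEFAULT_MAX_BYTES):
    """Split `data` into consecutive chunks of at most `max_bytes` each."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]


def join_payload(chunks):
    """Reassemble chunks produced by split_payload, in order."""
    return b"".join(chunks)


payload = bytes(10116890)  # same size as the message in the original error
chunks = split_payload(payload)
print([len(c) for c in chunks])  # [4194304, 4194304, 1728282]
```

In a real protocol each chunk would travel in its own message and the receiver would reassemble the bytes before parsing; only the byte splitting is shown here.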

caisq on 22 May 2018

This issue should have been fixed by https://github.com/tensorflow/tensorflow/commit/d2090672fe8305289156460c43f7fcc1a5dd5422#diff-3a99c734be4389d1419cb133c18e5c29, and the fix will become available in TensorFlow release 1.9.

caisq on 8 Jun 2018

I'm still getting this error with 1.9.0-rc1 as well as with the current nightly.

enricoschroeder on 27 Jun 2018

@schroederen Just to help me diagnose your issue better, what happens if you set send_traceback_and_source_code to False as suggested in one of the posts above?

caisq on 27 Jun 2018

@caisq This fixes the problem. However, I was under the impression that the latest bug fixes are supposed to make it work with send_traceback_and_source_code=True as well?

enricoschroeder on 3 Jul 2018

The issue is still present in TF 1.11:

  File ".local/lib/python3.6/site-packages/grpc/_channel.py", line 466, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.RESOURCE_EXHAUSTED
    details = "Received message larger than max (5415050 vs. 4194304)"
    debug_error_string = "{"created":"@1547184417.371441869","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1099,"grpc_message":"Received message larger than max (5415050 vs. 4194304)","grpc_status":8}"
>