I can train an Object Detection model just fine locally, but when I try to run the training on CloudML, it runs for a little bit (during the last run it ran for about 340 steps) and then terminates because of the following error:
UnavailableError: Endpoint read failed
The full stack trace is pasted at the end of this post.
Full stack trace:
severity: "ERROR"
textPayload: "The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed
I'm facing the same issue.
While training on CloudML, the job threw an error after 640 iterations.
textPayload: "The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
UnavailableError: Endpoint read failed
@tombstone can you please take a look or point this to the CloudML folks?
@glarchev Can you try changing the runtime version to 1.2 in the command?
I installed TensorFlow 1.4, but with runtime version 1.4 the training could not be completed. When I tried runtime version 1.2, it executed successfully.
--runtime-version 1.2
Thanks for the feedback. Please use runtime version 1.2 as @vasudevmaduri suggested for now.
We are investigating the issue.
I can confirm that changing to --runtime-version 1.2 fixes the problem
It fixed the problem, but after exporting the inference graph I end up with another error.
InternalError Traceback (most recent call last)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1322 try:
-> 1323 return fn(*args)
1324 except errors.OpError as e:
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
1301 feed_dict, fetch_list, target_list,
-> 1302 status, run_metadata)
1303
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
472 compat.as_text(c_api.TF_Message(self.status.status)),
--> 473 c_api.TF_GetCode(self.status.status))
474 # Delete the underlying status object from memory otherwise it stays alive
InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
During handling of the above exception, another exception occurred:
InternalError Traceback (most recent call last)
<ipython-input-9-7493eea60222> in <module>()
20 (boxes, scores, classes, num) = sess.run(
21 [detection_boxes, detection_scores, detection_classes, num_detections],
---> 22 feed_dict={image_tensor: image_np_expanded})
23 # Visualization of the results of a detection.
24 vis_util.visualize_boxes_and_labels_on_image_array(
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
887 try:
888 result = self._run(None, fetches, feed_dict, options_ptr,
--> 889 run_metadata_ptr)
890 if run_metadata:
891 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1118 if final_fetches or final_targets or (handle and feed_dict_tensor):
1119 results = self._do_run(handle, final_targets, final_fetches,
-> 1120 feed_dict_tensor, options, run_metadata)
1121 else:
1122 results = []
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1315 if handle is None:
1316 return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317 options, run_metadata)
1318 else:
1319 return self._do_call(_prun_fn, self._session, handle, feeds, fetches)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1334 except KeyError:
1335 pass
-> 1336 raise type(e)(node_def, op, message)
1337
1338 def _extend_graph(self):
InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
Caused by op 'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D', defined at:
File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
app.start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2728, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2850, in run_ast_nodes
if self.run_code(code, result):
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-0d8b8f2357e8>", line 7, in <module>
tf.import_graph_def(od_graph_def, name='')
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\importer.py", line 313, in import_graph_def
op_def=op_def)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
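A guess, not a confirmed diagnosis: cuDNN launch failures at inference time are often a symptom of the GPU running out of memory. One thing worth trying is creating the session with allow_growth enabled so TensorFlow allocates GPU memory on demand. Below is a minimal sketch along the lines of the notebook code in the traceback above; the frozen-graph path and the dummy input are placeholders, and the tensor names are the standard ones from an exported Object Detection API graph.

import numpy as np
import tensorflow as tf

PATH_TO_FROZEN_GRAPH = 'exported_model/frozen_inference_graph.pb'  # placeholder path

# Load the exported detection graph.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
    tf.import_graph_def(od_graph_def, name='')

# Allocate GPU memory on demand instead of grabbing it all up front; this
# sometimes avoids cuDNN launch failures caused by memory exhaustion.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(graph=detection_graph, config=config) as sess:
    image_np_expanded = np.zeros((1, 600, 600, 3), dtype=np.uint8)  # dummy image batch
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    fetches = [detection_graph.get_tensor_by_name(name + ':0')
               for name in ('detection_boxes', 'detection_scores',
                            'detection_classes', 'num_detections')]
    boxes, scores, classes, num = sess.run(
        fetches, feed_dict={image_tensor: image_np_expanded})

If the error goes away with allow_growth, it points to memory pressure rather than a corrupted graph.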
I can export the inference graph, but for some reason the resulting model performs a lot worse than the model trained locally.
I tried using the MobileNet model and it worked. However, the results are not that great, as you mentioned. When I use Faster R-CNN for better accuracy, it throws the error above after exporting the inference graph. What model did you use?
I used Faster R-CNN as a seed. It seems to work as expected when trained locally, but gives poor results when trained via CloudML.
@glarchev, could you provide more details about your training results? I tried training via CloudML and got 94.08% mAP.
I'm also seeing the "Endpoint read failed" error when switching from 1.2 to 1.4. Is this related to this gRPC issue?
@jiaxunwu I don't have hard metrics for my training results; I typically train until the loss is low enough and then visually evaluate the resulting model (my application is object detection). With local training, the model performs roughly as expected. With CloudML training, however, it seems to produce a lot of false positives (even though the training set is the same and the loss at the end of training is roughly the same).
Has anybody resolved this on the TF 1.4 runtime?
I am also getting the same error now, though I haven't made any code changes. It doesn't fail at the beginning; it trains for an arbitrary number of steps and then fails.
@puneetjindal could you try 1.2 instead as described in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md?
But we have dataset API limitations in 1.2.
I'm running into this issue as well. Using 1.4 with a single GPU it works fine, but when I try scaling up with anywhere from 1 to 10 worker nodes it runs into UnavailableError: Endpoint read failed, which I'm guessing happens when TensorFlow loses connectivity to a node.
I can't really revert to 1.2 because I had to modify the code to work with 1.4 to resolve another issue. Guess it's back to AWS for now...
Same here: 1.4 failed twice at arbitrary points, but 1.2 works. @jiaxunwu Can I export my trained model with TF 1.4 and expect it to work? My TF Serving infrastructure is based on TF 1.4.
So far this issue seems to mainly affect Faster R-CNN (I've tried ResNet-101), similar to what others have described: it fails during the first several hundred iterations, and the loss also drops far too quickly to be believable.
Has anyone had any luck with different Faster R-CNN models, or do they all fail with this error? I have actually seen this UnavailableError: Endpoint read failed with SSD MobileNet as well, but only once or twice, usually after an hour or more of training.
Downgrading to 1.2 causes other issues with Cloud ML Engine, but it's good to know this might solve the problem and may be worth trying.
My setup: Google Cloud ML engine
trainingInput:
runtimeVersion: "1.4"
scaleTier: CUSTOM
masterType: standard_gpu
workerCount: 5
workerType: standard_gpu
parameterServerCount: 3
parameterServerType: standard
Is this just a corner case that a few of us are hitting with our various datasets or cloud.yml configurations? I'm surprised this isn't a more popular/urgent issue: while the Object Detection API on Cloud ML is still officially on 1.2 (as @jiaxunwu points out above), in at least two related issues (see my two references above) the workaround is to use 1.4 or 1.5, so using newer versions seems to be the only supported path around those other problems.
Has anyone figured out any workarounds? Right now I am using 1.4 with SSD only, and manually restarting training when I hit a timeout, which is less painful than going back to 1.2 and dealing with customizing more of the ODAPI code.
Judging by the discussion in this issue, this is expected behavior when there is a network connection issue. The remedy is to catch the error and restart the session. In TF 1.4 this is done for us by the Estimator class, so I would guess there won't be any incentive to "fix" the existing script, and the way forward for using the object detection code with 1.4 is to rewrite the relevant parts using the Estimator API.
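For illustration, here is a minimal, self-contained sketch of that Estimator pattern with a toy model and input function standing in for the Object Detection API's real ones (all names below are illustrative, not taken from the API). tf.estimator.train_and_evaluate drives training through a monitored session, which is the layer that can recover from transient errors such as UnavailableError instead of letting the worker die.

import numpy as np
import tensorflow as tf

# Toy stand-ins for the real model_fn and input pipeline; only the Estimator
# wiring matters here.
def model_fn(features, labels, mode):
    logits = tf.layers.dense(features['x'], units=2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def input_fn():
    x = np.random.rand(64, 4).astype(np.float32)
    y = np.random.randint(0, 2, size=(64,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({'x': x}, y)).repeat().batch(8)
    return dataset.make_one_shot_iterator().get_next()

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(model_dir='/tmp/estimator_demo',  # placeholder dir
                                  save_checkpoints_steps=500))

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)

# train_and_evaluate runs the training loop inside a monitored session, the
# layer that can restart after transient errors instead of letting the
# worker exit with a non-zero status.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)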
It is not possible for me to move back down to 1.2. With 1.2 I get the error: TensorFlow AttributeError: 'module' object has no attribute data. That was the reason for moving to 1.4+. Is there any way to fix this other than reducing the worker count to 1?
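For what it's worth, that AttributeError is what you get when code written against tf.data runs on the 1.2 runtime, because the Dataset API only moved into core TensorFlow in 1.4; in 1.2 the same classes live under tf.contrib.data. A version-tolerant sketch (the record filename and batch size are placeholders):

import tensorflow as tf

# In TF 1.4+ the Dataset API is tf.data; in TF 1.2 the same classes live in
# tf.contrib.data, so code that references tf.data on the 1.2 runtime raises
# AttributeError: 'module' object has no attribute 'data'.
if hasattr(tf, 'data'):
    data_module = tf.data          # TF 1.4+
else:
    data_module = tf.contrib.data  # TF 1.2 fallback

dataset = data_module.TFRecordDataset(['train.record'])  # placeholder filename
dataset = dataset.repeat().batch(24)
next_batch = dataset.make_one_shot_iterator().get_next()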
When I moved from runtime version 1.4 to 1.2, I started getting a weird error: "/usr/bin/python: No module named util". FYI, I have not changed anything else, and to double-check I used 1.4 again: the model ran for an hour and then the job failed with UnavailableError: Endpoint read failed.
Closing this issue since it's resolved. Feel free to reopen if the issue still persists. Thanks!