I can train an Object Detection model just fine locally, but when I try to run the training on CloudML, it runs for a little bit (during the last run it ran for about 340 steps) and then terminates because of the following error:
UnavailableError: Endpoint read failed
The full stack trace is pasted at the end of this post.
Full stack trace:
severity: "ERROR"
textPayload: "The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed
I'm facing the same issue.
While training on CloudML, the job threw an error after 640 iterations.
textPayload: "The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
UnavailableError: Endpoint read failed
@tombstone can you please take a look or point this to the CloudML folks?
@glarchev Can you try changing the runtime version to 1.2 in the command?
I installed TensorFlow 1.4, but with runtime version 1.4 the training could not be completed. When I tried runtime version 1.2, it executed successfully.
--runtime-version 1.2
Thanks for the feedback. Please use runtime version 1.2 as @vasudevmaduri suggested for now.
We are investigating the issue.
I can confirm that changing to --runtime-version 1.2 fixes the problem
It fixed the problem, but after exporting the inference graph I end up with another error.
InternalError Traceback (most recent call last)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1322 try:
-> 1323 return fn(*args)
1324 except errors.OpError as e:
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
1301 feed_dict, fetch_list, target_list,
-> 1302 status, run_metadata)
1303
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
472 compat.as_text(c_api.TF_Message(self.status.status)),
--> 473 c_api.TF_GetCode(self.status.status))
474 # Delete the underlying status object from memory otherwise it stays alive
InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
During handling of the above exception, another exception occurred:
InternalError Traceback (most recent call last)
<ipython-input-9-7493eea60222> in <module>()
20 (boxes, scores, classes, num) = sess.run(
21 [detection_boxes, detection_scores, detection_classes, num_detections],
---> 22 feed_dict={image_tensor: image_np_expanded})
23 # Visualization of the results of a detection.
24 vis_util.visualize_boxes_and_labels_on_image_array(
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
887 try:
888 result = self._run(None, fetches, feed_dict, options_ptr,
--> 889 run_metadata_ptr)
890 if run_metadata:
891 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1118 if final_fetches or final_targets or (handle and feed_dict_tensor):
1119 results = self._do_run(handle, final_targets, final_fetches,
-> 1120 feed_dict_tensor, options, run_metadata)
1121 else:
1122 results = []
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1315 if handle is None:
1316 return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317 options, run_metadata)
1318 else:
1319 return self._do_call(_prun_fn, self._session, handle, feeds, fetches)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1334 except KeyError:
1335 pass
-> 1336 raise type(e)(node_def, op, message)
1337
1338 def _extend_graph(self):
InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
Caused by op 'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D', defined at:
File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
app.start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2728, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2850, in run_ast_nodes
if self.run_code(code, result):
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-0d8b8f2357e8>", line 7, in <module>
tf.import_graph_def(od_graph_def, name='')
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\importer.py", line 313, in import_graph_def
op_def=op_def)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
[[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
[[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]
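A guess, not a confirmed diagnosis: cuDNN launch failures at inference time are often a symptom of the GPU running out of memory. One thing worth trying is creating the session with allow_growth enabled so TensorFlow allocates GPU memory on demand. Below is a minimal sketch along the lines of the notebook code in the traceback above; the frozen-graph path and the dummy input are placeholders, and the tensor names are the standard ones from an exported Object Detection API graph.

import numpy as np
import tensorflow as tf

PATH_TO_FROZEN_GRAPH = 'exported_model/frozen_inference_graph.pb'  # placeholder path

# Load the exported detection graph.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
    tf.import_graph_def(od_graph_def, name='')

# Allocate GPU memory on demand instead of grabbing it all up front; this
# sometimes avoids cuDNN launch failures caused by memory exhaustion.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(graph=detection_graph, config=config) as sess:
    image_np_expanded = np.zeros((1, 600, 600, 3), dtype=np.uint8)  # dummy image batch
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    fetches = [detection_graph.get_tensor_by_name(name + ':0')
               for name in ('detection_boxes', 'detection_scores',
                            'detection_classes', 'num_detections')]
    boxes, scores, classes, num = sess.run(
        fetches, feed_dict={image_tensor: image_np_expanded})

If the error goes away with allow_growth, it points to memory pressure rather than a corrupted graph.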
I can export the inference graph, but for some reason the resulting model performs a lot worse than the model trained locally.
I tried using the MobileNet model and it worked. However, the results are not that great, as you mentioned. When I use Faster R-CNN for better accuracy, it throws the error above after exporting the inference graph. What model did you use?
I used Faster R-CNN as a seed. It seems to work as expected when trained locally, but gives poor results when trained via CloudML.
@glarchev, could you provide more details about your training results? I tried training via CloudML and got 94.08% mAP.
I'm also seeing the "Endpoint read failed" error when switching from 1.2 to 1.4. Is this related to this gRPC issue?
@jiaxunwu I don't have hard metrics for my training results; I typically train until the loss is low enough and then visually evaluate the resulting model (my application is object detection). With local training, the model performs roughly as expected. With CloudML training, however, it seems to produce a lot of false positives (even though the training set is the same and the loss at the end of training is roughly the same).
Has anybody resolved this on the TF 1.4 runtime?
I am also getting the same error now, though I haven't made any code changes. It doesn't fail at the beginning; it trains for an arbitrary number of steps and then fails.
@puneetjindal could you try 1.2 instead as described in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md?
But we have dataset API limitations in 1.2.
I'm running into this issue as well. Using 1.4 with a single GPU it works fine, but when I try scaling up with anywhere from 1 to 10 worker nodes it runs into UnavailableError: Endpoint read failed, which I'm guessing happens when TensorFlow loses connectivity to a node.
I can't really revert to 1.2 because I had to modify the code to work with 1.4 to resolve another issue. Guess it's back to AWS for now...
Same here: 1.4 failed twice at arbitrary points, but 1.2 works. @jiaxunwu Can I export my trained model with TF 1.4 and expect it to work? My TF Serving infrastructure is based on TF 1.4.
So far this issue seems to mainly affect Faster R-CNN (I've tried ResNet-101), similar to what others have described: it fails during the first several hundred iterations, and the loss also drops far too quickly to be believable.
Has anyone had any luck with different Faster R-CNN models, or do they all fail with this error? I have actually seen this UnavailableError: Endpoint read failed with SSD MobileNet as well, but only once or twice, usually after an hour or more of training.
Downgrading to 1.2 causes other issues with Cloud ML Engine, but it's good to know this might solve the problem and may be worth trying.
My setup: Google Cloud ML engine
trainingInput:
runtimeVersion: "1.4"
scaleTier: CUSTOM
masterType: standard_gpu
workerCount: 5
workerType: standard_gpu
parameterServerCount: 3
parameterServerType: standard
Is this just a corner case that a few of us are hitting with our various datasets or cloud.yml configurations? I'm surprised this isn't a more popular/urgent issue: while the Object Detection API on Cloud ML is still officially on 1.2 (as @jiaxunwu points out above), in at least two related issues (see my two references above) the workaround is to use 1.4 or 1.5, so using newer versions seems to be the only supported path around those other problems.
Has anyone figured out any workarounds? Right now I am using 1.4 with SSD only, and manually restarting training when I hit a timeout, which is less painful than going back to 1.2 and dealing with customizing more of the ODAPI code.
Judging by the discussion in this issue, this is expected behavior when there is a network connection issue. The remedy is to catch the error and restart the session. In TF 1.4 this is done for us by the Estimator class, so I would guess there won't be any incentive to "fix" the existing script, and the way forward for using the object detection code with 1.4 is to rewrite the relevant parts using the Estimator API.
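For illustration, here is a minimal, self-contained sketch of that Estimator pattern with a toy model and input function standing in for the Object Detection API's real ones (all names below are illustrative, not taken from the API). tf.estimator.train_and_evaluate drives training through a monitored session, which is the layer that can recover from transient errors such as UnavailableError instead of letting the worker die.

import numpy as np
import tensorflow as tf

# Toy stand-ins for the real model_fn and input pipeline; only the Estimator
# wiring matters here.
def model_fn(features, labels, mode):
    logits = tf.layers.dense(features['x'], units=2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def input_fn():
    x = np.random.rand(64, 4).astype(np.float32)
    y = np.random.randint(0, 2, size=(64,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({'x': x}, y)).repeat().batch(8)
    return dataset.make_one_shot_iterator().get_next()

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(model_dir='/tmp/estimator_demo',  # placeholder dir
                                  save_checkpoints_steps=500))

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)

# train_and_evaluate runs the training loop inside a monitored session, the
# layer that can restart after transient errors instead of letting the
# worker exit with a non-zero status.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)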
It is not possible for me to move back down to 1.2. With 1.2 I get the error: TensorFlow AttributeError: 'module' object has no attribute data. That was the reason for moving to 1.4+. Is there any way to fix this other than reducing the worker count to 1?
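For what it's worth, that AttributeError is what you get when code written against tf.data runs on the 1.2 runtime, because the Dataset API only moved into core TensorFlow in 1.4; in 1.2 the same classes live under tf.contrib.data. A version-tolerant sketch (the record filename and batch size are placeholders):

import tensorflow as tf

# In TF 1.4+ the Dataset API is tf.data; in TF 1.2 the same classes live in
# tf.contrib.data, so code that references tf.data on the 1.2 runtime raises
# AttributeError: 'module' object has no attribute 'data'.
if hasattr(tf, 'data'):
    data_module = tf.data          # TF 1.4+
else:
    data_module = tf.contrib.data  # TF 1.2 fallback

dataset = data_module.TFRecordDataset(['train.record'])  # placeholder filename
dataset = dataset.repeat().batch(24)
next_batch = dataset.make_one_shot_iterator().get_next()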
When I moved from runtime version 1.4 to 1.2, I started getting a weird error: "/usr/bin/python: No module named util". FYI, I have not changed anything else, and to double-check I used 1.4 again: the model ran for an hour and then the job failed with UnavailableError: Endpoint read failed.
Closing this issue since it's resolved. Feel free to reopen if the issue still persists. Thanks!