Models: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Created on 28 Mar 2018 · 15 comments · Source: tensorflow/models

Describe the problem

While running a training job on Cloud ML Engine (Runtime Version 1.6), after about ~5900 steps the job fails and worker-replica-0 reports the following message:

Error reported to Coordinator: , OS Error

Finally, worker-replica-0 returns:

returned non-zero exit status 1

Please see the logs/traceback below.

System information

  • What is the top-level directory of the model you are using:
    /models/research
  • Have I written custom code:
    No. I am running the latest (2018/03/27) commit of the Object Detection API on Cloud ML Engine. However, I did have to make a small Python 3 compatibility fix: in models/research/object_detection/utils/learning_schedules.py, line 168, I wrapped the range() call with list():
rate_index = tf.reduce_max(tf.where(tf.greater_equal(global_step, boundaries),
                                    list(range(num_boundaries)),
                                    [0] * num_boundaries))
  • OS Platform and Distribution:
    Google Cloud ML Engine Runtime 1.6
  • TensorFlow version (use command below):
    1.6
  • GPU model and memory:
    Nvidia Tesla P100 16GB
  • Exact command to reproduce:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://${TRAIN_DIR} \
    --packages gs://${DIST_DIR}/object_detection-0.1.tar.gz,gs://${DIST_DIR}/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PATH_TO_LOCAL_YAML_FILE} \
    -- \
    --train_dir=gs://${TRAIN_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
    --num_clones=4
  • Exact Cloud ML Engine YAML file to reproduce:
trainingInput:
  runtimeVersion: '1.6'
  pythonVersion: '3.5'
  scaleTier: CUSTOM
  masterType: complex_model_m_p100
  workerType: complex_model_m_p100
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 3
  • Exact pipeline config (--pipeline_config_path) to reproduce:
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 3e-4
          schedule {
            step: 900000
            learning_rate: 3e-5
          }
          schedule {
            step: 1200000
            learning_rate: 3e-6
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 30000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_train.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
}

eval_config: {
  num_examples: 25
  # Note: The below line limits the evaluation process to 30 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 30
  visualization_export_dir: "gs://PATH_TO_BE_CONFIGURED/visualization"
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_val.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

Logs

worker-replica-0

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Most helpful comment

I had the same error when the TensorFlow instances were trying to contact each other. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Maybe this can help.

All 15 comments

Hi, I managed to run the faster_rcnn kitti model with transfer learning on my own dataset by disabling the workers (so only the standard_gpu master) and two large_model parameter servers, up to ~80k steps. I suspect that the issue is caused by an OOM error, as the GPU's (Tesla) memory is at ~40%.

Hi @MoMo-Tech, how can removing the workers alleviate GPU OOM issues?

Hi @pkulzc, if you repeat the same experiment with Cloud ML Engine TF 1.5 instead of 1.6, you'll reproduce #3757. I believe they are the same issue, manifesting differently in TF 1.5 and TF 1.6.

On my initial runs I noticed that the workers were assigned the same machine ID. With this in mind, I changed the GPU type to the next Tesla tier and noticed that a few more steps were computed, but it still failed at ~10K epochs. There are also a few other tickets that claim there might be an OOM handled by the OS, which would result in the OS Error.

Thanks for the info, we're investigating this as well as #3757 now.

Hi @pkulzc, today I pulled the HEAD of the models repository and tried again, and the problem still persists. However, I got more informative error messages than I was getting before:

master-replica-0

Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 =
  _Recv[client_terminated=false,
  recv_device="/job:ps/replica:0/task:2/device:CPU:0",
  send_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device_incarnation=4406029107278666217,
  tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv",
  tensor_type=DT_FLOAT,
  _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 =
  _Recv[client_terminated=false,
  recv_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device="/job:ps/replica:0/task:2/device:CPU:0",
  send_device_incarnation=7295333245552102225,
  tensor_name="edge_32739_Momentum/update/NoOp_2",
  tensor_type=DT_FLOAT,
  _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 =
  _Recv[client_terminated=false,
  recv_device="/job:ps/replica:0/task:3/device:CPU:0",
  send_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device_incarnation=4406029107278666217,
  tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv",
  tensor_type=DT_FLOAT,
  _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.

worker-replica-0

Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
2018-04-04 16:19:47.928690: E tensorflow/core/distributed_runtime/master_session.cc:1663] Cleanup partition error: Unavailable: OS Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
    if sess.run(train_step_kwargs['should_log']):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape. 

ml-engine

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
    if sess.run(train_step_kwargs['should_log']):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Hi @pkulzc, thank you for investigating this issue!

Are you able to share any updates or information regarding the status of this issue?

Thanks 👍

Sorry for the delay; right now the Cloud team and the TensorFlow team are still investigating.

Great!

Thanks for the update

Hi @pkulzc, I think I may have a lead:

On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as recommended in the tf.train.Supervisor documentation.

This migration is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245. A rough sketch of the suggested replacement follows.
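
For illustration only, a minimal sketch of what such a migration could look like. This is not the Object Detection API's actual code; the train_loop name and its train_op/master/is_chief/train_dir/num_steps parameters are hypothetical placeholders for values the trainer already builds.

import tensorflow as tf

def train_loop(train_op, master, is_chief, train_dir, num_steps):
    # Stop once the global step reaches num_steps.
    hooks = [tf.train.StopAtStepHook(last_step=num_steps)]
    # MonitoredTrainingSession handles checkpoints/summaries on the chief and,
    # unlike the Supervisor-based slim loop, recreates the session after
    # recoverable errors such as UnavailableError.
    with tf.train.MonitoredTrainingSession(
            master=master,
            is_chief=is_chief,
            checkpoint_dir=train_dir,
            hooks=hooks,
            save_checkpoint_secs=600) as sess:
        while not sess.should_stop():
            sess.run(train_op)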

@MoMo-Tech I also managed to run transfer learning based on the ssd_mobilenet_v1_coco_2017_11_17 model up to about ~80k steps, and it stopped with the following error:

The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 410, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1466216932&resource=ml_job%2Fjob_id%2Ffreshturf_object_detection_1530117506&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22freshturf_object_detection_1530117506%22

The master's memory usage was around 38.8% to 49.7%.

I had the same error when the TensorFlow instances were trying to contact each other. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Maybe this can help.
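
As a sketch of one way to apply that workaround in this setup (an assumption, not anything ML Engine documents): set the variable at the very top of the entry-point module, before TensorFlow is imported, since gRPC reads it when its polling engine is initialized.

import os

# Workaround sketch: ask gRPC for its portable "poll" engine instead of the
# native "epoll" engine. This must be set before TensorFlow (and with it the
# gRPC runtime) is loaded.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # imported only after the environment variable is set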

I believe this issue is gone after our switch to the tf.estimator framework. Closing this.
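
For context, a rough sketch of the estimator-based flow the Object Detection API moved to (the run_training name and its arguments are hypothetical; the repo's estimator-based entry point is object_detection/model_main.py): tf.estimator.train_and_evaluate builds and manages the distributed session itself instead of going through slim's Supervisor-based loop.

import tensorflow as tf

def run_training(model_fn, train_input_fn, eval_input_fn, model_dir, num_steps):
    # The estimator framework owns session creation, checkpointing and
    # distribution, replacing the slim/Supervisor loop used by the old train.py.
    estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir)
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=num_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)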

@pkulzc Can you please clarify how this issue was solved? I am still facing the same problem.

@moussas1 Have you synced to latest?
