Models: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Created on 28 Mar 2018 · 15 comments · Source: tensorflow/models

Describe the problem

While running a training job on Cloud ML Engine (Runtime Version 1.6), after about ~5900 steps the job fails and worker-replica-0 reports the following message:

Error reported to Coordinator: , OS Error

Finally, worker-replica-0 returns:

returned non-zero exit status 1

Please see the logs/traceback below.

System information

  • What is the top-level directory of the model you are using:
    /models/research
  • Have I written custom code:
    No. I am running the latest (2018/03/27) commit of the Object Detection API on Cloud ML Engine. However, I did have to make a small Python 3 compatibility fix: in models/research/object_detection/utils/learning_schedules.py, line 168, I wrapped the range() call with list():
rate_index = tf.reduce_max(tf.where(tf.greater_equal(global_step, boundaries),
                                    list(range(num_boundaries)),
                                    [0] * num_boundaries))
  • OS Platform and Distribution:
    Google Cloud ML Engine Runtime 1.6
  • TensorFlow version (use command below):
    1.6
  • GPU model and memory:
    Nvidia Tesla P100 16GB
  • Exact command to reproduce:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://${TRAIN_DIR} \
    --packages gs://${DIST_DIR}/object_detection-0.1.tar.gz,gs://${DIST_DIR}/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PATH_TO_LOCAL_YAML_FILE} \
    -- \
    --train_dir=gs://${TRAIN_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
    --num_clones=4
  • Exact Cloud ML Engine YAML file to reproduce:
trainingInput:
  runtimeVersion: '1.6'
  pythonVersion: '3.5'
  scaleTier: CUSTOM
  masterType: complex_model_m_p100
  workerType: complex_model_m_p100
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 3
  • Exact pipeline config (--pipeline_config_path) to reproduce:
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 3e-4
          schedule {
            step: 900000
            learning_rate: 3e-5
          }
          schedule {
            step: 1200000
            learning_rate: 3e-6
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 30000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_train.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
}

eval_config: {
  num_examples: 25
  # Note: The below line limits the evaluation process to 30 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 30
  visualization_export_dir: "gs://PATH_TO_BE_CONFIGURED/visualization"
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_val.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

Logs

worker-replica-0

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Most helpful comment

I had the same error when the TensorFlow instances were trying to contact each other. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Maybe this can help.

All 15 comments

Hi, I managed to run the faster_rcnn kitti model with transfer learning on my own dataset by disabling the workers (so only the standard_gpu master) and two large_model parameter servers, up to ~80k steps. I suspect that the issue is caused by an OOM error, as the GPU's (Tesla) memory is at ~40%.

Hi @MoMo-Tech, how can removing the workers alleviate GPU OOM issues?

Hi @pkulzc, if you repeat the same experiment with Cloud ML Engine TF 1.5 instead of 1.6, you'll reproduce #3757. I believe they are the same issue, manifesting differently in TF 1.5 and TF 1.6.

On my initial runs I noticed that the workers were assigned the same machine ID. With this in mind, I changed the GPU type to the next Tesla tier and noticed that a few more steps were computed, but it still failed at ~10K epochs. There are also a few other tickets that claim there might be an OOM handled by the OS, which would result in the OS Error.

Thanks for the info, we're investigating this as well as #3757 now.

Hi @pkulzc, today I pulled the HEAD of the models repository and tried again, and the problem still persists. However, I got more informative error messages than I was getting before:

master-replica-0

Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 =
  _Recv[client_terminated=false,
  recv_device="/job:ps/replica:0/task:2/device:CPU:0",
  send_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device_incarnation=4406029107278666217,
  tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv",
  tensor_type=DT_FLOAT,
  _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 =
  _Recv[client_terminated=false,
  recv_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device="/job:ps/replica:0/task:2/device:CPU:0",
  send_device_incarnation=7295333245552102225,
  tensor_name="edge_32739_Momentum/update/NoOp_2",
  tensor_type=DT_FLOAT,
  _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 =
  _Recv[client_terminated=false,
  recv_device="/job:ps/replica:0/task:3/device:CPU:0",
  send_device="/job:master/replica:0/task:0/device:CPU:0",
  send_device_incarnation=4406029107278666217,
  tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv",
  tensor_type=DT_FLOAT,
  _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.

worker-replica-0

Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
2018-04-04 16:19:47.928690: E tensorflow/core/distributed_runtime/master_session.cc:1663] Cleanup partition error: Unavailable: OS Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
    if sess.run(train_step_kwargs['should_log']):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape. 

ml-engine

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
     [[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
     [[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
     [[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]

The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
    if sess.run(train_step_kwargs['should_log']):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Hi @pkulzc, thank you for investigating this issue!

Are you able to share any updates or information regarding the status of this issue?

Thanks 👍

Sorry for the delay; right now the Cloud team and the TensorFlow team are still investigating.

Great!

Thanks for the update

Hi @pkulzc, I think I may have a lead:

On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as recommended in the tf.train.Supervisor documentation.

This migration is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245. A rough sketch of the suggested replacement follows.
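
For illustration only, a minimal sketch of what such a migration could look like. This is not the Object Detection API's actual code; the train_loop name and its train_op/master/is_chief/train_dir/num_steps parameters are hypothetical placeholders for values the trainer already builds.

import tensorflow as tf

def train_loop(train_op, master, is_chief, train_dir, num_steps):
    # Stop once the global step reaches num_steps.
    hooks = [tf.train.StopAtStepHook(last_step=num_steps)]
    # MonitoredTrainingSession handles checkpoints/summaries on the chief and,
    # unlike the Supervisor-based slim loop, recreates the session after
    # recoverable errors such as UnavailableError.
    with tf.train.MonitoredTrainingSession(
            master=master,
            is_chief=is_chief,
            checkpoint_dir=train_dir,
            hooks=hooks,
            save_checkpoint_secs=600) as sess:
        while not sess.should_stop():
            sess.run(train_op)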

@MoMo-Tech I also managed to run transfer learning based on the ssd_mobilenet_v1_coco_2017_11_17 model up to about ~80k steps, and it stopped with the following error:

The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 410, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1466216932&resource=ml_job%2Fjob_id%2Ffreshturf_object_detection_1530117506&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22freshturf_object_detection_1530117506%22

The master's memory usage was around 38.8% to 49.7%.

I had the same error when the TensorFlow instances were trying to contact each other. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Maybe this can help.
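
As a sketch of one way to apply that workaround in this setup (an assumption, not anything ML Engine documents): set the variable at the very top of the entry-point module, before TensorFlow is imported, since gRPC reads it when its polling engine is initialized.

import os

# Workaround sketch: ask gRPC for its portable "poll" engine instead of the
# native "epoll" engine. This must be set before TensorFlow (and with it the
# gRPC runtime) is loaded.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # imported only after the environment variable is set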

I believe this issue is gone after our switch to the tf.estimator framework. Closing this.
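
For context, a rough sketch of the estimator-based flow the Object Detection API moved to (the run_training name and its arguments are hypothetical; the repo's estimator-based entry point is object_detection/model_main.py): tf.estimator.train_and_evaluate builds and manages the distributed session itself instead of going through slim's Supervisor-based loop.

import tensorflow as tf

def run_training(model_fn, train_input_fn, eval_input_fn, model_dir, num_steps):
    # The estimator framework owns session creation, checkpointing and
    # distribution, replacing the slim/Supervisor loop used by the old train.py.
    estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir)
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=num_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)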

@pkulzc Can you please clarify how this issue was solved? I am still facing the same problem.

@moussas1 Have you synced to latest?
