Models: "Unavailable OSError" when training tensorflow object detection model on google cloud.

Created on 18 Feb 2019 · 8 comments · Source: tensorflow/models

Describe the problem

I have been trying to retrain a pretrained model from the TensorFlow Object Detection API on Google Cloud, but after roughly 40 attempts I am stuck with this error. Sometimes the job fails right at the start, and sometimes it runs for a few steps and then fails.

System information

  • What is the top-level directory of the model you are using: models/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Only config modification
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 (base). Training on Google Cloud ML Engine runtime 1.9
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): 1.9 gpu
  • Bazel version (if compiling from source):

  • CUDA/cuDNN version: v 9.0

  • GPU model and memory: standard_gpu on google cloud. Local: Gtx 1050 4GB
  • Exact command to reproduce:
gcloud ml-engine jobs submit training hardhat40 \
--job-dir=gs://ppe_detection/data/ \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/absl-py-0.7.0.tar.gz \
--module-name object_detection.train \
--region europe-west1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.9 -- \
--train_dir=gs://ppe_detection/data/ \
--pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config
  • Exact config to reproduce:
model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.00006
          schedule {
            step: 6000000
            learning_rate: .000006
          }
          schedule {
            step: 7000000
            learning_rate: .0000006
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://ppe_detection/data/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true

  num_steps: 400000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/train.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
}

eval_config: {
  metrics_set: "open_images_metrics"
  num_examples: 104
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/test.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}

  • Cloud yaml file:
trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 3
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
  • Model Type:
    faster_rcnn_inception_resnet_v2_atrous_oid

Source code / logs

master-replica-0
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/legacy/trainer.py", line 415, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 833, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
  [[Node: SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_2/Branch_0/Conv2d_1x1/weights/read_S10901 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=7265986053344373253, tensor_name="edge_10431...ights/read", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]]
  [[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField/Size_G289 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:master/replica:0/task:0/device:GPU:0", send_device_incarnation=-2716543638016603643, tensor_name="edge_11837...Field/Size", tensor_type=DT_INT32, _device="/job:master/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField_1/Assert/Assert/data_0_S4645)]]

research

All 8 comments

Are you using legacy/train.py? It uses slim.learning, which hasn't worked well with gRPC since TF 1.5. Try following the latest guide and let me know if it doesn't work.
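
In case it helps, here is a rough sketch of the same submission using the newer model_main entry point instead of legacy train.py. The job name, the model_dir path, and the pycocotools package are assumptions on my part (model_main typically needs pycocotools for its evaluation metrics); the remaining flags are copied from the original command:

gcloud ml-engine jobs submit training hardhat41 \
--job-dir=gs://ppe_detection/data/ \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region europe-west1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.9 -- \
--model_dir=gs://ppe_detection/data/model_dir \
--pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config

Note that model_main takes --model_dir rather than --train_dir.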

@moussas1 Any progress on this?

I've faced the same issue. It's reproducible with the latest Pet detection tutorial.

message:  "An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error"

Job configuration:

{
  "scaleTier": "CUSTOM",
  "masterType": "standard_gpu",
  "workerType": "standard_gpu",
  "parameterServerType": "standard",
  "workerCount": "5",
  "parameterServerCount": "3",
  "packageUris": [
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/object_detection-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/slim-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/pycocotools-2.0.tar.gz"
  ],
  "pythonModule": "object_detection.model_main",
  "args": [
    "--model_dir=gs://nevermind-tf/model_dir",
    "--pipeline_config_path=gs://nevermind-tf/data/faster_rcnn_resnet101_pets.config"
  ],
  "region": "us-central1",
  "runtimeVersion": "1.9",
  "jobDir": "gs://nevermind-tf/model_dir"
}
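
For anyone reproducing this with the --config file format used earlier in the thread, the same cluster shape would look roughly like this as a trainingInput yaml (a sketch; the field names simply mirror the JSON keys above):

trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard
  workerCount: 5
  parameterServerCount: 3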

@pkulzc Well, now I am receiving the following errors.

Truncated error message: Not found: /tmp/tmpnfbpvza3/model.ckpt-0_temp_5d9e4dd6d7a44a6bab1c206a5a916008; No such file or directory

Followed by this

worker-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error

@prameshbajra Unfortunately not

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
  [[Node: clip_by_global_norm/mul_482_S12705 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_24320_clip_by_global_norm/mul_482", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/device:CPU:0"]()]]
  [[Node: train/update/NoOp_1_S12824 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=1392361354333875400, tensor_name="edge_25101_train/update/NoOp_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:GPU:0"]()]]
  [[Node: control_dependency_G415 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_21952_control_dependency", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:CPU:0"]()]]

For future viewers, please switch to another thread to track the progress of this issue, since this one is a duplicate.

Hi there,
We are checking to see if you still need help on this, as it seems to be an old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

