Models: "Unavailable OSError" when training tensorflow object detection model on google cloud.

Created on 18 Feb 2019 · 8 comments · Source: tensorflow/models

Describe the problem

I have been trying to retrain a pretrained model from the TensorFlow Object Detection API on Google Cloud, but after roughly 40 attempts I am stuck with this error. Sometimes the job fails right at the start, and sometimes it runs for a few steps and then fails.

System information

  • What is the top-level directory of the model you are using: models/research
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Only config modification
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 (base). Training on Google Cloud ML Engine runtime 1.9
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): 1.9 gpu
  • Bazel version (if compiling from source):

  • CUDA/cuDNN version: v 9.0

  • GPU model and memory: standard_gpu on google cloud. Local: Gtx 1050 4GB
  • Exact command to reproduce:
gcloud ml-engine jobs submit training hardhat40 \
--job-dir=gs://ppe_detection/data/ \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/absl-py-0.7.0.tar.gz \
--module-name object_detection.train \
--region europe-west1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.9 -- \
--train_dir=gs://ppe_detection/data/ \
--pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config
  • Exact config to reproduce:
model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.00006
          schedule {
            step: 6000000
            learning_rate: .000006
          }
          schedule {
            step: 7000000
            learning_rate: .0000006
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://ppe_detection/data/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true

  num_steps: 400000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/train.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
}

eval_config: {
  metrics_set: "open_images_metrics"
  num_examples: 104
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/test.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}

  • Cloud yaml file:
trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 3
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
  • Model Type:
    faster_rcnn_inception_resnet_v2_atrous_oid

Source code / logs

master-replica-0
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/legacy/trainer.py", line 415, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 833, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
  [[Node: SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_2/Branch_0/Conv2d_1x1/weights/read_S10901 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=7265986053344373253, tensor_name="edge_10431...ights/read", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]]
  [[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField/Size_G289 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:master/replica:0/task:0/device:GPU:0", send_device_incarnation=-2716543638016603643, tensor_name="edge_11837...Field/Size", tensor_type=DT_INT32, _device="/job:master/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField_1/Assert/Assert/data_0_S4645)]]

research

All 8 comments

Are you using legacy/train.py? It uses slim.learning, which hasn't worked well with gRPC since TF 1.5. Try following the latest guide and let me know if it doesn't work.
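
In case it helps, here is a rough sketch of the same submission using the newer model_main entry point instead of legacy train.py. The job name, the model_dir path, and the pycocotools package are assumptions on my part (model_main typically needs pycocotools for its evaluation metrics); the remaining flags are copied from the original command:

gcloud ml-engine jobs submit training hardhat41 \
--job-dir=gs://ppe_detection/data/ \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region europe-west1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.9 -- \
--model_dir=gs://ppe_detection/data/model_dir \
--pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config

Note that model_main takes --model_dir rather than --train_dir.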

@moussas1 Any progress on this?

I've faced the same issue. It's reproducible with the latest Pet detection tutorial.

message:  "An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error"

Job configuration:

{
  "scaleTier": "CUSTOM",
  "masterType": "standard_gpu",
  "workerType": "standard_gpu",
  "parameterServerType": "standard",
  "workerCount": "5",
  "parameterServerCount": "3",
  "packageUris": [
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/object_detection-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/slim-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/pycocotools-2.0.tar.gz"
  ],
  "pythonModule": "object_detection.model_main",
  "args": [
    "--model_dir=gs://nevermind-tf/model_dir",
    "--pipeline_config_path=gs://nevermind-tf/data/faster_rcnn_resnet101_pets.config"
  ],
  "region": "us-central1",
  "runtimeVersion": "1.9",
  "jobDir": "gs://nevermind-tf/model_dir"
}
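
For anyone reproducing this with the --config file format used earlier in the thread, the same cluster shape would look roughly like this as a trainingInput yaml (a sketch; the field names simply mirror the JSON keys above):

trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard
  workerCount: 5
  parameterServerCount: 3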

@pkulzc Well, now I am receiving the following errors.

Truncated error message: Not found: /tmp/tmpnfbpvza3/model.ckpt-0_temp_5d9e4dd6d7a44a6bab1c206a5a916008; No such file or directory

Followed by this

worker-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error

@prameshbajra Unfortunately not

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
  [[Node: clip_by_global_norm/mul_482_S12705 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_24320_clip_by_global_norm/mul_482", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/device:CPU:0"]()]]
  [[Node: train/update/NoOp_1_S12824 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=1392361354333875400, tensor_name="edge_25101_train/update/NoOp_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:GPU:0"]()]]
  [[Node: control_dependency_G415 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_21952_control_dependency", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:CPU:0"]()]]

For future viewers, please switch to another thread to track the progress of this issue, since this one is a duplicate.

Hi there,
We are checking to see if you still need help on this, as it seems to be an old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

