I have been trying to take a pretrained model from the TensorFlow Object Detection API and retrain it on Google Cloud, but after roughly 40 attempts I am stuck with this error. Sometimes the job fails right at the start, and sometimes it runs for a few steps and then fails.
Bazel version (if compiling from source):
CUDA/cuDNN version: 9.0
gcloud ml-engine jobs submit training hardhat40 \
    --job-dir=gs://ppe_detection/data/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/absl-py-0.7.0.tar.gz \
    --module-name object_detection.train \
    --region europe-west1 \
    --config object_detection/samples/cloud/cloud.yml \
    --runtime-version=1.9 \
    -- \
    --train_dir=gs://ppe_detection/data/ \
    --pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config
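(For context, the object_detection and slim tarballs passed to --packages are presumably the sdist packages built as in the Object Detection API cloud setup; the working directory below is an assumption, taken from that guide.)

# Assumed package build step, run from tensorflow/models/research/
python setup.py sdist
(cd slim && python setup.py sdist)

The pipeline config referenced above (faster_rcnn_inception_resnet_v2_atrous_oid.config):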
model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.00006
          schedule {
            step: 6000000
            learning_rate: .000006
          }
          schedule {
            step: 7000000
            learning_rate: .0000006
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://ppe_detection/data/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  num_steps: 400000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/train.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
}
eval_config: {
  metrics_set: "open_images_metrics"
  num_examples: 104
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://ppe_detection/data/test.record"
  }
  label_map_path: "gs://ppe_detection/data/labelmap.pbtxt"
  shuffle: false
  num_readers: 1
}
trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 3
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
master-replica-0
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/legacy/trainer.py", line 415, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 833, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
  [[Node: SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_2/Branch_0/Conv2d_1x1/weights/read_S10901 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=7265986053344373253, tensor_name="edge_10431...ights/read", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]]
  [[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField/Size_G289 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:master/replica:0/task:0/device:GPU:0", send_device_incarnation=-2716543638016603643, tensor_name="edge_11837...Field/Size", tensor_type=DT_INT32, _device="/job:master/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/SortByField_1/Assert/Assert/data_0_S4645)]]
Are you using legacy/train.py? It uses slim.learning, which hasn't worked well with gRPC since TF 1.5. Try following the latest guide (which uses object_detection.model_main) and let me know if it doesn't work.
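A minimal sketch of the updated submission, based on the current running-pets cloud guide and assuming the same bucket layout as above; the job name and model_dir path are placeholders, and the pycocotools tarball path assumes it was built with the guide's create_pycocotools_package.sh script:

gcloud ml-engine jobs submit training <job_name> \
    --job-dir=gs://ppe_detection/data/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region europe-west1 \
    --config object_detection/samples/cloud/cloud.yml \
    --runtime-version=1.9 \
    -- \
    --model_dir=gs://ppe_detection/data/model_dir \
    --pipeline_config_path=gs://ppe_detection/data/faster_rcnn_inception_resnet_v2_atrous_oid.config

Note that model_main takes --model_dir rather than the legacy --train_dir flag.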
@moussas1 Any progress on this?
I've faced the same issue. It's reproducible with the latest Pet detection tutorial.
message: "An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error"
Job configuration:
{
  "scaleTier": "CUSTOM",
  "masterType": "standard_gpu",
  "workerType": "standard_gpu",
  "parameterServerType": "standard",
  "workerCount": "5",
  "parameterServerCount": "3",
  "packageUris": [
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/object_detection-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/slim-0.1.tar.gz",
    "gs://nevermind-tf/model_dir/packages/cbdfa31f46b8fae654b4dd2463d790543084a4540e67306aba3859fadede5c09/pycocotools-2.0.tar.gz"
  ],
  "pythonModule": "object_detection.model_main",
  "args": [
    "--model_dir=gs://nevermind-tf/model_dir",
    "--pipeline_config_path=gs://nevermind-tf/data/faster_rcnn_resnet101_pets.config"
  ],
  "region": "us-central1",
  "runtimeVersion": "1.9",
  "jobDir": "gs://nevermind-tf/model_dir"
}
@pkulzc Well, now I am receiving the following errors.
Truncated error message: Not found: /tmp/tmpnfbpvza3/model.ckpt-0_temp_5d9e4dd6d7a44a6bab1c206a5a916008; No such file or directory
Followed by this, from worker-replica-0:
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
@prameshbajra Unfortunately not
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
  [[Node: clip_by_global_norm/mul_482_S12705 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_24320_clip_by_global_norm/mul_482", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/device:CPU:0"]()]]
  [[Node: train/update/NoOp_1_S12824 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:GPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=1392361354333875400, tensor_name="edge_25101_train/update/NoOp_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:GPU:0"]()]]
  [[Node: control_dependency_G415 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:4/device:CPU:0", send_device="/job:worker/replica:0/task:4/device:GPU:0", send_device_incarnation=-2529168372228152166, tensor_name="edge_21952_control_dependency", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:4/device:CPU:0"]()]]
For future viewers: please follow the other thread to track progress on this issue, since this one is a duplicate.
Hi there,
We are checking to see if you still need help on this, as it seems to be an old issue. Please update this issue with the latest information, a code snippet to reproduce the problem, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you no longer need help with this issue, please consider closing it.