While running a training job on Cloud ML Engine (Runtime Version 1.6), after running for a while (~5900 steps), the job fails and worker-replica-0 reports the following message:
Error reported to Coordinator:
, OS Error
Finally, worker-replica-0 returns:
returned non-zero exit status 1
Please see logs/Traceback below.
rate_index = tf.reduce_max(tf.where(tf.greater_equal(global_step, boundaries),
                                    list(range(num_boundaries)),
                                    [0] * num_boundaries))
gcloud ml-engine jobs submit training object_detection_`date +%s` \
--job-dir=gs://${TRAIN_DIR} \
--packages gs://${DIST_DIR}/object_detection-0.1.tar.gz,gs://${DIST_DIR}/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config ${PATH_TO_LOCAL_YAML_FILE} \
-- \
--train_dir=gs://${TRAIN_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
--num_clones=4
trainingInput:
  runtimeVersion: '1.6'
  pythonVersion: '3.5'
  scaleTier: CUSTOM
  masterType: complex_model_m_p100
  workerType: complex_model_m_p100
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 3
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 3e-4
          schedule {
            step: 900000
            learning_rate: 3e-5
          }
          schedule {
            step: 1200000
            learning_rate: 3e-6
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 30000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_train.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
}
eval_config: {
  num_examples: 25
  # Note: The below line limits the evaluation process to 30 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 30
  visualization_export_dir: "gs://PATH_TO_BE_CONFIGURED/visualization"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_val.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
worker-replica-0
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
Hi, I managed to run the faster_rcnn kitti model with transfer learning on my own dataset up to ~80k steps by disabling the workers, leaving only the standard_gpu master and two large_model parameter servers. I suspect the issue is caused by an OOM error, as the GPU's (Tesla) memory usage is at ~40%.
Hi @MoMo-Tech, how can removing the workers alleviate GPU OOM issues?
Hi @pkulzc, if you repeat the same experiment with Cloud ML Engine TF 1.5 instead of 1.6, you'll reproduce #3757. I believe they are the same issue, just manifesting differently in TF 1.5 and TF 1.6.
On my initial runs I noticed that the workers were assigned to the same machine ID. With this in mind I changed the GPU type to the next Tesla tier and noticed that a few more steps were computed, but it still failed at around 10K epochs. There are also a few other tickets claiming that an OOM may be handled by the OS, which then surfaces as the OS Error.
Thanks for the info. We're investigating this as well as #3757 now.
Hi @pkulzc, today I pulled the HEAD of the models repository and tried again, and the problem persists. However, I got more informative error messages than before:
master-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 =
_Recv[client_terminated=false,
recv_device="/job:master/replica:0/task:0/device:CPU:0",
send_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device_incarnation=7295333245552102225,
tensor_name="edge_32739_Momentum/update/NoOp_2",
tensor_type=DT_FLOAT,
_device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:3/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
worker-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
2018-04-04 16:19:47.928690: E tensorflow/core/distributed_runtime/master_session.cc:1663] Cleanup partition error: Unavailable: OS Error
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
ml-engine
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
Hi @pkulzc, thank you for investigating this issue!
Could you share any updates on the status of this issue?
Thanks 👍
Sorry for the delay. The Cloud team and the TensorFlow team are still investigating.
Great!
Thanks for the update
Hi @pkulzc, I think I may have a lead:
At line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as documented in the tf.train.Supervisor docs.
This is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245.
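For illustration only, here is a minimal sketch of what that migration looks like in TF 1.x. The graph pieces (loss, optimizer, checkpoint path, step limit) are placeholders rather than the actual object_detection training code; the point is that tf.train.MonitoredTrainingSession takes over session creation, checkpointing, and recovery from transient errors (including AbortedError and UnavailableError) that tf.train.Supervisor otherwise lets kill the replica:

import tensorflow as tf

# Placeholder graph standing in for what object_detection/trainer.py builds.
global_step = tf.train.get_or_create_global_step()
weights = tf.get_variable("weights", shape=[10])
loss = tf.reduce_sum(tf.square(weights))
train_op = tf.train.MomentumOptimizer(learning_rate=3e-4, momentum=0.9).minimize(
    loss, global_step=global_step)

# MonitoredTrainingSession replaces tf.train.Supervisor: it creates/recovers the
# session, writes checkpoints and summaries, and retries after AbortedError or
# UnavailableError instead of letting the replica exit with status 1.
with tf.train.MonitoredTrainingSession(
        master="",                        # worker target in a distributed job
        is_chief=True,                    # chief writes checkpoints/summaries
        checkpoint_dir="/tmp/train_dir",  # placeholder for gs://${TRAIN_DIR}
        save_checkpoint_secs=600,
        hooks=[tf.train.StopAtStepHook(last_step=30000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)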
@MoMo-Tech I also managed to run transfer learning based on the ssd_mobilenet_v1_coco_2017_11_17 model up to about 80k steps, and it stopped with the following error.
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 410, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1466216932&resource=ml_job%2Fjob_id%2Ffreshturf_object_detection_1530117506&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22freshturf_object_detection_1530117506%22
The master's memory usage was around 38.8–49.7%.
I had the same error when the TensorFlow instances were trying to contact each other. I found that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
Maybe this helps.
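As a sketch of that suggestion (not part of the original report): gRPC reads the variable when it initializes, so it has to be set before the TensorFlow runtime starts, either in the job's environment or at the very top of the entry-point script, for example:

# Set the gRPC polling strategy before TensorFlow (and its gRPC runtime) is
# loaded; changing it after the server/session has started has no effect.
import os
os.environ["GRPC_POLL_STRATEGY"] = "poll"  # switch from the native epoll engine

import tensorflow as tf  # imported only after the environment variable is set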
I believe this issue is gone after our switch to the tf.estimator framework. Closing this.
@pkulzc Can you please clarify how this issue was solved? I am still facing the same problem.
@moussas1 Have you synced to latest?