Sagemaker-python-sdk: Cannot colocate nodes, Cannot merge devices with incompatible jobs: '/job:master/task:0' and '/job:ps/task:1'

Created on 31 Jul 2018 · 4 Comments · Source: aws/sagemaker-python-sdk


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow (Keras)
  • Framework Version: 1.8
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version:
  • Are you using a custom image: No

Describe the problem

I created a keras_model_fn and am trying to train the model on 3 c4 instances. Unfortunately, I get the error detailed below.
Stack Overflow suggests using soft_placement, but I don't know what that means or how to use it.
Help!

Minimal repro / logs

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'embedding_1/embeddings' and 'training/Adam/gradients/embedding_1/GatherV2_grad/Shape: Cannot merge devices with incompatible jobs: '/job:master/task:0' and '/job:ps/task:1'

011 [[Node: embedding_1/embeddings = VariableV2[_class=["loc:@embedding_1/embeddings"], container="", dtype=DT_FLOAT, shape=[28,300], shared_name="", _device="/job:ps/task:1"]()]]

All 4 comments

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))

Hello,

This will be difficult to diagnose without getting a minimal repro.

Thanks!

Distributed TensorFlow training is not currently supported if you use keras_model_fn.
You need to convert your model to a TensorFlow estimator by writing a model_fn.

See the following:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#using-a-keras-model-instead-of-a-model_fn
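For anyone trying to read the error itself: the message says that embedding_1/embeddings lives on '/job:ps/task:1', that its gradient op is tied to it by a colocation constraint (the `loc:@embedding_1/embeddings` attribute), and that the gradient op was assigned to '/job:master/task:0'. Two colocated nodes must end up on one device, so their device specs have to merge, and these two disagree on the job. Below is a rough pure-Python illustration of that merge rule. It is not TensorFlow's placer, and `parse_device` and `merge_devices` are hypothetical helpers written only for this sketch.

```python
from typing import Dict


def parse_device(spec: str) -> Dict[str, str]:
    """Split a spec such as '/job:ps/task:1' into {'job': 'ps', 'task': '1'}."""
    fields = {}
    for part in spec.strip("/").split("/"):
        if not part:
            continue
        # Only split on the first ':' so 'device:CPU:0' keeps 'CPU:0' as its value.
        key, _, value = part.partition(":")
        fields[key] = value
    return fields


def merge_devices(a: str, b: str) -> Dict[str, str]:
    """Merge two partial device specs; fail if a shared field disagrees.

    Hypothetical illustration of the rule behind the error message,
    not the actual TensorFlow implementation.
    """
    merged = parse_device(a)
    for key, value in parse_device(b).items():
        if key in merged and merged[key] != value:
            raise ValueError(
                "Cannot merge devices with incompatible {}s: {!r} and {!r}".format(
                    key, a, b
                )
            )
        merged[key] = value
    return merged


# Partial specs that do not conflict merge cleanly.
print(merge_devices("/job:ps", "/task:1"))

# The pair from the error message conflicts on the 'job' field.
try:
    merge_devices("/job:master/task:0", "/job:ps/task:1")
except ValueError as err:
    print(err)
```

In the failing job, the variable and its gradient op were given devices in different jobs, which is consistent with keras_model_fn not being supported for distributed training.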

@khu834
Thanks for clarification!

I apologize that I wasn't able to recognize that this was the problem for @gautiese.

I'll close this issue, as it doesn't seem we can resolve the problem.
