I created a keras_model_fn and am trying to train the model on 3 c4 instances. Unfortunately, I get the following error (detailed below).
Stack Overflow suggests using allow_soft_placement (I don't know what that means or how to use it).
Help!
InvalidArgumentError (see above for traceback): Cannot colocate nodes 'embedding_1/embeddings' and 'training/Adam/gradients/embedding_1/GatherV2_grad/Shape: Cannot merge devices with incompatible jobs: '/job:master/task:0' and '/job:ps/task:1'
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
Hello,
This will be difficult to diagnose without a minimal repro.
Thanks!
Distributed TensorFlow training is not currently supported if you use keras_model_fn.
You need to convert your model to a TensorFlow estimator by defining a model_fn instead.
See the following:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#using-a-keras-model-instead-of-a-model_fn
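As for the error message itself: it says two nodes were asked to live on the same device, but one is pinned to '/job:master/task:0' and the other to '/job:ps/task:1'. A rough pure-Python sketch of that kind of check is below. This is only an illustration of the idea, not TensorFlow's actual placement code, and `parse_device` and `merge_devices` are hypothetical helper names made up for this example.

```python
def parse_device(spec):
    """Split a device string such as '/job:ps/task:1' into a dict of fields."""
    fields = {}
    for part in spec.strip("/").split("/"):
        if part:
            key, _, value = part.partition(":")
            fields[key] = value
    return fields


def merge_devices(a, b):
    """Combine two device specs, failing if they disagree on any shared field.

    Hypothetical helper: it mimics the spirit of a colocation check, where two
    nodes must end up on one device, so their specs must not conflict.
    """
    merged = parse_device(a)
    for key, value in parse_device(b).items():
        if key in merged and merged[key] != value:
            raise ValueError(
                f"Cannot merge devices with incompatible {key}s: {a!r} and {b!r}"
            )
        merged[key] = value
    return merged


if __name__ == "__main__":
    # Partial specs that do not conflict can be combined.
    print(merge_devices("/job:ps", "/task:1"))
    # The two specs from the error disagree on the job, so this raises.
    try:
        merge_devices("/job:master/task:0", "/job:ps/task:1")
    except ValueError as err:
        print(err)
```

In the failing job, the embedding variable and its gradient op are constrained to be colocated, yet they were assigned to different jobs, so no single device satisfies both.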
@khu834
Thanks for the clarification!
I apologize that I wasn't able to recognize that this was the problem for @gautiese.
I'll close this issue, as it doesn't seem we can resolve the problem.