Keras: multi_gpu_model on seq2seq model results in error: Incompatible shapes

Created on 21 Feb 2018 · 8 comments · Source: keras-team/keras

Hi!

I've run into an error when using multi_gpu_model for the seq2seq model referenced here:
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Specifically, this is the error received when running on 2 GPUs:

InvalidArgumentError: Incompatible shapes: [16,128] vs. [32,128]
 [[Node: replica_0/model_1/lstm_2/while/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/lstm_2/while/BiasAdd, replica_0/model_1/lstm_2/while/MatMul_4)]]
 [[Node: loss/mul/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4498_loss/mul", tensor_type=DT_FLOAT, 
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

32 is my batch size and 128 is the latent dim of the encoding space, so 16 looks like the per-GPU half of the batch. Running on 4 GPUs gives the same error, but with [8,128] vs. [32,128] instead.

The full error log is here (I have to run the code on Google ML Engine as I don't have multiple GPUs myself).
The Python code is here, and the error can be reproduced by running

python seq2seq.py --nbr-gpus 2
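
For context, the model being wrapped is essentially the encoder-decoder from the blog post. A minimal sketch of the setup (the vocabulary sizes below are placeholders, and the exact wrapping in seq2seq.py may differ in details) looks like this:

from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.utils import multi_gpu_model

num_encoder_tokens, num_decoder_tokens = 71, 93  # placeholder vocabulary sizes
latent_dim = 128  # matches the latent dim in the error above

# Encoder: keep only the final hidden and cell states.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialised with the encoder states.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
parallel_model = multi_gpu_model(model, gpus=2)  # this wrapping is what fails
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')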

Tensorflow version: 1.4.1
Keras version: 2.1.4

#8976 looks like a very similar error, but it was supposedly addressed in PR #9031, which did not fix my issue.

Hope someone knows what's wrong, as I've spent the last few days trying to fix this without success.
Any ideas?

All 8 comments

I see a similar issue - I believe the common theme is setting an initial state on recurrent layers.

Reproducing code (edited to clarify that the cause is the combination of initial state and multi_gpu_model):

import numpy as np

from keras import layers as L
from keras.models import Model
from keras.utils.multi_gpu_utils import multi_gpu_model

# Case 1: SimpleRNN without an explicit initial state.
# Trains fine both as a plain model and wrapped with multi_gpu_model.
x = L.Input((4, 3))
y = L.SimpleRNN(3, return_sequences=True)(x)

_x = np.random.randn(2, 4, 3)
_y = np.random.randn(2, 4, 3)

m = Model(x, y)
m.compile(loss='mean_squared_error', optimizer='adam')
m.train_on_batch(_x, _y)

print("Success!")

m2 = multi_gpu_model(m, 2)
m2.compile(loss='mean_squared_error', optimizer='adam')
m2.train_on_batch(_x, _y)

print("Success 2!")

# Case 2: SimpleRNN fed an explicit initial_state input.
# The plain model trains, but the multi_gpu_model wrapper fails.
x = L.Input((4, 3))
init_state = L.Input((3,))
y = L.SimpleRNN(3, return_sequences=True)(x, initial_state=init_state)

_x = [np.random.randn(2, 4, 3), np.random.randn(2, 3)]
_y = np.random.randn(2, 4, 3)

m = Model([x, init_state], y)
m.compile(loss='mean_squared_error', optimizer='adam')
m.train_on_batch(_x, _y)

print("Success 3!")

m2 = multi_gpu_model(m, 2)
m2.compile(loss='mean_squared_error', optimizer='adam')
m2.train_on_batch(_x, _y)

print("Success 4!")

Success!
Success 2!
Success 3!
...

InvalidArgumentError (see above for traceback): Incompatible shapes: [4,4,3] vs. [2,4,3]
         [[Node: training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss_1/simple_rnn_1_loss/sub"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/Shape/_143, training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/Shape_1)]]
         [[Node: training_1/Adam/gradients/simple_rnn_1_1/concat_grad/Slice_1/_189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_545_training_1/Adam/gradients/simple_rnn_1_1/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
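
The shapes in these errors are consistent with the way multi_gpu_model splits a batch across devices. A rough sketch of that wrapping, simplified to a single-output model (this is not the actual Keras source, just the idea), is:

import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model

def naive_multi_gpu(model, gpus):
    # Slice every model input along the batch axis, replay the model on each
    # GPU with its slice, then concatenate the per-replica outputs on the CPU.
    def get_slice(x, index, parts):
        size = tf.shape(x)[0] // parts
        return x[index * size:(index + 1) * size]

    replica_outputs = []
    for gpu in range(gpus):
        with tf.device('/gpu:%d' % gpu):
            sliced_inputs = [
                Lambda(get_slice, arguments={'index': gpu, 'parts': gpus})(inp)
                for inp in model.inputs
            ]
            replica_outputs.append(model(sliced_inputs))
    with tf.device('/cpu:0'):
        merged = concatenate(replica_outputs, axis=0)
    return Model(model.inputs, merged)

Each replica is only supposed to see its slice of the batch (32 -> 16 per GPU in the original report), with the outputs concatenated back to the full batch. The mismatches above ([16,128] vs. [32,128], and [4,4,3] vs. [2,4,3] for the snippet's 2-sample batch) suggest that when an RNN takes its initial state from a separate model input, some tensor is not sliced consistently with the others inside the wrapper.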

I am having exactly the same problem after adapting the code from https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py#L144 to use 2 GPUs.

Same problem. Any ideas on how to solve it?

Exactly the same problem here as well, also with the code from https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py#L144 adapted to use 2 GPUs.

Same problem, please help~

Exactly the same problem! How can this be solved when using initial_state together with multi_gpu_model?

@bezigon @kwonyoungjoo @jaimevargast @LeZhengThu @zyxue @mharradon @oskarjonefors This bug has been fixed by #10845; please check out the latest code. If it still doesn't work, please let me know. Thanks.
