Hi!
I've run into an error when using multi_gpu_model with the seq2seq model referenced here:
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Specifically, this is the error received when running on 2 GPUs:
InvalidArgumentError: Incompatible shapes: [16,128] vs. [32,128]
[[Node: replica_0/model_1/lstm_2/while/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/lstm_2/while/BiasAdd, replica_0/model_1/lstm_2/while/MatMul_4)]]
[[Node: loss/mul/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4498_loss/mul", tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
32 is my batch size and 128 is the latent dimension of the encoding space. Running on 4 GPUs gives the same error, but with [8,128] vs. [32,128] instead.
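For reference, multi_gpu_model splits every model input along the batch axis into one sub-batch per GPU, which is where the smaller number in each pair comes from (32 / 2 = 16, 32 / 4 = 8). A rough NumPy illustration of that arithmetic (not of Keras's internals):

import numpy as np

full_batch = np.zeros((32, 128))                         # (batch_size, latent_dim)
per_gpu = np.array_split(full_batch, 2)                  # 2 GPUs -> two slices of (16, 128)
print([s.shape for s in per_gpu])                        # [(16, 128), (16, 128)]
print([s.shape for s in np.array_split(full_batch, 4)])  # 4 GPUs -> four slices of (8, 128)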
The full error log is here (I have to run the code on Google ML Engine, as I don't have multiple GPUs myself).
The Python code is here, and the error can be reproduced by running:
python seq2seq.py --nbr-gpus 2
TensorFlow version: 1.4.1
Keras version: 2.1.4
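Since the linked script isn't included above, here is a minimal sketch of the kind of setup that triggers the error, assuming it follows the training model from the blog post; latent_dim = 128 and batch_size = 32 come from this report, while num_encoder_tokens and num_decoder_tokens are placeholder values:

from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.utils import multi_gpu_model

latent_dim = 128          # matches the [*, 128] shapes in the error
num_encoder_tokens = 71   # placeholder vocabulary sizes
num_decoder_tokens = 93

# Encoder, as in the blog post: keep only the final LSTM states.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder initialised with the encoder states -- the part that
# interacts badly with multi_gpu_model.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=[state_h, state_c])
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# parallel_model.fit([encoder_data, decoder_data], target_data, batch_size=32)
# -> raises the InvalidArgumentError above when run on 2 GPUs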
I hope someone knows what's wrong, as I've spent the last few days trying to fix this without success.
Any ideas?
I'm seeing a similar issue; I believe the common theme is setting an initial state on recurrent layers.
Reproducing code (edited to clarify that the cause is the combination of initial_state and multi_gpu_model):
import numpy as np
from keras import layers as L
from keras.models import Model
from keras.utils import multi_gpu_model

# Case 1: a plain SimpleRNN with no explicit initial state.
x = L.Input((4, 3))
y = L.SimpleRNN(3, return_sequences=True)(x)
_x = np.random.randn(2, 4, 3)
_y = np.random.randn(2, 4, 3)
m = Model(x, y)
m.compile(loss='mean_squared_error', optimizer='adam')
m.train_on_batch(_x, _y)
print("Success!")

# The same model wrapped with multi_gpu_model also trains fine.
m2 = multi_gpu_model(m, 2)
m2.compile(loss='mean_squared_error', optimizer='adam')
m2.train_on_batch(_x, _y)
print("Success 2!")

# Case 2: the same RNN, but fed an initial_state through a second Input.
x = L.Input((4, 3))
init_state = L.Input((3,))
y = L.SimpleRNN(3, return_sequences=True)(x, initial_state=init_state)
_x = [np.random.randn(2, 4, 3), np.random.randn(2, 3)]
_y = np.random.randn(2, 4, 3)
m = Model([x, init_state], y)
m.compile(loss='mean_squared_error', optimizer='adam')
m.train_on_batch(_x, _y)
print("Success 3!")

# Wrapping this model with multi_gpu_model is what fails.
m2 = multi_gpu_model(m, 2)
m2.compile(loss='mean_squared_error', optimizer='adam')
m2.train_on_batch(_x, _y)
print("Success 4!")  # never reached; InvalidArgumentError is raised instead
Success!
Success 2!
Success 3!
...
InvalidArgumentError (see above for traceback): Incompatible shapes: [4,4,3] vs. [2,4,3]
[[Node: training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss_1/simple_rnn_1_loss/sub"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/Shape/_143, training_1/Adam/gradients/loss_1/simple_rnn_1_loss/sub_grad/Shape_1)]]
[[Node: training_1/Adam/gradients/simple_rnn_1_1/concat_grad/Slice_1/_189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_545_training_1/Adam/gradients/simple_rnn_1_1/concat_grad/Slice_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
I am having exactly the same problem after adapting the code from https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py#L144 to use 2 GPUs.
Same problem. Any ideas on how to solve it?
Exactly the same problem here as well, also with code from https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py#L144 adapted to use 2 GPUs.
Same problem here; any help would be appreciated.
Exactly the same problem! How can this be solved when combining initial_state and multi_gpu_model?
@bezigon @kwonyoungjoo @jaimevargast @LeZhengThu @zyxue @mharradon @oskarjonefors This bug has been fixed by #10845; please check out the latest code. If it still doesn't work, please let me know. Thanks.
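If you want to verify the fix before it reaches a PyPI release, installing Keras from the master branch and rerunning the reproduction should be enough (this is just the standard pip install-from-git syntax; the script name is the one from the original report):

pip install --upgrade git+https://github.com/keras-team/keras.git
python seq2seq.py --nbr-gpus 2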