If you try to train the standard VAE model from the examples (https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py) following the multi_gpu_model usage example (https://github.com/keras-team/keras/blob/c8bef99ec7a2032b9bea6e9a1260d05a2b6a80f1/keras/utils/training_utils.py#L56-L93), the tensor sizes do not match.
The input data x gets sliced down by the number of GPUs, but the tensor z representing the latent variable z is not.
When the loss function in the custom loss layer runs, it calculates the loss in two parts: one from the input data x that has shape[0] of batch_size/nGPUs and another part from the latent representation z which remains of size batch_size. Thus when the loss function attempts to add these parts together there is a mismatch.
Clearly this is not an ideal outcome when using the multi_gpu_model function with a standard reference model. See my minimal example here: https://gist.github.com/MatthewWilletts/7eef6a201413f936dff55378b4a14ecf
For a batch_size of 12 and with 3 GPUs the errors are of the form:
2018-01-24 20:54:38.466550: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Incompatible shapes: [4] vs. [12]
[[Node: replica_0/model_1/custom_variational_layer_1/add_1 = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](replica_0/model_1/custom_variational_layer_1/mul, replica_0/model_1/custom_variational_layer_1/mul_1)]]
You can see a complete error message at:
https://gist.github.com/MatthewWilletts/0c4332d6f7092a7acfa5fff5a29e868c
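To make the mismatch concrete, here is a minimal NumPy sketch, not Keras code. It assumes the batch is split into equal contiguous slices along axis 0; `slice_batch`, the array names, and the sizes 784 and 2 are illustrative stand-ins rather than anything taken from multi_gpu_model itself. With a batch of 12 and 3 GPUs, the reconstruction term sees 4 samples while the KL term still sees 12, so the addition fails:

```python
import numpy as np

def slice_batch(data, i, parts):
    """Return the i-th of `parts` contiguous slices along axis 0 (hypothetical helper)."""
    step = data.shape[0] // parts
    return data[i * step:(i + 1) * step]

batch_size, n_gpus = 12, 3
original_dim, latent_dim = 784, 2  # illustrative sizes, not taken from the thread

rng = np.random.default_rng(0)
x = rng.random((batch_size, original_dim))
z_mean = rng.standard_normal((batch_size, latent_dim))
z_log_var = rng.standard_normal((batch_size, latent_dim))

# Only the input is sliced for replica 0; the latent tensors keep the full batch.
x_replica = slice_batch(x, 0, n_gpus)

# Stand-in for the per-sample reconstruction term: shape (4,)
xent_loss = x_replica.sum(axis=-1)
# Per-sample KL term computed from the unsliced latent tensors: shape (12,)
kl_loss = -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

try:
    total = xent_loss + kl_loss
    mismatch = False
except ValueError:
    # NumPy cannot broadcast (4,) against (12,), mirroring the TensorFlow error.
    mismatch = True
    print("Incompatible shapes:", xent_loss.shape, "vs.", kl_loss.shape)
```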
What changes need to be made, either to the function multi_gpu_model or to the reference implementation of the VAE, to remove this shape incompatibility?
Thank you!
I'm hitting the same problem. Did you find a solution?
Hi, yes I did. The problem is that the layers representing the latent variables do not get sliced into minibatches across the GPUs the way the inputs and outputs are. I think this is a design flaw in the current multi_gpu approach in Keras: it breaks custom loss layers that depend on intermediate layers for their calculation. The easy fix is to add these layers as outputs of the model, so that they are split by the same process.
I.e., in the old version of the VAE, with the custom loss layer (https://github.com/keras-team/keras/blob/ce4947cbaf380589a63def4cc6eb3e460c41254f/examples/variational_autoencoder.py), we replace:
vae = Model(x, y)
with
vae = Model(x, [y, z_mean, z_log_var])
Hope that helps!
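The effect of that change can be sketched with NumPy. This is again a model of the slicing, not Keras code: `slice_batch` and the array names are hypothetical, and equal contiguous slices are assumed. Once `z_mean` and `z_log_var` are split the same way as `x`, both loss terms on each replica share one batch dimension, and joining the replica results recovers the full batch:

```python
import numpy as np

def slice_batch(data, i, parts):
    """Return the i-th of `parts` contiguous slices along axis 0 (hypothetical helper)."""
    step = data.shape[0] // parts
    return data[i * step:(i + 1) * step]

def kl_term(z_mean, z_log_var):
    """Per-sample KL divergence term, in the form used by the VAE example."""
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

batch_size, n_gpus = 12, 3
original_dim, latent_dim = 784, 2  # illustrative sizes, not taken from the thread

rng = np.random.default_rng(0)
x = rng.random((batch_size, original_dim))
z_mean = rng.standard_normal((batch_size, latent_dim))
z_log_var = rng.standard_normal((batch_size, latent_dim))

replica_losses = []
for i in range(n_gpus):
    # Every tensor the loss depends on is sliced identically.
    x_i = slice_batch(x, i, n_gpus)
    z_mean_i = slice_batch(z_mean, i, n_gpus)
    z_log_var_i = slice_batch(z_log_var, i, n_gpus)

    xent_i = x_i.sum(axis=-1)            # stand-in for the reconstruction term
    kl_i = kl_term(z_mean_i, z_log_var_i)
    replica_losses.append(xent_i + kl_i)  # both have shape (4,), so this works

# Concatenating the per-replica results gives one loss value per sample.
full_loss = np.concatenate(replica_losses)
print(full_loss.shape)
```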
@MatthewWilletts Thanks. I had a very similar problem in my model and solved it last night thanks to your hint about the loss dimension not being correct.
I used to have the same problem. Trying to make it work was a huge pain for me. In the end I can recommend https://github.com/uber/horovod - Horovod was what finally worked for me.
I am having a similar problem, but in my case the error is raised before fitting even starts, so I am unsure whether it is the same problem. Here is my model; if I run the code without multi_gpu_model, it works fine:
# merge the outputs of the embeddings, and everything that belongs to the most recent activity executions
main_output = concatenate(models, axis=2)
main_output = LSTM(25*32, batch_input_shape=(1,), stateful=True)(main_output)
# main_output = LSTM(25*32, batch_input_shape=(1,25*32), stateful=True)(main_output)
# after the LSTM has learned on the sequence, bring in the SP2/PFS features, as in Shibata's paper
main_output = concatenate([main_output, sequence_embedding])
main_output = Dense(20*32, activation='relu', name='dense_join')(main_output)
main_output = Dense(len(feature_dict["concept:name"]["to_int"]), activation='sigmoid', name='dense_final')(main_output)
full_model = Model(inputs=model_inputs, outputs=[main_output])
full_model = multi_gpu_model(full_model, gpus=ngpus)
full_model.compile(loss='categorical_crossentropy', optimizer='adam')
And here is the super-long error trace that originates when I try multi_gpu_model (ngpus = 8). What do you think, is the model malformed or does this belong to the referenced issue?
InvalidArgumentError Traceback (most recent call last)
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in set_shape(self, shape)
524 dim_list,
--> 525 unknown_shape)
526 except errors.InvalidArgumentError as e:
InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 0 and 1. Shapes are [0,800] and [1,800].
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-45-d22ed3259708> in <module>
55
56 full_model = Model(inputs=model_inputs, outputs=[main_output])
---> 57 full_model = multi_gpu_model(full_model)
58 full_model.compile(loss='categorical_crossentropy', optimizer='adam')
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py in multi_gpu_model(model, gpus, cpu_merge, cpu_relocation)
225 # Apply model on slice
226 # (creating a model replica on the target device).
--> 227 outputs = model(inputs)
228 outputs = to_list(outputs)
229
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
455 # Actually call the layer,
456 # collecting output(s), mask(s), and shape(s).
--> 457 output = self.call(inputs, **kwargs)
458 output_mask = self.compute_mask(inputs, previous_mask)
459
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/engine/network.py in call(self, inputs, mask)
562 return self._output_tensor_cache[cache_key]
563 else:
--> 564 output_tensors, _, _ = self.run_internal_graph(inputs, masks)
565 return output_tensors
566
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/engine/network.py in run_internal_graph(self, inputs, masks)
719 kwargs['mask'] = computed_mask
720 output_tensors = to_list(
--> 721 layer.call(computed_tensor, **kwargs))
722 output_masks = layer.compute_mask(computed_tensor,
723 computed_mask)
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/layers/recurrent.py in call(self, inputs, mask, training, initial_state)
2192 mask=mask,
2193 training=training,
-> 2194 initial_state=initial_state)
2195
2196 @property
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/layers/recurrent.py in call(self, inputs, mask, training, initial_state, constants)
647 mask=mask,
648 unroll=self.unroll,
--> 649 input_length=timesteps)
650 if self.stateful:
651 updates = []
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in rnn(step_function, inputs, initial_states, go_backwards, mask, constants, unroll, input_length)
3009 parallel_iterations=32,
3010 swap_memory=True,
-> 3011 maximum_iterations=input_length)
3012 last_time = final_outputs[0]
3013 output_ta = final_outputs[1]
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name, maximum_iterations, return_same_structure)
3230 ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
3231 result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants,
-> 3232 return_same_structure)
3233 if maximum_iterations is not None:
3234 return result[1]
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants, return_same_structure)
2950 with ops.get_default_graph()._mutation_lock(): # pylint: disable=protected-access
2951 original_body_result, exit_vars = self._BuildLoop(
-> 2952 pred, body, original_loop_vars, loop_vars, shape_invariants)
2953 finally:
2954 self.Exit()
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
2885 flat_sequence=vars_for_body_with_tensor_arrays)
2886 pre_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
-> 2887 body_result = body(*packed_vars_for_body)
2888 post_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
2889 if not nest.is_sequence(body_result):
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py in <lambda>(i, lv)
3199 cond = lambda i, lv: ( # pylint: disable=g-long-lambda
3200 math_ops.logical_and(i < maximum_iterations, orig_cond(*lv)))
-> 3201 body = lambda i, lv: (i + 1, orig_body(*lv))
3202
3203 if context.executing_eagerly():
~/anaconda3/envs/thesis/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _step(time, output_ta_t, *states)
2999 uses_learning_phase = True
3000 for state, new_state in zip(states, new_states):
-> 3001 new_state.set_shape(state.get_shape())
3002 output_ta_t = output_ta_t.write(time, output)
3003 return (time + 1, output_ta_t) + tuple(new_states)
~/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in set_shape(self, shape)
526 except errors.InvalidArgumentError as e:
527 # Convert to ValueError for backwards compatibility.
--> 528 raise ValueError(str(e))
529
530 @property
ValueError: Dimension 0 in both shapes must be equal, but are 0 and 1. Shapes are [0,800] and [1,800].
Might this be related to #8397?
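One possible reading of the `[0,800]` vs. `[1,800]` error above, offered as an assumption rather than a confirmed diagnosis: the stateful LSTM fixes the batch size at 1, and a batch of 1 cannot be divided among 8 GPUs. The sketch below assumes each replica receives `batch_size // gpus` samples, with the last replica taking the remainder; `slice_sizes` is a hypothetical helper written for illustration, not a Keras function.

```python
def slice_sizes(batch_size, n_parts):
    """Number of samples each replica would get under integer-division slicing."""
    step = batch_size // n_parts
    return [
        batch_size - step * i if i == n_parts - 1 else step
        for i in range(n_parts)
    ]

# The VAE case from the top of the thread: 12 samples over 3 GPUs.
print(slice_sizes(12, 3))  # [4, 4, 4]

# A stateful model with a fixed batch size of 1 over 8 GPUs:
# seven replicas would receive empty slices.
print(slice_sizes(1, 8))   # [0, 0, 0, 0, 0, 0, 0, 1]
```

Under that assumption, an empty slice would produce a state of shape `[0,800]` that cannot be matched against the fixed `[1,800]` state, which is consistent with the error message.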
@flxw Hello, I have the same problem. Have you found a solution?
Same problem..... Works fine with 1 GPU, breaks with multiple GPUs
@MatthewWilletts Hi, I'm hitting the same problem, but I am confused about the "latent variables": what does that mean? And how can I tell which layers to add to the outputs? Thanks.