Keras: Why model averaging does not work for multi-GPU parallelism?

Created on 8 Jul 2016 · 4 comments · Source: keras-team/keras

I'm trying multi-GPU parallelism by averaging the models returned from each GPU after each training batch, same as @zhangtemplar in https://github.com/fchollet/keras/issues/106.

Say I have 4 GPUs named ['gpu0', 'gpu1', 'gpu2', 'gpu3']. At the beginning, each model is initialized via set_weights to match the model on 'gpu0'. Then during training, each model is fed exactly the same training data. After each training batch, the 4 models are averaged using get_weights and set_weights.
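For reference, the averaging step described above can be sketched like this (a minimal sketch, not the poster's actual script; the function name is illustrative):

```python
import numpy as np

def average_weight_lists(weight_lists):
    """Element-wise mean of several weight lists, layer by layer.

    Each item in `weight_lists` is what one replica's
    `model.get_weights()` returns: a list of numpy arrays.
    """
    return [np.mean(np.stack(layer_ws), axis=0)
            for layer_ws in zip(*weight_lists)]

# Usage with actual Keras models (assuming `models` is a list of
# identically-structured, compiled models):
#   averaged = average_weight_lists([m.get_weights() for m in models])
#   for m in models:
#       m.set_weights(averaged)
```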

In theory, these 4 models should stay equal, since they're initialized identically and trained on the same data. However, after model averaging, the training loss keeps increasing and eventually becomes NaN.

So, my question is: what's wrong with the above model-averaging mechanism? I'm using Keras 1.0.4 and Theano 0.9.0.

Please make sure that the boxes below are checked before you submit your issue. Thank you!

  • [ ] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
  • [ ] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
  • [ ] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
stale


All 4 comments

I think you will have more luck after reading and implementing this paper:
http://arxiv.org/abs/1412.6651
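That paper proposes Elastic Averaging SGD (EASGD): instead of hard-averaging replicas, each worker is elastically pulled toward a shared center variable, which in turn drifts toward the workers. A minimal synchronous sketch of that update rule (hyperparameter values and function name are illustrative, not from the paper's experiments):

```python
import numpy as np

def easgd_step(workers, center, grads, lr=0.01, alpha=0.05):
    """One synchronous EASGD update on flat parameter vectors.

    `workers` is a list of per-replica parameter arrays, `center` the
    shared center variable, `grads` the per-worker gradients.
    """
    new_workers = []
    drift = np.zeros_like(center)
    for x, g in zip(workers, grads):
        diff = x - center
        # Each worker takes a gradient step and is pulled toward the center.
        new_workers.append(x - lr * g - alpha * diff)
        drift += diff
    # The center moves toward the average of the workers.
    new_center = center + alpha * drift
    return new_workers, new_center
```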

Well, not all processes are deterministic. Dropout and other noise layers are stochastic, and by default your training data gets shuffled during training.
Unless you use the same seed, you can get different results even with the same data.
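To rule out those sources of divergence, you would pin the random seeds and disable shuffling on every replica. A sketch under the assumption of Keras 1.x on the Theano backend, where weight initialization and shuffling go through numpy's RNG (the seed value is arbitrary):

```python
import random

import numpy as np

SEED = 1234  # illustrative value; any fixed integer works
random.seed(SEED)
np.random.seed(SEED)  # seed numpy BEFORE building the models

# Also disable per-epoch shuffling so every replica sees the same
# batch order, e.g.:
#   model.fit(X, y, shuffle=False, ...)
```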

@daweileng Hi David, did you finally achieve data parallelism with keras+tensorflow? If yes, could you please share more details? Thank you.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
