Has anyone tried running Keras at a larger scale, such as on a cluster or in an HPC environment? I have access to one and would like to understand whether Keras can be used effectively there. Perhaps it would require storing intermediate outputs on disk, for example with joblib, to break up the processing?
What sort of parallelism are you looking to do? Data parallelism would be very easy to set up. Model parallelism would require some changes.
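To make the data-parallel option concrete, here is a minimal sketch in plain NumPy rather than Keras. It assumes a simple linear least-squares model, and the "workers" are simulated in one process; on a real cluster each shard would sit on a different node and the averaging step would be a network reduction. The names `gradient` and `data_parallel_step` are made up for illustration and are not part of any library.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def data_parallel_step(w, X, y, n_workers, lr=0.1):
    """One synchronous data-parallel gradient step (workers simulated in-process)."""
    # Split the batch into one shard per worker.
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    # Every worker computes a gradient on its own shard with the same weights.
    grads = [gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # Average the gradients, weighted by shard size, then update once.
    sizes = [len(ys) for ys in y_shards]
    avg_grad = np.average(grads, axis=0, weights=sizes)
    return w - lr * avg_grad

# Synthetic, noise-free regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, X, y, n_workers=4)
```

Because the shard gradients are weighted by shard size, one step here matches a full-batch step on a single machine. Model parallelism is harder because the model itself, not just the data, has to be split across workers.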
I ran a derivative of char-rnn successfully on our cluster via a PBS queue system, but only on a single machine with 16 CPU cores (we have no GPUs here yet). I would be interested to find out whether it is possible to train a model on several nodes at once, but this may not be supported by Theano.
You can check out the Elephas project for Keras parallelization on Spark: https://github.com/maxpumperla/elephas
Thanks, I will check it out!
Dear sir, how can I train a model on several nodes? @tleeuwenburg