Your question:
I am trying to train a few Keras models on a cluster with 3 CPUs to see whether I can reduce training time (while maintaining the same accuracy). I have models with Dense layers and models with LSTM layers.
However, my fitting time does not decrease much (120 s to 118 s) and sometimes even increases (3.5 s to 3.6 s). Could it be that my model is way too small to benefit from data parallelism? Or that my data set is too small? To give concrete numbers, I have m = 11k training examples and P = 2-30k parameters in each model.
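A rough back-of-envelope check makes the scale of the problem concrete. Assuming float32 (4-byte) parameters and the parameter counts quoted above, the full gradient payload exchanged per optimizer step is tiny for the small models:

```python
# Rough back-of-envelope: gradient payload per optimizer step, assuming
# float32 (4 bytes per parameter). Parameter counts are the ones quoted above.
BYTES_PER_PARAM = 4  # float32

def gradient_payload_kb(num_params: int) -> float:
    """Size of one full gradient exchange, in kilobytes."""
    return num_params * BYTES_PER_PARAM / 1024

small = gradient_payload_kb(30_000)     # upper end of the small models
large = gradient_payload_kb(4_000_000)  # upper end of the large models

print(f"small model: {small:.0f} KB per step")  # ~117 KB
print(f"large model: {large:.0f} KB per step")  # ~15625 KB
```

With so little compute per step on the small models, the fixed per-step synchronization and launch overhead can easily cancel out whatever compute time each worker saves.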
When I ran tests with m = 350k training examples and a model with P = 700k-4M parameters, I did observe some gain (up to 40% in training time), although I had to run some extra epochs to compensate for the decreased accuracy.
Thanks in advance!
Thiago.
I am interested to know what @alsrgv thinks about this.
Typically Horovod will give you the most benefit when the bottleneck in your training process is the compute unit (the CPU or GPU).
In your case, what you may be seeing is that the compute time is relatively insignificant in comparison to, say, the I/O, and if you're running Horovod on a single machine with a single disk, then Horovod will not be able to help with that problem.
Similarly, if you're running across multiple machines and you have a slow network connection between them, then it's possible for that to be the bottleneck in your training process.
One way to find out more is to look at the Horovod timeline, which will show you how much time is being spent in the gradient aggregation at various points.
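For example, the timeline can be enabled with the `HOROVOD_TIMELINE` environment variable at launch; the output path and script name below are placeholders for your own:

```shell
# Record a Horovod timeline while training; /tmp/timeline.json and
# train.py are placeholders for your own paths.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 3 python train.py
# Then load /tmp/timeline.json in chrome://tracing to inspect how much
# of each step is spent in gradient aggregation vs. compute.
```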
But in general, yes, it's entirely possible for things to scale poorly because running computations on the model isn't the bottleneck.
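A crude Amdahl's-law sketch (illustrative numbers, not measurements from your run) shows why: only the fraction of step time that is actually parallelized shrinks as you add workers.

```python
def speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: only the parallel fraction of step time shrinks."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# Illustrative numbers, not measurements: if compute is only 10% of each
# step (I/O and synchronization dominate), 3 workers barely help...
print(f"{speedup(0.10, 3):.2f}x")  # ≈ 1.07x
# ...whereas a compute-bound job (90% parallelizable) scales much better.
print(f"{speedup(0.90, 3):.2f}x")  # ≈ 2.50x
```

This is consistent with the observation above: the 700k-4M-parameter runs are much more compute-bound per step, so they show real gains while the 2-30k-parameter runs do not.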
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.