Your question:
I am trying to train a few Keras models on a cluster with 3 CPUs to see whether I can reduce training time (while maintaining the same accuracy). I have models with Dense layers and models with LSTM layers.
However, my fitting time does not decrease much (120 s to 118 s) and sometimes even increases (3.5 s to 3.6 s). Could it be that my model is way too small to benefit from data parallelism? Or that my data set is too small? To give concrete numbers, I have m = 11k training examples and P = 2-30k parameters in each model.
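A rough back-of-envelope check makes the scale of the problem concrete. Assuming float32 (4-byte) parameters and the parameter counts quoted above, the full gradient payload exchanged per optimizer step is tiny for the small models:

```python
# Rough back-of-envelope: gradient payload per optimizer step, assuming
# float32 (4 bytes per parameter). Parameter counts are the ones quoted above.
BYTES_PER_PARAM = 4  # float32

def gradient_payload_kb(num_params: int) -> float:
    """Size of one full gradient exchange, in kilobytes."""
    return num_params * BYTES_PER_PARAM / 1024

small = gradient_payload_kb(30_000)     # upper end of the small models
large = gradient_payload_kb(4_000_000)  # upper end of the large models

print(f"small model: {small:.0f} KB per step")  # ~117 KB
print(f"large model: {large:.0f} KB per step")  # ~15625 KB
```

With so little compute per step on the small models, the fixed per-step synchronization and launch overhead can easily cancel out whatever compute time each worker saves.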
When I ran tests with m = 350k training examples and a model with P = 700k-4M parameters, I did observe some gain (up to 40% in training time), although I had to run some extra epochs to compensate for the decreased accuracy.
Thanks in advance!
Thiago.
I am interested to know what @alsrgv thinks about this.
Typically Horovod will give you the most benefit when the bottleneck in your training process is the compute unit (the CPU or GPU).
In your case, what you may be seeing is that the compute time is relatively insignificant in comparison to, say, the I/O, and if you're running Horovod on a single machine with a single disk, then Horovod will not be able to help with that problem.
Similarly, if you're running across multiple machines and you have a slow network connection between them, then it's possible for that to be the bottleneck in your training process.
One way to find out more is to look at the Horovod timeline, which will show you how much time is being spent in the gradient aggregation at various points.
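For example, the timeline can be enabled with the `HOROVOD_TIMELINE` environment variable at launch; the output path and script name below are placeholders for your own:

```shell
# Record a Horovod timeline while training; /tmp/timeline.json and
# train.py are placeholders for your own paths.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 3 python train.py
# Then load /tmp/timeline.json in chrome://tracing to inspect how much
# of each step is spent in gradient aggregation vs. compute.
```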
But in general, yes, it's entirely possible for things to scale poorly because running computations on the model isn't the bottleneck.
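A crude Amdahl's-law sketch (illustrative numbers, not measurements from your run) shows why: only the fraction of step time that is actually parallelized shrinks as you add workers.

```python
def speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: only the parallel fraction of step time shrinks."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# Illustrative numbers, not measurements: if compute is only 10% of each
# step (I/O and synchronization dominate), 3 workers barely help...
print(f"{speedup(0.10, 3):.2f}x")  # ≈ 1.07x
# ...whereas a compute-bound job (90% parallelizable) scales much better.
print(f"{speedup(0.90, 3):.2f}x")  # ≈ 2.50x
```

This is consistent with the observation above: the 700k-4M-parameter runs are much more compute-bound per step, so they show real gains while the 2-30k-parameter runs do not.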
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.