Keras: Why are Keras apps using multi_gpu_model slower than a single GPU?

Created on 27 Jan 2018 · 38 comments · Source: keras-team/keras

multi_gpu_model comes from keras.utils and wraps an application model to train on multiple GPUs. However, using multi_gpu_model seems to make training heavier and slower. Is this expected?
The GPU I am using is NVIDIA Tesla P100.

All 38 comments

Besides, the TensorFlow version I use is 1.4.0, and the Keras version is 2.1.3. All their settings are default. The test example is cifar10_cnn.py.

I am reopening this post because the issue still exists. Whether I put the model weights on the CPU or the GPU, most model examples such as cifar10_cnn.py perform worse than with a single GPU.

What are the benchmarks you are seeing? Is the code you are running unmodified, i.e. exactly the same as the cifar10_cnn example?

I ran the benchmarks myself, and the examples are taken directly from the keras/examples folder.
For example, for mnist_cnn.py I added the following line before model.compile:

model = keras.utils.training_utils.multi_gpu_model(model, 4)
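
For context, here is roughly how that modification sits in the example. The model definition below is abbreviated and approximate (only the multi_gpu_model line is the actual change), and it assumes 4 GPUs are visible:

    import keras
    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    # Abbreviated stand-in for the mnist_cnn.py model definition.
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(Flatten())
    model.add(Dense(10, activation='softmax'))

    # The added line: wrap the model across 4 GPUs *before* compiling it.
    model = keras.utils.training_utils.multi_gpu_model(model, 4)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])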

And here is the benchmark for mnist_cnn.py on the NVIDIA Tesla P100 (4 GPUs, 16 GB memory per device):

original_single: gpu=1, perf = 5s/epoch 75us/step
multi_gpu_model: gpu=2, perf = 5s/epoch 74us/step
multi_gpu_model: gpu=3, perf = 5s/epoch 81us/step
multi_gpu_model: gpu=4, perf = 6s/epoch 103us/step

As for cifar10_cnn.py, I also added the one line above, and the performance is as follows:

# By default, data_augmentation = True in cifar10_cnn.py
original_single: gpu=1, perf = 23s 15ms/step
multi_gpu_model: gpu=2, perf = 23s 15ms/step
multi_gpu_model: gpu=3, perf = 24s 15ms/step
multi_gpu_model: gpu=4, perf = 22s 14ms/step

We see there is hardly any difference because the CPU-side data augmentation is really the bottleneck.
If we turn off data_augmentation, the performance is as follows:

# data_augmentation = False
original_single: gpu=1, perf = 14s 286us/step
multi_gpu_model: gpu=2, perf = 16s 325us/step
multi_gpu_model: gpu=3, perf = 19s 389us/step
multi_gpu_model: gpu=4, perf = 22s 445us/step

We see that multi_gpu_model is never faster; it is actually worse.

@ghostplant @anj-s I'm having the same issue! #9502 Any updates on this?

Any updates yet? @ghostplant

@mohapatras It seems that CPU-side data pre-processing can be one of the reasons that greatly slows down multi-GPU training. Could you try disabling some pre-processing options such as data augmentation and see whether there is any boost?

Besides, the current version of multi_gpu_model seems to benefit only large NN models such as Xception, where weight synchronization is not the bottleneck. When it is wrapped around a simple model such as mnist_cnn or cifar10_cnn, weight synchronization is quite frequent and makes the whole run much slower.

I'm also working on a customized multi-GPU implementation to see whether there is a better way.

Hi, I found some explanations for this issue.

It seems that not all models benefit from multi_gpu_model.
Different models have different scalability due to the overhead of weight synchronization.

ResNetV1 and ResNetV2 are a typical pair of examples: ResNetV2 has better scalability than ResNetV1.

There is a balance between the cost of training one mini-batch and the cost of weight synchronization. InceptionV3 has a heavy computational cost for training one mini-batch but relatively few weights to synchronize, so this model gains a decent boost from multi_gpu_model.

However, any model with a large Dense layer usually scales badly, just like mnist_mlp: it has a light computational cost for training one mini-batch, while its weights are too large to synchronize efficiently. In the mnist_mlp example, the time spent on one weight synchronization is enough for a single GPU to train MANY mini-batches, so mnist_mlp will not benefit from multi_gpu_model; its dense network design results in bad scalability.

It also means that the same model trained on different GPU architectures can give a different answer about whether it benefits from multi_gpu_model, since that largely depends on whether the GPU can train one mini-batch faster than a weight synchronization completes. So another conclusion is that the faster a GPU is, the less likely multi_gpu_model is to boost your model.
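
To illustrate the trade-off, here is a toy model of the per-step time; the millisecond numbers are invented purely for illustration, not measurements:

    def step_time(compute_ms, sync_ms, n_gpus):
        # Compute is divided across GPUs; the synchronization cost is not.
        return compute_ms / n_gpus + sync_ms

    for name, compute_ms, sync_ms in [('inception_like', 200.0, 10.0),
                                      ('mlp_like', 5.0, 20.0)]:
        t1 = step_time(compute_ms, sync_ms, 1)
        for n in (2, 4):
            speedup = t1 / step_time(compute_ms, sync_ms, n)
            print('%s on %d GPUs: %.2fx speedup' % (name, n, speedup))

Even this is optimistic, since it assumes the synchronization cost stays fixed as GPUs are added; in practice it grows with the number of replicas, which is how the multi-GPU runs can end up slower in absolute terms, as in the benchmarks above.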

It's true that different models have different challenges in scalability.
But also note that with ResNetV1, https://github.com/tensorflow/benchmarks can scale with >7.5x speedup on 8 P100s or V100s.

@ppwwyyxx Are they testing a ResNetV1 with the same depth? Besides, if they use distributed TF as the backend like the link you added, their benchmarks may be based on RDMA or other high-efficiency network media, which would shorten the overhead of weight synchronization.

I was talking about ResNet50, that's what they are mainly testing with, e.g. https://www.tensorflow.org/performance/benchmarks.
I was not talking about distributed backend. The tensorflow code I linked to, as well as pytorch, caffe2, mxnet, can all scale ResNet50 training to at least 7.5x on 8 P100s on a single machine, under some good parameters (batch size, etc).

If using the ResNet50 provided by keras.applications, I can see a nearly 2x boost using 2 Tesla P100 GPUs. How about your benchmark?

@ghostplant could you share your code somewhere?

    import tensorflow as tf
    from keras.applications import ResNet50
    from keras.utils import multi_gpu_model
    import numpy as np

    num_samples = 1000
    height = 224
    width = 224
    num_classes = 1000

    with tf.device('/cpu:0'):
        model = ResNet50(weights=None,
                         input_shape=(height, width, 3),
                         classes=num_classes)
    parallel_model = multi_gpu_model(model, 2)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop')
    x = np.random.random((num_samples, height, width, 3))
    y = np.random.random((num_samples, num_classes))
    parallel_model.fit(x, y, epochs=20, batch_size=256)

Your code uses NHWC image layout which is slower than NCHW. As you've pointed out, the slower the code is, the better it scales.
https://www.tensorflow.org/performance/benchmarks shows 422 images/s for ResNet50 training on two P100s. So you should expect your code to finish each epoch (1000 samples) in 2.36s. I assume it's not the case.
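
(For reference, the 2.36 s comes from simple throughput arithmetic on the 422 images/s figure from that page:)

    # Expected epoch time for 1000 samples at the reported 2x P100 throughput.
    samples = 1000
    images_per_sec = 422.0
    print('%.2f s/epoch expected' % (samples / images_per_sec))  # ~2.37 s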

In my experiments, if using NHWC format, it can finish training 1000 samples in 5 sec/epoch using 2 GPUs, and 3 sec/epoch using 4 GPUs.
According to https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py, channels_last is said to give the best performance with the TensorFlow backend.

@ghostplant Interesting conclusions regarding whether multi_gpu could benefit a model.

I'm also finding that training (on custom models) is taking longer with 2 or 3 Tesla K80s than just 1.

Am I correct in thinking that multi_gpu still has the advantage that you have more GPU memory available, and can therefore run more data through training or use larger batch sizes?

@TristanJM Yes, regardless of whether the model gets a computational boost from multiple GPUs (computational scaling), another case where we have to use multi_gpu_model is when a single GPU cannot train the model with a large batch_size; the model then benefits from multi_gpu_model through memory-capacity scaling.

In my experiments, if using NHWC format, it can finish training 1000 samples in 5 sec/epoch using 2 GPUs, and 3 sec/epoch using 4 GPUs.

So it's already 2x slower than what it should be, then certainly it'll be able to scale better. The slower the model is, the better it scales.

it says that channels_last for Tensorflow will have the best performance.

This is definitely not true for Tensorflow on a P100. Cudnn implementation on every GPU architecture before Volta favors NCHW over NHWC.
It may be true for Keras, however. For example, there is a recent performance fix in https://github.com/keras-team/keras/pull/8785 that makes it use a faster batchnorm kernel for NCHW. I won't be surprised if NHWC is faster than NCHW before this PR. But the PR seems to suggest that NCHW is faster now.

@ppwwyyxx
I updated the channel mode for the ResNet50 (v1) script, and it is 1 second faster per epoch (but only on 2 GPUs):

       import tensorflow as tf
       from keras.applications import ResNet50
       from keras.utils import multi_gpu_model
       import numpy as np
       num_samples = 1000
       height = 224
       width = 224
       num_classes = 1000
       with tf.device('/cpu:0'):
           model = ResNet50(weights=None,
                            input_shape=(3, height, width),
                            classes=num_classes)
       parallel_model = multi_gpu_model(model)
       parallel_model.compile(loss='categorical_crossentropy',
                              optimizer='rmsprop')
       x = np.random.random((num_samples, 3, height, width))
       y = np.random.random((num_samples, num_classes))
       parallel_model.fit(x, y, epochs=20, batch_size=256)

After the changes, it can finish training 1000 samples in 4 sec/epoch using 2 GPUs, but still 3 sec/epoch using 4 GPUs.

Hi, could you help me? I installed TensorFlow 1.4 and my Keras is 2.1.5. When I specify the input shape as NCHW, an error occurs. It seems that the format is not supported.

Do you set channels_first in ~/.keras/keras.json?
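
For reference, the format can be set either in the config file or at runtime through the Keras backend API:

    # Option 1: edit ~/.keras/keras.json and set "image_data_format": "channels_first".
    # Option 2: set it at runtime, before any model is built.
    from keras import backend as K

    K.set_image_data_format('channels_first')
    print(K.image_data_format())  # -> 'channels_first'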

Yes. I have changed it already.

The resnet50 in keras.applications says:

    if K.image_data_format() == 'channels_first' and K.backend() == 'tensorflow':
        warnings.warn('You are using the TensorFlow backend, yet you '

Which TensorFlow version do you use?

My environment: pip3 install tensorflow-gpu==1.4.1 keras==2.1.5, and there is no such warning using the code I pasted above.

@USTClj
I think Keras multi_gpu_model is inefficient in its input splitting. According to NV profiling, any Keras model using 4 GPUs pushes a huge amount of input data to every GPU, which makes the memory copies 4 times larger than in the TensorFlow benchmark script; this is at least one bottleneck that reduces Keras' scalability.

It would be better if there were an example showing how to use TensorFlow native data tensors (TFRecord) as the input of a Keras multi_gpu_model, which might be faster than the current inefficient get_slice method.
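
A rough sketch of the idea, modeled loosely on the mnist_tfrecord.py example in keras/examples; the random in-memory data is a stand-in for a real TFRecord pipeline, and I have not verified this path together with multi_gpu_model, so treat it as an assumption:

    import numpy as np
    import tensorflow as tf
    from keras.layers import Input, Flatten, Dense
    from keras.models import Model

    # Stand-in data; a real pipeline would parse TFRecord files here instead.
    x = np.random.random((1000, 28, 28)).astype('float32')
    y = np.eye(10, dtype='float32')[np.random.randint(0, 10, size=1000)]

    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.repeat().shuffle(1000).batch(64)
    x_batch, y_batch = dataset.make_one_shot_iterator().get_next()

    # The model reads its input directly from the TF tensor, so Keras never has
    # to slice numpy arrays on the CPU with get_slice.
    inputs = Input(tensor=x_batch)
    net = Flatten()(inputs)
    outputs = Dense(10, activation='softmax')(net)
    model = Model(inputs, outputs)
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  target_tensors=[y_batch])
    model.fit(epochs=1, steps_per_epoch=1000 // 64)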

@ppwwyyxx I found some partial reasons why Keras is around 2x slower than TensorFlow when training ResNet50.
Firstly, Keras Conv2D adds a bias weight to every conv layer (see the short check after this comment), which brings extra computation for the bias forward and backward passes.

Secondly, even though Keras is configured for the channels_first image format, it still performs extra GPU matrix computation to frequently swap between NCHW and NHWC. In other words, it is not computing fully in NCHW, and all those tensor conversions add significant extra work.

Thirdly, since Keras 2.2.0, multi_gpu_model with cpu_merge=True allows the data to be split in place before it is pushed to GPU memory, which slightly reduces the overhead from 2x to 1.8x~1.9x.

There may still be other reasons that have not been found yet.
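
Regarding the first point, here is a quick self-contained check (my own toy layers, not the keras.applications code) showing the bias weights that Conv2D adds by default and the use_bias flag that removes them:

    from keras.layers import Conv2D, Input
    from keras.models import Model

    inp = Input(shape=(224, 224, 3))
    with_bias = Model(inp, Conv2D(64, (7, 7))(inp))                    # default use_bias=True
    without_bias = Model(inp, Conv2D(64, (7, 7), use_bias=False)(inp))

    # The difference is exactly the 64 bias terms, plus the extra bias
    # forward/backward work during training.
    print(with_bias.count_params() - without_bias.count_params())  # -> 64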

multi_gpu_model with cpu_merge=True allows the data to be split in place before it is pushed to GPU memory

The optimal solution is to not split at all. As I put it in the tensorpack docs here:

Splitting a tensor for data-parallel training makes no sense at all, only to put unnecessary shape constraints on the data. By letting each GPU train on its own input tensors, they can train on inputs of different shapes simultaneously.

The problem is that the full data sits on the CPU, and the GPUs cannot access it without the full data being copied to each GPU, which causes huge I/O between host and device. I have tested this: the split does not just put constraints on the data; it also largely reduces the PCIe traffic, so you can see a slight boost.

I agree that if there is a way to feed the data without splitting at all, it will be much more efficient.

I also experienced this problem yesterday. When I increased the batch_size, multi-GPU became faster than a single GPU. Maybe it is because increasing the batch_size makes the GPU computational cost larger, while the communication cost between CPU and GPU does not change.

That's because you're looking at the wrong metric: seconds/step. It makes sense that this is slower than for non-distributed training, because distributed training does more operations. Each individual GPU has to do backpropagation on a batch, send the gradients back to RAM, apply the gradients to the parameters stored in RAM, and finally sync the parameter values stored in RAM with those stored in GPU memory. Non-distributed training only has to apply gradients to the parameters stored in GPU memory, so it makes sense that a single batch takes less time to train.

Distributed training sees a speedup when you look at the number of global steps / sec (i.e. the number of batches trained by all the GPU workers). This will cause the model to converge faster.

@mattdornfeld No. The batch is evenly split across the GPUs, so it's not the wrong metric.

I also experienced this problem yesterday. When I increased the batch_size, multi-GPU became faster than a single GPU. Maybe it is because increasing the batch_size makes the GPU computational cost larger, while the communication cost between CPU and GPU does not change.

That's because Keras will split each batch according to the total number of GPUs used in multi_gpu_model. A model trained with a single GPU and a batch size of 64 will be approximately as fast as the same model trained using multi_gpu_model with 2 GPUs and the same batch size, since each GPU will process 32 samples at once. So to be able to compare results, you should multiply the original batch size by the number of GPUs used.

Thus, if 2 GPUs are used, your batch size should be 128. Keras will split the samples into 2 groups of 64 samples, one per GPU. That way your processing should be ~2x faster than using a single GPU.
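
A concrete version of that advice (the numbers are just an example):

    # Keep the per-GPU batch constant by scaling the global batch with the GPU count.
    n_gpus = 2
    per_gpu_batch = 64
    global_batch = per_gpu_batch * n_gpus  # pass this as batch_size to parallel_model.fit(...)
    print('global batch %d -> %d samples per GPU per step' % (global_batch, per_gpu_batch))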

@TristanJM Yes, regardless of whether the model gets a computational boost from multiple GPUs (computational scaling), another case where we have to use multi_gpu_model is when a single GPU cannot train the model with a large batch_size; the model then benefits from multi_gpu_model through memory-capacity scaling.

Hello, I recently encountered an OOM (out of memory) error, so I used two GPUs in almost the same way as the code you shared here (with tf.device cpu, parallel model, gpus=2). But I still get the same OOM error even with two GPUs. I don't understand how to share the memory pressure across multiple GPUs. My assumption is that the Keras multi-GPU function basically just makes an identical replica on the other GPU but doesn't address the memory problem. Do you know how to deal with the OOM error with multiple GPUs?

Thanks!
Jianing

Hi @jianing-sun, how did it go with the OOM problem? I have the same issue here.

Edit: I ended up reducing the depth and width of the neural net to fit it onto the GPU cards.

Hi
I also ran into an OOM issue with a batch size that fits perfectly on a single card (4 is fine for a single card, but 8 with 2 cards raises OOM). I also found that when I train the model with multiple GPUs and a larger total batch size (from 4 to 14, where I use 7 cards and a batch size of 2 per card), the total number of steps per epoch is not reduced but increased. Any idea why? Thanks in advance!
