I want to train CifarNet on a single machine with 4 GPUs, but performance drops compared with training on only one GPU.
With the default script the speed is as follows:
INFO:tensorflow:global step 13900: loss = 0.7609 (0.06 sec/step)
With 4 GPUs the speed becomes much slower (I also tried num_preprocessing_threads = 1/2/4/8/16 and num_readers = 4/8, with no effect):
INFO:tensorflow:global step 14000: loss = 0.7438 (0.26 sec/step)
INFO:tensorflow:global step 14100: loss = 0.6690 (0.26 sec/step)
Four Titan X GPUs:
02:00.0 VGA compatible controller: NVIDIA Corporation Device 17c2 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 17c2 (rev a1)
06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
82:00.0 VGA compatible controller: NVIDIA Corporation Device 17c2 (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation Device 17c2 (rev a1)
+------------------------------------------------------+
| NVIDIA-SMI 352.30     Driver Version: 352.30         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...    On | 0000:02:00.0     Off |                  N/A |
| 28%   67C    P2    75W / 250W |    228MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
... (output for the remaining GPUs omitted)
32 logical processors, each as follows:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Could anyone give some advice?
I have read some issues about multi-GPU in this repo, but still can't solve this.
I think it is caused by I/O, because I notice that when training on a single GPU the GPU utilization is above 90%, whereas with 4 GPUs it is only about 20%.
And I don't think it's due to the hardware of my machine.
@sguada: Sergio, could you provide some advice on this? Thanks.
@concretevitamin @sguada Hi, could anyone help with some advice?
In TensorBoard, check the 'queue' scalar summaries under Events (see the sketch at the end of this comment). For me, the bottleneck appears to be the input pipeline. With 3 GPUs I get ~0.2 seconds per step; my CPU cannot keep the input-pipeline queues stocked with enough data to feed 3 GPUs. With 2 GPUs I get ~0.15 seconds per step, and the queues are almost always full.
Since the GPUs synchronize every step, the GPUs that receive their data first are forced to wait for the GPUs that had to wait for the queue to fill. I'm not sure how to test this, though.
Edit: With 1 GPU I also get ~0.15 seconds per step, the same as with 2 GPUs. I think this supports my input-bottleneck theory.
I am using Titans as well, so the trend is consistent with the ~0.26 seconds per step you see with your 4 Titan Xs. Moving the data from an HDD to an SSD did not seem to help, presumably because the bottleneck is not in reading the data but in the processing that happens afterwards.
Edit 2: Something doesn't add up, though. Using 2 GPUs and a batch size of 256, my queues stay filled, which means I am producing a total of 512 images per synchronized step. However, if I use 3 GPUs and lower the batch size to 128, I cannot keep the queues filled. This doesn't make sense to me, since 128 * 3 = 384 and 384 < 512: if I can produce 512 images for two GPUs running simultaneously, I should be able to produce 384 for three.
Edit 3: I just remembered that my CPU has 40 PCIe lanes, which get divided among the 3 GPUs as 16x/16x/8x, so the third GPU has half as many lanes. I believe this is why adding the third GPU does not help.
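Going back to the queue summaries mentioned above: for anyone who wants a scalar they can watch directly, here is a minimal sketch. The queue name, capacity, and shapes are made up for illustration; slim's input pipeline already emits similar 'fraction_of_N_full' summaries.

```python
import tensorflow as tf

# Hypothetical prefetch queue feeding 32x32x3 CIFAR images.
capacity = 1000
queue = tf.FIFOQueue(capacity=capacity,
                     dtypes=[tf.float32],
                     shapes=[[32, 32, 3]],
                     name='prefetch_queue')

# Export the queue's fill level as a TensorBoard scalar, analogous to the
# 'fraction_of_N_full' summaries that show up under Events.
tf.summary.scalar('queue/prefetch_queue/fraction_of_%d_full' % capacity,
                  tf.cast(queue.size(), tf.float32) / capacity)
```

If this scalar hovers near zero while the GPUs sit idle, the producers (readers and preprocessing threads) are the bottleneck; if it stays near 1.0, the problem is elsewhere.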
@tfboyd do you have any suggestions here?
Yup, add me as the owner. I have some fixes and a new version on the way in a couple of days.
Apologies for the sloppy comment. We have a change coming that will likely increase single-GPU throughput by 6X+ and fix the multi-GPU path. Give us one more week.
First, I want to apologize for getting overly excited and failing to read the issue in detail; I am still getting used to this forum. Now for some good news and bad news.
Bad news and me embarrassing myself.
I thought you were referencing the CIFAR-10 example, which has an issue where the image preprocessing ends up on the GPU, resulting in ~800 images/sec on a GTX 1080; moving preprocessing to the CPU boosts that to maybe 6,000-7,000 images/sec on one GPU. I did a quick drive-by of the slim model code you actually referenced. The slim examples look to place preprocessing on the CPU correctly, but you can always double-check.
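For reference, here is a minimal sketch of what "preprocessing on the CPU" looks like in practice. The preprocess_fn and the single-example tensors are placeholders for whatever your pipeline produces, not the exact slim code:

```python
import tensorflow as tf

def preprocessed_batch(image, label, preprocess_fn, batch_size):
    """Pin decoding/augmentation to the CPU so the GPUs only run the model.

    `image`/`label` are single decoded examples from the reader queue and
    `preprocess_fn` is whatever augmentation you use (both hypothetical here).
    """
    with tf.device('/cpu:0'):
        image = preprocess_fn(image)          # e.g. pad, crop, flip, standardize
        images, labels = tf.train.batch(
            [image, label],
            batch_size=batch_size,
            num_threads=8,
            capacity=4 * batch_size)
    return images, labels
```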
Good news
It may be a couple of weeks (there are always unexpected issues), but we are about to release code showing a new technique for implementing input pipelines, along with some other best practices. Our testing is almost exclusively on ImageNet, as the CIFAR images are small (32x32). Using the new input-pipeline technique should improve your ability to feed your Titan Xs with CIFAR images; I do not know if it will completely saturate them on your setup. The initial batch of code will likely show loading ImageNet, but repurposing the technique for CIFAR should not be hard.
Thoughts for @vonclites
I am taking a wild guess, but based on you changing your batch sizes around, the issue is likely not the CPU keeping up with preprocessing (image manipulation and such); it is the CPU trying to keep the GPUs filled with data (simply moving the data onto the GPU, or GPUs in this instance). The new input pipeline I referenced in my "Good news" section above has been shown to significantly improve saturating GPUs with data. I need to hedge and say each setup is different: I did testing on an AWS p2.8xlarge and saw significant improvement (I cannot share the data just yet), and K80s are not Titan Xs, but I also know some testing occurred on 8x Tesla P100s, again using ImageNet.
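The released code may differ in detail, but the core idea is to keep a staged copy of the next batch resident on each GPU so the compute never waits for the host-to-device transfer. A rough sketch, assuming TF 1.x and tf.contrib.staging (the helper and names below are illustrative, not the benchmark code itself):

```python
import tensorflow as tf
from tensorflow.contrib.staging import StagingArea

def staged_batch(cpu_images, cpu_labels, gpu_id):
    """Stage one batch ahead on a GPU so compute never waits on the copy."""
    with tf.device('/gpu:%d' % gpu_id):
        area = StagingArea(dtypes=[cpu_images.dtype, cpu_labels.dtype],
                           shapes=[cpu_images.shape, cpu_labels.shape])
        stage_op = area.put([cpu_images, cpu_labels])   # enqueue the next batch
        gpu_images, gpu_labels = area.get()             # dequeue the current batch
    return stage_op, gpu_images, gpu_labels
```

You run stage_op once before the training loop to prime the buffer, then run it together with the train op on every step so the copy for step N+1 overlaps the compute for step N.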
I will try to be less wordy in the future and also not comment too quickly. While the CIFAR dataset is not that interesting to most people for performance, I think it is fun to see how fast it can go.
Thank you for understanding and posting. I am leaving this open so I can come back and paste links to the samples as soon as they are available.
Thank you @tfboyd for your response! I realize my post is not very helpful for troubleshooting this. This aspect of TensorFlow is still very much a black box to me, and I haven't had much time to figure out how to really trace the bottleneck. It sounds like this issue may resolve itself as you introduce the new pipelines.
Also, I'm very excited to hear this good news. I've been trying to migrate my code away from slim and towards tf.learn's input functions. I was drawn to this because of tf.learn's support for simultaneous training and evaluation using the eval monitors.
I'm wondering if I should hold off porting my input-pipeline code so that it will interface better with tf.learn. Will the code for your new pipelines be very different? Is it meant to interface with tf.learn, or with slim? I'm unsure what I should hold off implementing...
Basically, the stage I'm at now is that I've created a SQL database containing my image metadata and image file paths. My idea is to run queries on the database so I can quickly generate different 'views' of my data, then create training and testing sets from the queried view and feed them into an input pipeline that ultimately hooks up with tf.learn code.
Any guidance is appreciated! Thanks again.
I'm guessing we're now seeing the beginnings of what @tfboyd referenced. https://github.com/tensorflow/tensorflow/issues/7951
Even before Derek's work is done, we will have code out in a few weeks showing how to use existing aspects of TF to get substantial gains. @vonclites, sorry I did not see your response. The goal is for Estimator (tf.learn) to be as fast (or very nearly as fast) as the benchmark code to be released.
As for your other question: should you move to Estimators? This is totally my personal opinion, but I think I would wait for now. Estimator (tf.learn) will be the right way to run everything and is currently being worked on for inclusion in core. If you can wait, I would stick with slim for a little longer and avoid two refactors.
@tfboyd, awesome. Thanks for the tip ;)
Edit: For anyone interested in @tfboyd's announcements, here's a good thread to follow: https://github.com/tensorflow/tensorflow/issues/7679
@vonclites The benchmark code is out. It does not use Estimators. But good news: Estimators is now in core and has a team working to add features and flexibility so people can get great performance on a range of models. If you follow that API you should have a good experience and be in good company.
The benchmark code appears to use the older methods rather than the TF-Slim code. When using the methods from models/inception/inception_train.py, I can match the performance I get from the benchmark script. With the scripts in models/slim, though, I get only about half the performance, because the GPUs are starved for data. Can the new methods for loading data to saturate the GPUs be used with the Slim code?
@psyon
the slim code has a variety of issues that are being resolved as a high-priority problem:
1) It uses NHWC, which is not optimal on GPUs.
2) It does not use fused batch norm. You can fix this by setting fused=True :-) (see the sketch after this list).
3) It does not have good variable-management options; its only option is to put all variables on GPU:0, which is not usually optimal.
4) It uses queues, which is OK for 8x K80s, but we need to move it to tf.data.Dataset (the new name of tf.contrib.data) for TF 1.4 (see the second sketch below).
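A minimal sketch of point 2, using an arg_scope so every batch-norm layer picks up the fused kernel (CifarNet itself may not use batch norm; the tiny model here is only to show the pattern):

```python
import tensorflow as tf
slim = tf.contrib.slim

def tiny_model(images, num_classes, is_training=True):
    # fused=True makes slim.batch_norm use the single fused kernel,
    # which is noticeably faster on GPUs than the composed ops.
    with slim.arg_scope([slim.batch_norm], fused=True, is_training=is_training):
        net = slim.conv2d(images, 64, [3, 3],
                          normalizer_fn=slim.batch_norm, scope='conv1')
        net = slim.flatten(net)
        return slim.fully_connected(net, num_classes,
                                    activation_fn=None, scope='logits')
```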
I hope to see all of the SLIM models reworked by the end of Q4 at the latest.
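And a rough sketch of point 4, an input pipeline on tf.data using TF 1.4 naming. The feature keys and PNG decoding follow the usual slim CIFAR-10 conversion, but treat them as assumptions and adjust to your actual TFRecords:

```python
import tensorflow as tf

def parse_record(serialized):
    # Feature keys and PNG encoding assumed from the slim CIFAR-10 conversion;
    # adjust to whatever your TFRecords actually contain.
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
        'image/class/label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.cast(tf.image.decode_png(features['image/encoded'], channels=3),
                    tf.float32)
    image.set_shape([32, 32, 3])
    return image, features['image/class/label']

def input_fn(filenames, batch_size):
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parse_record, num_parallel_calls=8)
    dataset = dataset.shuffle(buffer_size=10000).repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(2)   # keep a couple of batches ready for the GPUs
    return dataset.make_one_shot_iterator().get_next()
```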