Training an SSD model with multiple GPUs does not scale well when more than two GPUs are used.
Here is the approximate throughput (samples per second) with an increasing number of GPUs and batch sizes:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 32 | 11.4 samples/sec |
| 2 | 64 | 18.6 samples/sec |
| 4 | 128 | 20.9 samples/sec |
| 8 | 256 | 22.3 samples/sec |
This issue can be reproduced by training an SSD model as described here [1].
@zhreshold - can you take a look here?
I guess CPU/disk I/O may be the limiting factor when the data is loaded into the network.
I suggest using smaller batch sizes to verify, e.g. batch size 8 for 1 GPU, 16 for 2 GPUs, etc.
BTW, in my experience with the SSD example, very large batch sizes didn't beat smaller ones and were sometimes even worse.
@zhreshold Could you take a look at mx.image in the nnvm branch? It's a fast image preprocessing pipeline.
Isn't mx.image a wrapper for cv2?
I guess cv2 and numpy operations are fast enough for now; the priority is to pack single images into a record file and feed the iterator with ImageRecordIter (see the sketch below).
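Roughly what the iterator setup would look like once the images are packed (a minimal sketch; the path, the 300x300 data shape, and the batch size are placeholders, not the example's actual settings):

```python
import mxnet as mx

# Minimal sketch, assuming the images were already packed into train.rec;
# the path, the 300x300 data shape and the batch size are placeholders.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='data/train.rec',  # packed record file
    data_shape=(3, 300, 300),      # channels, height, width
    batch_size=32,
    shuffle=True,
    preprocess_threads=4,          # JPEG decoding runs in 4 C++ threads, no GIL
)

batch = train_iter.next()          # batch.data[0] is an NDArray of shape (32, 3, 300, 300)
```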
I hope someone can confirm this behavior before I do anything about it; I'm too busy these days.
With smaller batch sizes I get the following throughput:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.9 samples/sec |
| 2 | 16 | 14.0 samples/sec |
| 4 | 32 | 21.0 samples/sec |
| 8 | 64 | 21.5 samples/sec |
With file I/O and data augmentation disabled, I get the following throughput:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.8 samples/sec |
| 1 | 32 | 11.4 samples/sec |
| 1 | 64 | 11.5 samples/sec |
| 2 | 16 | 14.5 samples/sec |
| 2 | 64 | 19.4 samples/sec |
| 2 | 128 | 20.7 samples/sec |
| 4 | 32 | 25.3 samples/sec |
| 4 | 128 | 38.5 samples/sec |
| 4 | 256 | 40.9 samples/sec |
| 8 | 64 | 38.9 samples/sec |
| 8 | 256 | 68.3 samples/sec |
| 8 | 512 | 74.5 samples/sec |
As you said, @zhreshold, the throughput is much better without file I/O and data augmentation. Is there anything else you would optimize to further speed up multi-GPU training?
@gerhardneuhold Could you disable only the data augmentation in config/config.py:
cfg.TRAIN.RAND_SAMPLERS = []     # no random crop/pad samplers
cfg.TRAIN.RAND_MIRROR = False    # no random horizontal mirroring
cfg.TRAIN.INIT_SHUFFLE = False   # do not shuffle the training list at start
cfg.TRAIN.EPOCH_SHUFFLE = False  # do not shuffle the training list after each epoch
Then I could get a sense of how I/O is affecting the throughput. Much appreciated.
BTW, this is trained on an HDD instance, right?
Sure @zhreshold, here are some throughput rates with only data augmentation disabled:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.9 samples/sec |
| 1 | 32 | 11.4 samples/sec |
| 2 | 16 | 14.2 samples/sec |
| 2 | 64 | 18.2 samples/sec |
| 4 | 32 | 25.5 samples/sec |
| 4 | 128 | 27.7 samples/sec |
| 8 | 64 | 28.6 samples/sec |
| 8 | 256 | 29.5 samples/sec |
Data is read from SSD (EBS io1).
@zhreshold It's a wrapper, but it bypasses the GIL by pushing image decoding to the engine. It also uses mx.nd instead of numpy for computation, so it's much faster when you increase NUM_CPU_WORKER.
ImageRecordIter is fast not because it reads from a single file (that helps a little, but not that much); it's because it uses 4 threads for decoding by default.
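To illustrate the difference, a minimal sketch of the engine-backed decode (the file path is a placeholder):

```python
import mxnet as mx

with open('data/example.jpg', 'rb') as f:  # placeholder path
    buf = f.read()

# The decode is dispatched to the backend engine as an operator, so the
# Python thread does not hold the GIL while the JPEG is being decoded.
img = mx.image.imdecode(buf)   # NDArray, HWC layout, RGB
print(img.shape)               # accessing .shape waits for the decode to finish
```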
I see. Here are my thoughts:
[ ] High priority: pack separate images into a record file and modify data_iter to take images from an internal ImageRecordIter; HDD users could benefit a lot from this (see the packing sketch after this list).
[ ] Replace the cv2/numpy data augmentation with mx.image functions
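For the first item, the packing side could look roughly like this (a minimal sketch with mx.recordio; the image directory and the single dummy label are placeholders, and the real iterator would pack the full box annotations instead):

```python
import os
import cv2
import mxnet as mx

image_dir = 'data/images'  # placeholder directory of loose image files
record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'w')

for idx, fname in enumerate(sorted(os.listdir(image_dir))):
    img = cv2.imread(os.path.join(image_dir, fname))                 # BGR, HWC
    header = mx.recordio.IRHeader(flag=0, label=0.0, id=idx, id2=0)  # dummy label
    packed = mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg')
    record.write_idx(idx, packed)

record.close()
```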
@zhreshold Looks good. About 1: you don't need ImageRecordIter to read from .rec files; mx.recordio provides parsers for them.
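E.g. a minimal sketch of reading records back without ImageRecordIter (assuming the train.idx/train.rec pair from the packing sketch above):

```python
import mxnet as mx

record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'r')

item = record.read_idx(0)                    # raw packed record number 0
header, img = mx.recordio.unpack_img(item)   # img is a numpy array, BGR, HWC
print(header.label, img.shape)

record.close()
```

mx.recordio.unpack(item) returns the raw image bytes instead, which can go straight into mx.image.imdecode.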
@gerhardneuhold What GPU are you using? I'm getting 29 samples/sec on a single 1080.
@zhreshold BTW, my training log looks like this. Do you know why Train-Acc is stuck at 0.750000?
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-Acc=0.749015
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-ObjectAcc=0.002752
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-SmoothL1=76.203662
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-Acc=0.749785
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-ObjectAcc=0.000429
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-SmoothL1=87.126371
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-SmoothL1=76.761614
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-SmoothL1=82.783834
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-SmoothL1=79.285356
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-SmoothL1=77.857881
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-SmoothL1=77.706379
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-SmoothL1=78.605439
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-SmoothL1=78.759398
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-SmoothL1=80.474814
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-SmoothL1=81.232420
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-SmoothL1=80.704420
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-SmoothL1=75.355073
@piiswrong For this test I used K80s. On a single Pascal-based Titan X I get 38 samples/sec.
@piiswrong That is because the ratio of negative to positive samples is manually set to 3:1, so a classifier that labels every anchor as background already reaches 3/(3+1) = 75% accuracy. When training starts, the negative samples are much easier to fit, so the accuracy gets stuck at 75% for a while and then slowly climbs once the network has learned the positive samples.
Hi @zhreshold, I replaced vgg16_reduced with SqueezeNet and trained this SSD on VOC2007, and it got stuck at 0.75 too. I waited for a long time, but after 80 epochs it is still stuck at 0.75... Does this mean there is something wrong with my network structure? Could you give me some suggestions? Thank you!
@PhanTask There are many reasons why a network doesn't converge at all, especially for a more complex network such as SSD.
I would suggest the following steps to check:
Enable the monitor and observe the outputs/weights/gradients; that could give you more insight into the network and the training status (see the sketch below). Good luck.
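For example (a minimal sketch; the stat function is just one choice, and you would adapt this to however your training script builds its Module):

```python
import mxnet as mx

# One scalar per array: its L2 norm divided by the number of elements.
def norm_stat(arr):
    return mx.nd.norm(arr) / arr.size

# Collect the statistic for every matched array every 100 batches;
# `pattern` is a regex selecting which outputs/weights/gradients to watch.
mon = mx.mon.Monitor(interval=100, stat_func=norm_stat, pattern='.*')

# With the Module API, pass it to the training loop as mod.fit(..., monitor=mon);
# otherwise call mon.install(executor), mon.tic() and mon.toc_print() yourself.
```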
Please report if you have good results using squeezenet, thanks very much.
@zhreshold Thank you very much for your suggestions!
BTW, after the training gets stuck, SmoothL1 looks pretty random (it still has a value, always in the 50-60 range, with learning rate 0.0001). When I set a larger learning rate (e.g. 0.01), the training also gets stuck, but SmoothL1 becomes 'nan'.
I will follow your suggestions and try to find out why it doesn't work. If I get any good results I will share them here.
@zhreshold Could you confirm that SSD converges on the nnvm branch?
If you don't have free GPUs, we can open a P2 instance for you.
@piiswrong I could test the convergence very quickly, but a P2 would be very helpful because I do need an environment to test multi-GPU performance.
@mli Could you open a P2 for him?
@zhreshold can you send your ssh key to me?
@mli I've sent an email to your gmail, thanks!
I have two GPU cards. A single card produces about 7.85 samples/sec, while two cards give only 6.94 samples/sec. However, I cannot conclude that I/O is the bottleneck; rather, it looks like a single-thread issue.
I used iostat (from the sysstat package) to monitor disk utilization; it stays around 0 most of the time, so I believe the data is already loaded into the memory cache.
However, one CPU core is stuck at 100%, and I bet this is the root cause: we only have a single thread in the Python code doing data augmentation.
@howard0su You are correct. Rewriting that part is on my schedule, but maybe weeks away.
I can do that if you brief me on your ideas.
@howard0su First step: rewrite datasets/iterator.py by replacing the cv2 and numpy operations with mx.image functions, so we can bypass the GIL (rough sketch below).
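Roughly along these lines for the classification-style part of the augmentation (a rough sketch; the file path, the 300x300 size, and the mean values are placeholders, and the box-aware cropping that SSD needs is left out):

```python
import random
import mxnet as mx

with open('data/example.jpg', 'rb') as f:  # placeholder path
    buf = f.read()

img = mx.image.imdecode(buf)                      # NDArray, HWC, RGB
img = mx.image.resize_short(img, 300)             # resize the shorter side to 300
img, _ = mx.image.random_crop(img, (300, 300))    # random 300x300 crop
if random.random() < 0.5:                         # random horizontal flip
    img = mx.nd.flip(img, axis=1)
mean = mx.nd.array([123.0, 117.0, 104.0]).reshape((1, 1, 3))
img = mx.image.color_normalize(img.astype('float32'), mean)
img = mx.nd.transpose(img, axes=(2, 0, 1))        # HWC -> CHW for the network
```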
Got it. Let me first convert to mx.image.
In the meantime, I am thinking about a new design of the data augmenter for detection; please see issue #4366.
@piiswrong I also get the same output as you when using multiple GPUs after applying the change in #4375. I am wondering if you have already fixed it:
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-Acc=0.471971
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-ObjectAcc=0.024562
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-SmoothL1=nan
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-SmoothL1=nan
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-SmoothL1=nan
Hi, I also encountered the "SmoothL1=nan" problem.
I've tried running:
python train.py
and:
python train.py --gpus 0,1
The first one works fine but the second one gets "SmoothL1=nan".
Shouldn't the two commands return the same results, assuming the batch size, learning rate, and all other parameters are the same?
I tried SqueezeNet + SSD and the mAP is very low... I will keep trying...
I too have the SmoothL1=nan issue that @assafmus described. I would really like to use multiple GPUs for training here.
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!