Training an SSD model with multiple GPUs does not scale well when more than two GPUs are used.
Here is the approximate throughput (samples per second) with an increasing number of GPUs and batch sizes:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 32 | 11.4 samples/sec |
| 2 | 64 | 18.6 samples/sec |
| 4 | 128 | 20.9 samples/sec |
| 8 | 256 | 22.3 samples/sec |
This issue can be reproduced by training an SSD model as described here [1].
@zhreshold - can you take a look here?
I guess CPU/disk I/O may be the limiting factor when the data is loaded into the network.
I suggest using smaller batch sizes to verify, e.g. batch size 8 for 1 GPU, 16 for 2 GPUs, etc.
BTW, in my experience with the SSD example, very large batch sizes didn't beat smaller ones and were sometimes even worse.
@zhreshold Could you take a look at mx.image in the nnvm branch? It's a fast image preprocessing pipeline.
Isn't mx.image a wrapper for cv2?
I guess cv2 and numpy operations are fast enough for now; the priority is to pack single images into a record file and feed the iterator with ImageRecordIter (see the sketch below).
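Roughly what the iterator setup would look like once the images are packed (a minimal sketch; the path, the 300x300 data shape, and the batch size are placeholders, not the example's actual settings):

```python
import mxnet as mx

# Minimal sketch, assuming the images were already packed into train.rec;
# the path, the 300x300 data shape and the batch size are placeholders.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='data/train.rec',  # packed record file
    data_shape=(3, 300, 300),      # channels, height, width
    batch_size=32,
    shuffle=True,
    preprocess_threads=4,          # JPEG decoding runs in 4 C++ threads, no GIL
)

batch = train_iter.next()          # batch.data[0] is an NDArray of shape (32, 3, 300, 300)
```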
I hope someone can confirm this behavior before I do anything about it; I'm too busy these days.
With smaller batch sizes I get the following throughput:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.9 samples/sec |
| 2 | 16 | 14.0 samples/sec |
| 4 | 32 | 21.0 samples/sec |
| 8 | 64 | 21.5 samples/sec |
With file I/O and data augmentation disabled, I get the following throughput:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.8 samples/sec |
| 1 | 32 | 11.4 samples/sec |
| 1 | 64 | 11.5 samples/sec |
| 2 | 16 | 14.5 samples/sec |
| 2 | 64 | 19.4 samples/sec |
| 2 | 128 | 20.7 samples/sec |
| 4 | 32 | 25.3 samples/sec |
| 4 | 128 | 38.5 samples/sec |
| 4 | 256 | 40.9 samples/sec |
| 8 | 64 | 38.9 samples/sec |
| 8 | 256 | 68.3 samples/sec |
| 8 | 512 | 74.5 samples/sec |
As you said, @zhreshold, the throughput is much better without file I/O and data augmentation. Is there anything else you would optimize to further speed up multi-GPU training?
@gerhardneuhold Could you disable only the data augmentation in config/config.py:
cfg.TRAIN.RAND_SAMPLERS = []     # no random crop/pad samplers
cfg.TRAIN.RAND_MIRROR = False    # no random horizontal mirroring
cfg.TRAIN.INIT_SHUFFLE = False   # do not shuffle the training list at start
cfg.TRAIN.EPOCH_SHUFFLE = False  # do not shuffle the training list after each epoch
Then I could get a sense of how I/O is affecting the throughput. Much appreciated.
BTW, this is trained on an HDD instance, right?
Sure @zhreshold, here are some throughput rates with only data augmentation disabled:
| No. of GPUs | Batch Size | Speed |
| :---: | :---: | :---: |
| 1 | 8 | 9.9 samples/sec |
| 1 | 32 | 11.4 samples/sec |
| 2 | 16 | 14.2 samples/sec |
| 2 | 64 | 18.2 samples/sec |
| 4 | 32 | 25.5 samples/sec |
| 4 | 128 | 27.7 samples/sec |
| 8 | 64 | 28.6 samples/sec |
| 8 | 256 | 29.5 samples/sec |
Data is read from SSD (EBS io1).
@zhreshold It's a wrapper, but it bypasses the GIL by pushing image decoding to the engine. It also uses mx.nd instead of numpy for computation, so it's much faster when you increase NUM_CPU_WORKER.
ImageRecordIter is fast not because it reads from a single file (that helps a little, but not that much); it's because it uses 4 threads for decoding by default.
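To illustrate the difference, a minimal sketch of the engine-backed decode (the file path is a placeholder):

```python
import mxnet as mx

with open('data/example.jpg', 'rb') as f:  # placeholder path
    buf = f.read()

# The decode is dispatched to the backend engine as an operator, so the
# Python thread does not hold the GIL while the JPEG is being decoded.
img = mx.image.imdecode(buf)   # NDArray, HWC layout, RGB
print(img.shape)               # accessing .shape waits for the decode to finish
```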
I see. Here are my thoughts:
[ ] High priority: pack separate images into a record file and modify data_iter to take images from an internal ImageRecordIter; HDD users could benefit a lot from this (see the packing sketch after this list).
[ ] Replace the cv2/numpy data augmentation with mx.image functions
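For the first item, the packing side could look roughly like this (a minimal sketch with mx.recordio; the image directory and the single dummy label are placeholders, and the real iterator would pack the full box annotations instead):

```python
import os
import cv2
import mxnet as mx

image_dir = 'data/images'  # placeholder directory of loose image files
record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'w')

for idx, fname in enumerate(sorted(os.listdir(image_dir))):
    img = cv2.imread(os.path.join(image_dir, fname))                 # BGR, HWC
    header = mx.recordio.IRHeader(flag=0, label=0.0, id=idx, id2=0)  # dummy label
    packed = mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg')
    record.write_idx(idx, packed)

record.close()
```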
@zhreshold Looks good. About 1: you don't need ImageRecordIter to read from .rec files; mx.recordio provides parsers for them.
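E.g. a minimal sketch of reading records back without ImageRecordIter (assuming the train.idx/train.rec pair from the packing sketch above):

```python
import mxnet as mx

record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'r')

item = record.read_idx(0)                    # raw packed record number 0
header, img = mx.recordio.unpack_img(item)   # img is a numpy array, BGR, HWC
print(header.label, img.shape)

record.close()
```

mx.recordio.unpack(item) returns the raw image bytes instead, which can go straight into mx.image.imdecode.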
@gerhardneuhold What GPU are you using? I'm getting 29 samples/sec on a single 1080.
@zhreshold BTW, my training log looks like this. Do you know why Train-Acc is stuck at 0.750000?
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-Acc=0.749015
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-ObjectAcc=0.002752
INFO:root:Epoch[0] Batch [380] Speed: 29.19 samples/sec Train-SmoothL1=76.203662
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-Acc=0.749785
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-ObjectAcc=0.000429
INFO:root:Epoch[0] Batch [400] Speed: 28.91 samples/sec Train-SmoothL1=87.126371
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [420] Speed: 29.19 samples/sec Train-SmoothL1=76.761614
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [440] Speed: 29.20 samples/sec Train-SmoothL1=82.783834
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [460] Speed: 29.22 samples/sec Train-SmoothL1=79.285356
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [480] Speed: 29.17 samples/sec Train-SmoothL1=77.857881
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [500] Speed: 28.83 samples/sec Train-SmoothL1=77.706379
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [520] Speed: 29.24 samples/sec Train-SmoothL1=78.605439
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [540] Speed: 28.46 samples/sec Train-SmoothL1=78.759398
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [560] Speed: 29.21 samples/sec Train-SmoothL1=80.474814
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [580] Speed: 29.19 samples/sec Train-SmoothL1=81.232420
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [600] Speed: 29.20 samples/sec Train-SmoothL1=80.704420
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [620] Speed: 29.23 samples/sec Train-SmoothL1=75.355073
@piiswrong For this test I used K80s. On a single Pascal-based Titan X I get 38 samples/sec.
@piiswrong That is because the ratio of negative to positive samples is manually set to 3:1, so a classifier that labels every anchor as background already reaches 3/(3+1) = 75% accuracy. When training starts, the negative samples are much easier to fit, so the accuracy gets stuck at 75% for a while and then slowly climbs once the network has learned the positive samples.
Hi @zhreshold, I replaced vgg16_reduced with SqueezeNet and trained this SSD on VOC2007, and it got stuck at 0.75 too. I waited for a long time, but after 80 epochs it is still stuck at 0.75... Does this mean there is something wrong with my network structure? Could you give me some suggestions? Thank you!
@PhanTask There are many reasons why a network doesn't converge at all, especially for a more complex network such as SSD.
I would suggest the following steps to check:
Enable the monitor and observe the outputs/weights/gradients; that could give you more insight into the network and the training status (see the sketch below). Good luck.
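For example (a minimal sketch; the stat function is just one choice, and you would adapt this to however your training script builds its Module):

```python
import mxnet as mx

# One scalar per array: its L2 norm divided by the number of elements.
def norm_stat(arr):
    return mx.nd.norm(arr) / arr.size

# Collect the statistic for every matched array every 100 batches;
# `pattern` is a regex selecting which outputs/weights/gradients to watch.
mon = mx.mon.Monitor(interval=100, stat_func=norm_stat, pattern='.*')

# With the Module API, pass it to the training loop as mod.fit(..., monitor=mon);
# otherwise call mon.install(executor), mon.tic() and mon.toc_print() yourself.
```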
Please report if you have good results using squeezenet, thanks very much.
@zhreshold Thank you very much for your suggestions!
BTW, after the training gets stuck, SmoothL1 looks pretty random (it still has a value, always in the 50-60 range, with learning rate 0.0001). When I set a larger learning rate (e.g. 0.01), the training also gets stuck, but SmoothL1 becomes 'nan'.
I will follow your suggestions and try to find out why it doesn't work. If I get any good results I will share them here.
@zhreshold Could you confirm that SSD converges on the nnvm branch?
If you don't have free GPUs, we can open a P2 instance for you.
@piiswrong I could test the convergence very quickly, but a P2 would be very helpful because I do need an environment to test multi-GPU performance.
@mli Could you open a P2 for him?
@zhreshold can you send your ssh key to me?
@mli I've sent an email to your gmail, thanks!
I have two GPU cards. A single card produces about 7.85 samples/sec, while two cards give only 6.94 samples/sec. However, I cannot conclude that I/O is the bottleneck; rather, it looks like a single-thread issue.
I used iostat (from the sysstat package) to monitor disk utilization; it stays around 0 most of the time, so I believe the data is already loaded into the memory cache.
However, one CPU core is stuck at 100%, and I bet this is the root cause: we only have a single thread in the Python code doing data augmentation.
@howard0su You are correct. Rewriting that part is on my schedule, but maybe weeks away.
I can do that if you brief me on your ideas.
@howard0su First step: rewrite datasets/iterator.py by replacing the cv2 and numpy operations with mx.image functions, so we can bypass the GIL (rough sketch below).
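Roughly along these lines for the classification-style part of the augmentation (a rough sketch; the file path, the 300x300 size, and the mean values are placeholders, and the box-aware cropping that SSD needs is left out):

```python
import random
import mxnet as mx

with open('data/example.jpg', 'rb') as f:  # placeholder path
    buf = f.read()

img = mx.image.imdecode(buf)                      # NDArray, HWC, RGB
img = mx.image.resize_short(img, 300)             # resize the shorter side to 300
img, _ = mx.image.random_crop(img, (300, 300))    # random 300x300 crop
if random.random() < 0.5:                         # random horizontal flip
    img = mx.nd.flip(img, axis=1)
mean = mx.nd.array([123.0, 117.0, 104.0]).reshape((1, 1, 3))
img = mx.image.color_normalize(img.astype('float32'), mean)
img = mx.nd.transpose(img, axes=(2, 0, 1))        # HWC -> CHW for the network
```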
Got it. Let me first convert to mx.image.
In the meantime, I am thinking about a new design of the data augmenter for detection; please see issue #4366.
@piiswrong I also get the same output as you when using multiple GPUs after applying the change in #4375. I am wondering if you have already fixed it:
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-Acc=0.471971
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-ObjectAcc=0.024562
INFO:root:Epoch[0] Batch [20] Speed: 22.77 samples/sec Train-SmoothL1=nan
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 21.83 samples/sec Train-SmoothL1=nan
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-Acc=0.750000
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-ObjectAcc=0.000000
INFO:root:Epoch[0] Batch [60] Speed: 23.59 samples/sec Train-SmoothL1=nan
Hi, I also encountered the "SmoothL1=nan" problem.
I've tried running:
python train.py
and:
python train.py --gpus 0,1
The first one works fine but the second one gets "SmoothL1=nan".
Shouldn't the two commands return the same results, assuming the batch size, learning rate, and all other parameters are the same?
I tried SqueezeNet + SSD and the mAP is very low... I will keep trying...
I too have the SmoothL1=nan issue that @assafmus described. I would really like to use multiple GPUs for training here.
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!