Incubator-mxnet: Distributed training is slow

Created on 14 Aug 2017 · 18 comments · Source: apache/incubator-mxnet

Environment info

Operating System: Ubuntu 16.04

Compiler: gcc 5.4

Package used (Python/R/Scala/Julia): Python

MXNet version: latest code

Or if installed from source: installed from source

MXNet commit hash (git rev-parse HEAD): 1a3faa

If you are using python package, please provide

Python version and distribution: Python 2.7.13 :: Anaconda custom (64-bit)

I tried to train an image classification model using two servers with InfiniBand cards, but the speed is a little disappointing, barely better than using one server. I used the code in example/image-classification.

When training on one server, the training command is

python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store device --network inception-v3 --batch-size 256   --image-shape 3,299,299

The speed is

INFO:root:start with arguments Namespace(batch_size=256, benchmark=1, data_nthreads=4, data_train=None, data_val=None, disp_batches=20, dtype='float32', gpus='0,1,2,3,4,5,6,7', image_shape='3,299,299', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='inception-v3', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[22:35:19] src/operator/././cudnn_algoreg-inl.h:112: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:35:40] src/kvstore/././comm.h:327: only 24 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[22:35:40] src/kvstore/././comm.h:336: .vvv....
[22:35:40] src/kvstore/././comm.h:336: v.vv....
[22:35:40] src/kvstore/././comm.h:336: vv.v....
[22:35:40] src/kvstore/././comm.h:336: vvv.....
[22:35:40] src/kvstore/././comm.h:336: .....vvv
[22:35:40] src/kvstore/././comm.h:336: ....v.vv
[22:35:40] src/kvstore/././comm.h:336: ....vv.v
[22:35:40] src/kvstore/././comm.h:336: ....vvv.
INFO:root:Epoch[0] Batch [20]   Speed: 1065.93 samples/sec      accuracy=0.165365
INFO:root:Epoch[0] Batch [40]   Speed: 1033.22 samples/sec      accuracy=0.989648
INFO:root:Epoch[0] Batch [60]   Speed: 1029.90 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 1029.80 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 1028.05 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 1019.75 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 1025.79 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 1027.82 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 1021.11 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 1025.14 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 1017.72 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [240]  Speed: 1021.09 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [260]  Speed: 1024.25 samples/sec      accuracy=1.000000

When training with 2 servers, the command is

 python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store dist_sync --network inception-v3 --num-layers 50 --batch-size 256 --sync-dst-dir /tmp/mxnet  --image-shape 3,299,299
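For reference, the hosts file passed via -H to the ssh launcher is just a plain-text list of the machines, one per line (placeholder addresses shown here, not my actual ones):

192.168.0.1
192.168.0.2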

And the speed is

INFO:root:Epoch[0] Batch [20]   Speed: 609.31 samples/sec       accuracy=0.056920
INFO:root:Epoch[0] Batch [20]   Speed: 610.12 samples/sec       accuracy=0.050967
INFO:root:Epoch[0] Batch [40]   Speed: 608.68 samples/sec       accuracy=0.854883
INFO:root:Epoch[0] Batch [40]   Speed: 608.19 samples/sec       accuracy=0.868164
INFO:root:Epoch[0] Batch [60]   Speed: 602.48 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [60]   Speed: 603.86 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.11 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.87 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 607.30 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 606.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 604.53 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 602.63 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 601.27 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 603.67 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 603.64 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 602.81 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.20 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.40 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.21 samples/sec       accuracy=1.000000

It seems that distributed training with 2 servers is only a little better than standalone training (2 x ~600 samples/sec vs ~1000 samples/sec).
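Put as a quick back-of-the-envelope calculation (numbers taken from the logs above):

per_worker = 605.0                             # approx. samples/sec reported by each of the 2 workers
single_machine = 1025.0                        # approx. samples/sec on one 8-GPU machine
aggregate = 2 * per_worker                     # ~1210 samples/sec in total
efficiency = aggregate / (2 * single_machine)
print(aggregate, efficiency)                   # ~1210 samples/sec, ~0.59 scaling efficiency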

I tested the IB bandwidth with iperf and got 24.0 Gbit/s with a single thread, so I don't think the IB bandwidth is the bottleneck.

Can anyone give any suggestions about distributed training with MXNet?

Most helpful comment

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

All 18 comments

Your network seems too small. Is it MNIST? It doesn't make sense to train MNIST on multiple machines.

@piiswrong, I am training inception-v3 on the ImageNet dataset, so it is not a small network.

How does it look for dist_sync_device? Or even dist_async?

@szha, I have tried dist_sync_device and got almost the same result. dist_async uses asynchronous SGD, so I don't think the results are comparable.

Actually, in "sync" mode the speed is decided by the slowest machine; you can use "async" mode instead. "async" is an approximation of plain SGD, but it can still give usable results.
Also, 2 server processes is not a good choice. You can try launching more server processes on each machine, such as 6; it speeds things up noticeably when the parameters are large.

Mind sharing a bit more about the machines (e.g. GPU types, homogeneous or heterogeneous hardware, locality, NVLink, disk I/O speed, etc.)?

@szha, every server has 8 Titan Xp GPUs and 2 Intel Xeon E5-2650 v2 CPUs @ 2.60 GHz.
The two servers are connected with IB cards.
The test uses --benchmark 1, so there is no disk I/O.

@starimpact, I have tried using 4 servers per machine and got almost the same result.

I am using mxnet 0.8.0, HAHAHA...
I noticed that your "one server" run is actually "local", because kvstore=device means the kvstore updates the parameters on the GPUs.
Your "two server" run is the real distributed mode; in the "dist_*" modes the kvstore updates the parameters on the CPU.
So the drop in speed is normal.

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

Following @solin319's suggestion, I changed the code and now the speed looks normal. Thanks @solin319.

@piiswrong @mli, is that a bug in mxnet?

@leoxiaobin the speed-up can be attributed to the deletion of the barrier before the send in kvstore, so it's unlikely to be a symptom of a bug. If you have to stay in synchronous land, further increasing the batch size could help. Since switching to dist_sync_device doesn't help the speed, I guess the locality assumption for the GPUs doesn't apply and the bandwidth among the GPUs may even be less than from your CPU to the other machine (which goes through IB). In that case the bottleneck is GPU communication, IB won't help much, and you need a higher ratio of compute to communication.

@szha That is not entirely true. I agree with @solin319 that this WaitToRead should not be necessary (the actual communication is done in the lambda pushed to the engine, which has send_buf as a read dependency, so it will wait for it to be ready). What is more, this barrier delays scheduling the other copies from GPU to CPU for subsequent communications, thus limiting scaling.
The PR introducing that line mentions crashes when using kvstore in imperative mode. I'm not really familiar with how much the imperative path differs from the symbolic one as far as the engine is concerned, but I don't think it should differ so much that the dependencies stop working. This is definitely a bug.
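To make the dependency argument concrete, here is a small sketch using the imperative NDArray API (not the kvstore code itself): work pushed to the engine with an array as a read dependency is already ordered correctly, while an explicit wait blocks the calling thread.

import mxnet as mx

a = mx.nd.ones((1024, 1024))
b = mx.nd.dot(a, a)          # queued asynchronously on the dependency engine
# b.wait_to_read()           # an explicit barrier here would block the calling thread,
#                            # analogous to send_buf.WaitToRead() in the kvstore push path
c = b + 1                    # declaring b as an input is enough: the engine runs this
                             # only after b is ready, without blocking the caller
print(c.asnumpy().sum())     # asnumpy() is the only call that must synchronize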

Thanks, @ptrendx. @madjam for more context.

For context, that barrier was added since an operation such as:

kv.init(2, mx.nd.zeros((50, 50)))

would access memory that is not fully initialized and therefore cause a segfault.
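A minimal imperative sequence of that shape (sketched here with a 'local' kvstore; the reported crash was with a dist_* kvstore) would be:

import mxnet as mx

kv = mx.kv.create('local')             # the crash was reported with a dist_* kvstore
kv.init(2, mx.nd.zeros((50, 50)))      # NDArray ops run asynchronously, so init may touch
                                       # memory the engine has not finished writing yet
out = mx.nd.empty((50, 50))
kv.pull(2, out=out)
print(out.asnumpy().mean())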

@madjam's test case is that send_buf may not be ready when data() is called.

agree with @ptrendx that we should remove this WaitToRead. One solution is moving https://github.com/madjam/mxnet/blob/0012f7722d97238a84c33f1bee8cd2926707a7e9/src/kvstore/kvstore_dist.h#L221 into the captured function.

Can someone help contribute a PR for it?

In mxnet 0.8.0 there is no "send_buf.WaitToRead()".
Lucky for me. ^_^
https://github.com/starimpact/mxnet_v0.8.0/blob/bProxy_Weight/src/kvstore/kvstore_dist.h#L412
My mxnet supports partial parameter updates. Welcome to use it.
haha....
