Incubator-mxnet: program can't finish normally in dist_sync mode

Created on 29 Jan 2018 · 9 Comments · Source: apache/incubator-mxnet

Description

I trained resnet50 on 4 nodes using the example/image-classification/train_imagenet.py script. At the beginning it ran normally, but in the last epoch one worker finished its total number of iterations, call it n, while the other workers were still waiting for the update of iteration n+1.

The difference in the total number of iterations between workers results in this bug.

Configuration

The bash command is as follows and was run on gn2:

python -u /THL5/home/daodao/cxt/mxnet0.11/tools/launch.py -n 4 --launcher ssh -H hosts-4 "python -u /THL5/home/daodao/cxt/mxnet0.11/example/image-classification/train_imagenet.py --gpus 0,1,2,3 --kv-store dist_sync "

hosts-4 contents

gn3
gn4
gn5

configuration for training

num_examples:120000
num_epochs:80

the details of the configuration for training

INFO:root:start with arguments Namespace(batch_size=256, benchmark=0, data_nthreads=4, data_train='/THL5/home/daodao/imagenet/mxnet_cxt/train1200000.rec', data_val='/THL5/home/daodao/imagenet/mxnet_cxt/val40000.rec', disp_batches=1, dtype='float32', gpus='0,1,2,3', image_shape='3,224,224', kv_store='dist_sync', load_epoch=None, lr=0.4, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=120000, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)

Analysis

the root of the bug

The number of examples each worker receives from mx.io.ImageRecordIter's partitioning is different, so the number of iterations per epoch differs between workers, and the total number of iterations over the whole training run therefore also differs.

When the worker with the fewest total iterations, call it worker1, finishes its last iteration n, the other workers still need to train on iterations n+1, n+2, ... During the update of iteration n+1, the kvstore must receive gradients from every worker, but worker1 has already finished and will not push any more gradients. The kvstore therefore keeps waiting for worker1's push, and the program cannot finish normally.
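A small sketch of the arithmetic (the per-worker counts below are illustrative assumptions, not measured values; the real split depends on how ImageRecordIter partitions the .rec file): if the roughly 1.2M training records suggested by train1200000.rec are divided into four slightly unequal parts, ceiling division by the per-worker batch size of 256 already produces different batch counts per epoch.

```python
import math

# Illustrative per-worker example counts after ImageRecordIter's partitioning
# (hypothetical numbers; the real split depends on the .rec file layout).
batch_size = 256                                  # per-worker batch size
part_sizes = [299946, 300161, 300017, 299876]     # ~1.2M examples in total

batches_per_epoch = [math.ceil(n / batch_size) for n in part_sizes]
print(batches_per_epoch)   # -> [1172, 1173, 1172, 1172], i.e. not equal
```

Over 80 epochs these off-by-one differences accumulate, so the workers' expected total iteration counts diverge.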

Evidence

the number of iterations per epoch for each worker

| epoch | worker0 | worker1 | worker2 | worker3 |
| ----- | ------- | ------- | ------- | ------- |
| 1 | 1172 | 1173 | 1172 | 1172 |
| 2 | 1172 | 1172 | 1172 | 1172 |
| 3 | 1171 | 1172 | 1172 | 1172 |
| 4 | 1172 | 1172 | 1172 | 1172 |
| 5 | 1171 | 1172 | 1172 | 1172 |
| 6 | 1172 | 1172 | 1172 | 1171 |
| 7 | 1172 | 1173 | 1172 | 1172 |
| 8 | 1171 | 1172 | 1172 | 1172 |
| 9 | 1172 | 1172 | 1172 | 1172 |
| 10 | 1171 | 1172 | 1172 | 1172 |
| 11 | 1172 | 1172 | 1172 | 1171 |
| 12 | 1171 | 1172 | 1172 | 1172 |
| ... | ... | ... | ... | ... |

log info

INFO:root:Epoch[0] Rank[3] Batch[1171] TotalIter[1171] Speed: 187.74 samples/sec   
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[0] Batch[1171] TotalIter[1171] Speed: 187.42 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[2] Batch[1171] TotalIter[1171] Speed: 188.36 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.818
INFO:root:Epoch[0] Rank[1] Batch[1172] TotalIter[1172] Speed: 189.53 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1587.169
INFO:root:Epoch[1] Rank[3] Batch[1] TotalIter[1173] Speed: 189.76 samples/sec   
INFO:root:Epoch[1] Rank[0] Batch[1] TotalIter[1173] Speed: 187.95 samples/sec
INFO:root:Epoch[1] Rank[2] Batch[1] TotalIter[1173] Speed: 189.15 samples/sec   
INFO:root:Epoch[1] Rank[3] Batch[2] TotalIter[1174] Speed: 188.17 samples/sec   
INFO:root:Epoch[1] Rank[1] Batch[1] TotalIter[1174] Speed: 189.03 samples/sec   
INFO:root:Epoch[1] Rank[0] Batch[2] TotalIter[1174] Speed: 189.08 samples/sec   
INFO:root:Epoch[1] Rank[2] Batch[2] TotalIter[1174] Speed: 188.24 samples/sec   
......
INFO:root:Epoch[79] Rank[3] Batch[1151] TotalIter[93724] Speed: 189.38 samples/sec  
INFO:root:Epoch[79] Rank[1] Batch[1123] TotalIter[93724] Speed: 189.85 samples/sec
INFO:root:Epoch[79] Rank[2] Batch[1139] TotalIter[93724] Speed: 189.56 samples/sec  
INFO:root:Epoch[79] Rank[0] Batch[1169] TotalIter[93724] Speed: 189.05 samples/sec   
INFO:root:Epoch[79] Rank[1] Batch[1124] TotalIter[93725] Speed: 189.96 samples/sec  
INFO:root:Epoch[79] Rank[3] Batch[1152] TotalIter[93725] Speed: 188.91 samples/sec 
INFO:root:Epoch[79] Rank[2] Batch[1140] TotalIter[93725] Speed: 190.17 samples/sec   
INFO:root:Epoch[79] Rank[0] Batch[1170] TotalIter[93725] Speed: 190.06 samples/sec  
INFO:root:Epoch[79] Train-accuracy=nan
INFO:root:Epoch[79] Time cost=1582.509
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.

According to the log and the table:

  1. At the end of epoch 79, worker0 finished its 1171st update and then printed the epoch summary.
  2. The table shows that the other workers had not finished this epoch yet and were still waiting for the update of the 1172nd iteration.
  3. The TotalIter column shows that each worker had executed 93725 iterations over the whole training run, and that each iteration is strictly synchronous.

ps info and gpu usage

Before I killed the relevant processes on each node, the process info was as below:

gn2: the node running launch.py
gn2-ps (screenshot)

gn2-smi (screenshot)

gn3

gn3-ps (screenshot)

gn3-smi (screenshot)

gn4 and gn5 are the same as gn3

We can see that those processes were still in a waiting state, because GPU usage was 0%.

Suggestion

The root of this bug is the io.ImageRecordIter part, which doesn't split the whole dataset evenly and thus causes different numbers of iterations per epoch between workers. Perhaps io.ImageRecordIter prioritizes performance and efficiency over an exactly even split.

The easiest way is:

Assumptions:

num_examples: the number of examples in the whole dataset
num_workers: the number of workers
num_examples_each_worker: the number of examples assigned to each worker

Then

num_examples_each_worker = num_examples//num_workers + num_examples%num_workers

This is the easiest way to guarantee that every worker runs the same number of iterations in each epoch.
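On the user side, a related workaround (a minimal sketch, not the author's patch or the fix inside ImageRecordIter; the values below stand in for whatever train_imagenet.py actually builds) is to wrap each worker's iterator in mx.io.ResizeIter with a common, floored epoch size, so that no worker runs out of data before the others:

```python
import mxnet as mx

kv = mx.kvstore.create('dist_sync')

batch_size = 256           # per-worker batch size (from the training args)
num_examples = 1200000     # total examples in the dataset (assumed value)

# `train` would normally be the per-worker ImageRecordIter built by the
# example script; it is shown here only to make the snippet self-contained.
train = mx.io.ImageRecordIter(
    path_imgrec='train1200000.rec', data_shape=(3, 224, 224),
    batch_size=batch_size, num_parts=kv.num_workers, part_index=kv.rank)

# Floored common epoch size: no worker can run out of data before the others.
epoch_size = num_examples // (batch_size * kv.num_workers)

# ResizeIter caps each epoch at exactly `epoch_size` batches on every worker.
train = mx.io.ResizeIter(train, epoch_size)
```

Flooring may drop up to roughly num_workers * batch_size examples per epoch, but it guarantees identical per-epoch iteration counts on every worker, which is what dist_sync needs.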

Another drawback

When I set disp_batches=20 and the Rank/TotalIter info is not printed, the output is out of order. This can mislead people into thinking that training is not strictly synchronized in dist_sync mode. For example (a sketch that tags each log line with the worker rank follows this log):

INFO:root:Epoch[55] Batch [80]  Speed: 138.85 samples/sec   accuracy=0.674023
INFO:root:Epoch[55] Batch [20]  Speed: 191.34 samples/sec   accuracy=0.670945
INFO:root:Epoch[55] Batch [100] Speed: 190.75 samples/sec   accuracy=0.670312
INFO:root:Epoch[55] Batch [40]  Speed: 190.62 samples/sec   accuracy=0.664258
INFO:root:Epoch[55] Batch [40]  Speed: 190.42 samples/sec   accuracy=0.669531
INFO:root:Epoch[55] Batch [80]  Speed: 190.35 samples/sec   accuracy=0.655273
INFO:root:Epoch[55] Batch [60]  Speed: 190.31 samples/sec   accuracy=0.661133
INFO:root:Epoch[55] Batch [120] Speed: 190.26 samples/sec   accuracy=0.668945
INFO:root:Epoch[55] Batch [100] Speed: 190.23 samples/sec   accuracy=0.661523
INFO:root:Epoch[55] Batch [40]  Speed: 190.10 samples/sec   accuracy=0.671289
INFO:root:Epoch[55] Batch [120] Speed: 189.92 samples/sec   accuracy=0.670117
INFO:root:Epoch[55] Batch [60]  Speed: 189.77 samples/sec   accuracy=0.668164
INFO:root:Epoch[55] Batch [60]  Speed: 189.87 samples/sec   accuracy=0.666406
INFO:root:Epoch[55] Batch [80]  Speed: 189.79 samples/sec   accuracy=0.651758
INFO:root:Epoch[55] Batch [100] Speed: 189.63 samples/sec   accuracy=0.676758
INFO:root:Epoch[55] Batch [140] Speed: 189.74 samples/sec   accuracy=0.677344
INFO:root:Epoch[55] Batch [120] Speed: 189.76 samples/sec   accuracy=0.664258
INFO:root:Epoch[55] Batch [60]  Speed: 189.84 samples/sec   accuracy=0.678516
...............................
INFO:root:Epoch[119] Batch [220]    Speed: 189.30 samples/sec   accuracy=0.692578
INFO:root:Epoch[119] Batch [260]    Speed: 189.36 samples/sec   accuracy=0.697461
INFO:root:Epoch[119] Batch [140]    Speed: 189.32 samples/sec   accuracy=0.705664
INFO:root:Epoch[119] Batch [180]    Speed: 189.32 samples/sec   accuracy=0.696484
INFO:root:Epoch[119] Batch [160]    Speed: 189.26 samples/sec   accuracy=0.708984
INFO:root:Epoch[119] Batch [320]    Speed: 189.30 samples/sec   accuracy=0.704297
INFO:root:Epoch[119] Batch [300]    Speed: 189.24 samples/sec   accuracy=0.715625
INFO:root:Epoch[119] Batch [140]    Speed: 189.33 samples/sec   accuracy=0.692969
INFO:root:Epoch[119] Batch [240]    Speed: 189.33 samples/sec   accuracy=0.702148
INFO:root:Epoch[119] Batch [280]    Speed: 189.26 samples/sec   accuracy=0.688477
INFO:root:Epoch[119] Batch [160]    Speed: 189.38 samples/sec   accuracy=0.708789
INFO:root:Epoch[119] Batch [200]    Speed: 189.36 samples/sec   accuracy=0.706641
INFO:root:Epoch[119] Batch [180]    Speed: 189.39 samples/sec   accuracy=0.693750
INFO:root:Epoch[119] Batch [320]    Speed: 189.39 samples/sec   accuracy=0.695898
INFO:root:Epoch[119] Batch [340]    Speed: 189.38 samples/sec   accuracy=0.702930
INFO:root:Epoch[119] Batch [160]    Speed: 189.41 samples/sec   accuracy=0.694727
INFO:root:Epoch[119] Batch [260]    Speed: 189.39 samples/sec   accuracy=0.710156
INFO:root:Epoch[119] Batch [300]    Speed: 189.40 samples/sec   accuracy=0.709961
INFO:root:Epoch[119] Batch [180]    Speed: 189.43 samples/sec   accuracy=0.693555
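One way to make the interleaved output attributable (a minimal sketch; make_rank_speedometer is a hypothetical helper, not an MXNet API, and it assumes the standard mx.callback.Speedometer and a dist_sync kvstore) is to log the worker rank alongside the usual speedometer line:

```python
import logging
import mxnet as mx

# Hypothetical helper (not part of MXNet): a batch_end_callback that prefixes
# each report with the worker rank so interleaved dist_sync logs can be told apart.
def make_rank_speedometer(kv, batch_size, frequent=20):
    inner = mx.callback.Speedometer(batch_size, frequent)

    def _callback(param):
        # param is a BatchEndParam(epoch, nbatch, eval_metric, locals)
        if param.nbatch % frequent == 0:
            logging.info('Rank[%d] reporting for Epoch[%d] Batch[%d]',
                         kv.rank, param.epoch, param.nbatch)
        inner(param)   # the usual speed / accuracy line

    return _callback

# Usage sketch:
#   kv = mx.kvstore.create('dist_sync')
#   model.fit(..., kvstore=kv, batch_end_callback=make_rank_speedometer(kv, batch_size=256))
```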


All 9 comments

For distributed training, it's better to set the total number of iterations on each node to the same value. Previous versions provided an epoch-size argument, which was useful for this.

@sandeep-krishnamurthy Can you add labels

  • Distributed
  • Python
  • Example

Can you tell me the details of setting epoch-size? @feiyuvl

I passed --epoch-size to train_imagenet.py, and it reports:

train_imagenet.py: error: unrecognized arguments: --epoch-size 320

This doc says it's better to set epoch_size explicitly in dist_sync mode, but it doesn't explain how to set it.

@XiaotaoChen sorry, mxnet has removed the epoch-size argument; it only exists in older versions.

Then how can we solve the problem that the total number of iterations differs between workers in the current version? @feiyuvl

@XiaotaoChen This doc http://newdocs.readthedocs.io is outdated; it was created by a contributor a long time ago and has not been updated since.

I'll ping the internal team to check how to resolve it

@mli have there been any updates? This issue is also causing problems for me.

Hello, I request you all to look at the above PR as an example of how we can avoid this hang.

Thanks @rahul003
