I trained ResNet-50 on 4 nodes using the example/image-classification/train_imagenet.py script. At the beginning it ran normally, but in the last epoch one of the workers finished its total number of iterations, call it n, while the other workers were still waiting for the update of iteration n+1.
The difference in the total number of iterations between workers causes this bug.
The launch command is as follows and was run on gn2:
```bash
python -u /THL5/home/daodao/cxt/mxnet0.11/tools/launch.py -n 4 --launcher ssh -H hosts-4 "python -u /THL5/home/daodao/cxt/mxnet0.11/example/image-classification/train_imagenet.py --gpus 0,1,2,3 --kv-store dist_sync"
```
hosts-4 contents:
```
gn3
gn4
gn5
```
Configuration for training:
num_examples: 120000
num_epochs: 80
The full training arguments:
```
INFO:root:start with arguments Namespace(batch_size=256, benchmark=0, data_nthreads=4, data_train='/THL5/home/daodao/imagenet/mxnet_cxt/train1200000.rec', data_val='/THL5/home/daodao/imagenet/mxnet_cxt/val40000.rec', disp_batches=1, dtype='float32', gpus='0,1,2,3', image_shape='3,224,224', kv_store='dist_sync', load_epoch=None, lr=0.4, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=120000, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
```
### The root of the bug
The data partitioned by mx.io.ImageRecordIter is not split evenly across workers, so the number of iterations per epoch differs between workers, and the total number of iterations over the whole training run differs as well.
When the worker with the fewest total iterations, call it worker1, finishes its last iteration n, the other workers still need to train on iterations n+1, n+2, ... During the update of iteration n+1 the kvstore must receive gradients from every worker, but worker1 has already finished all of its iterations and will not push gradients again. The kvstore keeps waiting for worker1's push, so the program cannot finish normally.
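A quick back-of-the-envelope check, assuming the rec file holds about 1,200,000 records as the filename train1200000.rec suggests, shows why the per-worker counts differ:

```python
# Rough per-worker batch count under the assumed record total.
num_records = 1200000        # assumed from the filename train1200000.rec
num_workers = 4
batch_size  = 256            # per worker, split across its 4 GPUs

print(num_records / num_workers / batch_size)   # ~1171.9
# Because the split is not exactly even, some workers see 1171 batches per epoch
# and others 1172 or 1173 (see the table in the Evidence section). Over 80 epochs
# these one-batch differences accumulate, so the worker with the fewest total
# iterations finishes first and stops pushing gradients while the others are
# still blocked in dist_sync.
```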
### Evidence
The number of iterations per epoch for each worker:
| epoch | worker0 | worker1 | worker2 | worker3 |
| ----- | ------- | ------- | ------- | ------- |
| 1 | 1172 | 1173 | 1172 | 1172 |
| 2 | 1172 | 1172 | 1172 | 1172 |
| 3 | 1171 | 1172 | 1172 | 1172 |
| 4 | 1172 | 1172 | 1172 | 1172 |
| 5 | 1171 | 1172 | 1172 | 1172 |
| 6 | 1172 | 1172 | 1172 | 1171 |
| 7 | 1172 | 1173 | 1172 | 1172 |
| 8 | 1171 | 1172 | 1172 | 1172 |
| 9 | 1172 | 1172 | 1172 | 1172 |
| 10 | 1171 | 1172 | 1172 | 1172 |
| 11 | 1172 | 1172 | 1172 | 1171 |
| 12 | 1171 | 1172 | 1172 | 1172 |
| ... | ... | ... | ... | ... |
Log excerpt:
```
INFO:root:Epoch[0] Rank[3] Batch[1171] TotalIter[1171] Speed: 187.74 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[0] Batch[1171] TotalIter[1171] Speed: 187.42 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[2] Batch[1171] TotalIter[1171] Speed: 188.36 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.818
INFO:root:Epoch[0] Rank[1] Batch[1172] TotalIter[1172] Speed: 189.53 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1587.169
INFO:root:Epoch[1] Rank[3] Batch[1] TotalIter[1173] Speed: 189.76 samples/sec
INFO:root:Epoch[1] Rank[0] Batch[1] TotalIter[1173] Speed: 187.95 samples/sec
INFO:root:Epoch[1] Rank[2] Batch[1] TotalIter[1173] Speed: 189.15 samples/sec
INFO:root:Epoch[1] Rank[3] Batch[2] TotalIter[1174] Speed: 188.17 samples/sec
INFO:root:Epoch[1] Rank[1] Batch[1] TotalIter[1174] Speed: 189.03 samples/sec
INFO:root:Epoch[1] Rank[0] Batch[2] TotalIter[1174] Speed: 189.08 samples/sec
INFO:root:Epoch[1] Rank[2] Batch[2] TotalIter[1174] Speed: 188.24 samples/sec
......
INFO:root:Epoch[79] Rank[3] Batch[1151] TotalIter[93724] Speed: 189.38 samples/sec
INFO:root:Epoch[79] Rank[1] Batch[1123] TotalIter[93724] Speed: 189.85 samples/sec
INFO:root:Epoch[79] Rank[2] Batch[1139] TotalIter[93724] Speed: 189.56 samples/sec
INFO:root:Epoch[79] Rank[0] Batch[1169] TotalIter[93724] Speed: 189.05 samples/sec
INFO:root:Epoch[79] Rank[1] Batch[1124] TotalIter[93725] Speed: 189.96 samples/sec
INFO:root:Epoch[79] Rank[3] Batch[1152] TotalIter[93725] Speed: 188.91 samples/sec
INFO:root:Epoch[79] Rank[2] Batch[1140] TotalIter[93725] Speed: 190.17 samples/sec
INFO:root:Epoch[79] Rank[0] Batch[1170] TotalIter[93725] Speed: 190.06 samples/sec
INFO:root:Epoch[79] Train-accuracy=nan
INFO:root:Epoch[79] Time cost=1582.509
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
Killed by signal 15.
```
The log and the table above show the per-epoch iteration counts diverging between workers.

ps info and GPU usage

Before I killed the related processes on each node, their status was as follows:
gn2 (the node running launch.py): (screenshot of ps and GPU usage output)

gn3: (screenshot of ps and GPU usage output)

gn4 and gn5 look the same as gn3.
These screenshots show that the processes are still in a waiting state: GPU utilization is 0%.
The root of this bug is that io.ImageRecordIter does not split the whole dataset evenly across workers, which leads to different numbers of iterations per epoch between workers. Presumably io.ImageRecordIter favors performance and efficiency over an exactly even split.
The easiest fix is as follows.
Assumptions:

- num_examples: the number of examples in the whole dataset
- num_workers: the number of workers
- num_examples_each_worker: the number of examples assigned to each worker

Then

num_examples_each_worker = num_examples // num_workers + num_examples % num_workers

Giving every worker the same number of examples is the easiest way to guarantee that all workers run the same number of iterations in each epoch.
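Below is a minimal sketch (not an official fix) of how this can be enforced in user code today, assuming MXNet's mx.io.ResizeIter wrapper; the path, shapes, and record count are placeholders. It simply drops the remainder so that every worker runs exactly the same number of batches per epoch, which is a slightly different convention from the formula above but has the same equalizing effect.

```python
import mxnet as mx

# Placeholder values; adjust to the real dataset and launch configuration.
batch_size = 256
num_examples = 1200000                      # assumed total record count in the .rec file

kv = mx.kvstore.create('dist_sync')

train = mx.io.ImageRecordIter(
    path_imgrec='train.rec',                # placeholder path
    data_shape=(3, 224, 224),
    batch_size=batch_size,
    shuffle=True,
    num_parts=kv.num_workers,               # shard the records across workers
    part_index=kv.rank)

# Force every worker to run exactly the same number of batches per epoch by
# truncating its shard to floor(num_examples / num_workers / batch_size) batches.
batches_per_epoch = num_examples // kv.num_workers // batch_size
train = mx.io.ResizeIter(train, batches_per_epoch)
```

With the iterator resized this way, every worker reaches the end of the epoch after the same number of pushes, so no worker is left waiting on a gradient that never arrives.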
When I set disp_batches=20 and did not print the Rank and TotalIter info, the output was out of order. This can mislead people into thinking that dist_sync mode is not strictly synchronized. An example follows; a sketch of a rank-aware logging callback that avoids this confusion appears after the log excerpt below.
```
INFO:root:Epoch[55] Batch [80] Speed: 138.85 samples/sec accuracy=0.674023
INFO:root:Epoch[55] Batch [20] Speed: 191.34 samples/sec accuracy=0.670945
INFO:root:Epoch[55] Batch [100] Speed: 190.75 samples/sec accuracy=0.670312
INFO:root:Epoch[55] Batch [40] Speed: 190.62 samples/sec accuracy=0.664258
INFO:root:Epoch[55] Batch [40] Speed: 190.42 samples/sec accuracy=0.669531
INFO:root:Epoch[55] Batch [80] Speed: 190.35 samples/sec accuracy=0.655273
INFO:root:Epoch[55] Batch [60] Speed: 190.31 samples/sec accuracy=0.661133
INFO:root:Epoch[55] Batch [120] Speed: 190.26 samples/sec accuracy=0.668945
INFO:root:Epoch[55] Batch [100] Speed: 190.23 samples/sec accuracy=0.661523
INFO:root:Epoch[55] Batch [40] Speed: 190.10 samples/sec accuracy=0.671289
INFO:root:Epoch[55] Batch [120] Speed: 189.92 samples/sec accuracy=0.670117
INFO:root:Epoch[55] Batch [60] Speed: 189.77 samples/sec accuracy=0.668164
INFO:root:Epoch[55] Batch [60] Speed: 189.87 samples/sec accuracy=0.666406
INFO:root:Epoch[55] Batch [80] Speed: 189.79 samples/sec accuracy=0.651758
INFO:root:Epoch[55] Batch [100] Speed: 189.63 samples/sec accuracy=0.676758
INFO:root:Epoch[55] Batch [140] Speed: 189.74 samples/sec accuracy=0.677344
INFO:root:Epoch[55] Batch [120] Speed: 189.76 samples/sec accuracy=0.664258
INFO:root:Epoch[55] Batch [60] Speed: 189.84 samples/sec accuracy=0.678516
...............................
INFO:root:Epoch[119] Batch [220] Speed: 189.30 samples/sec accuracy=0.692578
INFO:root:Epoch[119] Batch [260] Speed: 189.36 samples/sec accuracy=0.697461
INFO:root:Epoch[119] Batch [140] Speed: 189.32 samples/sec accuracy=0.705664
INFO:root:Epoch[119] Batch [180] Speed: 189.32 samples/sec accuracy=0.696484
INFO:root:Epoch[119] Batch [160] Speed: 189.26 samples/sec accuracy=0.708984
INFO:root:Epoch[119] Batch [320] Speed: 189.30 samples/sec accuracy=0.704297
INFO:root:Epoch[119] Batch [300] Speed: 189.24 samples/sec accuracy=0.715625
INFO:root:Epoch[119] Batch [140] Speed: 189.33 samples/sec accuracy=0.692969
INFO:root:Epoch[119] Batch [240] Speed: 189.33 samples/sec accuracy=0.702148
INFO:root:Epoch[119] Batch [280] Speed: 189.26 samples/sec accuracy=0.688477
INFO:root:Epoch[119] Batch [160] Speed: 189.38 samples/sec accuracy=0.708789
INFO:root:Epoch[119] Batch [200] Speed: 189.36 samples/sec accuracy=0.706641
INFO:root:Epoch[119] Batch [180] Speed: 189.39 samples/sec accuracy=0.693750
INFO:root:Epoch[119] Batch [320] Speed: 189.39 samples/sec accuracy=0.695898
INFO:root:Epoch[119] Batch [340] Speed: 189.38 samples/sec accuracy=0.702930
INFO:root:Epoch[119] Batch [160] Speed: 189.41 samples/sec accuracy=0.694727
INFO:root:Epoch[119] Batch [260] Speed: 189.39 samples/sec accuracy=0.710156
INFO:root:Epoch[119] Batch [300] Speed: 189.40 samples/sec accuracy=0.709961
INFO:root:Epoch[119] Batch [180] Speed: 189.43 samples/sec accuracy=0.693555
```
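One way to make interleaved worker output readable, similar to the Rank/TotalIter lines earlier in this report, is a custom batch-end callback that tags every line with the worker rank and a running global iteration counter. The sketch below is only an illustration: it assumes the standard fields of mx.model.BatchEndParam (epoch, nbatch, eval_metric) and a dist_sync kvstore handle kv; the name rank_speedometer is made up.

```python
import time
import logging

def rank_speedometer(kv, batch_size, frequent=20):
    """Batch-end callback that prefixes each log line with the worker rank and
    a global iteration counter, so dist_sync workers can be told apart."""
    state = {'total': 0, 'tic': time.time()}

    def _callback(param):
        state['total'] += 1
        if state['total'] % frequent != 0:
            return
        speed = frequent * batch_size / (time.time() - state['tic'])
        state['tic'] = time.time()
        metrics = ''
        if param.eval_metric is not None:
            metrics = ' '.join('%s=%f' % nv for nv in param.eval_metric.get_name_value())
        logging.info('Epoch[%d] Rank[%d] Batch[%d] TotalIter[%d] Speed: %.2f samples/sec %s',
                     param.epoch, kv.rank, param.nbatch, state['total'], speed, metrics)

    return _callback

# usage (hypothetical):
# kv = mx.kvstore.create('dist_sync')
# mod.fit(..., kvstore=kv, batch_end_callback=rank_speedometer(kv, batch_size=256))
```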
For distributed training it is better to make the total number of iterations the same on every node. The previous model interface provided an epoch-size argument, which was useful for this.
@sandeep-krishnamurthy Can you add labels?
Can you tell me the details of how to set epoch-size? @feiyuvl
This doc says it is better to set epoch_size explicitly in dist_sync mode, but it does not explain how to set it.
@XiaotaoChen Sorry, MXNet has removed the epoch-size argument; it only exists in older versions.
Then how can we solve the problem of each worker having a different total number of iterations in the current version? @feiyuvl
@XiaotaoChen This doc http://newdocs.readthedocs.io is outdated; it was created by a contributor a long time ago and has not been updated.
I'll ping the internal team to check how to resolve it.
@mli Have there been any updates? This issue is also causing annoyance for me.
Hello, please take a look at the PR above as an example of how we can avoid this hang.
Thanks @rahul003