Apex: Training gets stuck when using SyncBN

Created on 19 Dec 2018 · 18 Comments · Source: NVIDIA/apex

DistributedDataParallel works great for me. But when I use it together with the synchronized batch normalization, either the Python version or the optimized version, the training will get stuck after a few iterations and the code gives the following warning:

/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))

Any idea how I should debug it?

heilaw

All 18 comments

https://github.com/pytorch/pytorch/issues/11727 matches your warnings, but shouldn't be related to sync batchnorm. Does the problem occur when you move from non-sync batchnorm to sync batchnorm, with no other changes?

mcarilli on 20 Dec 2018

Yes, this occurs when I change from non-sync BN to sync BN with no other changes.

heilaw on 20 Dec 2018

@jjsjann123 have you ever observed this behavior?

mcarilli on 20 Dec 2018

No I haven't seen this before. @heilaw any chance that you can share a repro script?

jjsjann123 on 20 Dec 2018

I will see if I can share a script that reproduces this issue.

I just found that if I remove tqdm from my code, the code won't give the warning but the training still gets stuck after a few iterations.

heilaw on 20 Dec 2018

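I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. However, I don't think that's a good idea. get_default_group is not available in PyTorch 1.0, so that creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway. We may not need that line.

After I removed that line and process_group in both torch.distributed.all_reduce and torch.distributed.all_gather, the training now works, even with tqdm.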

heilaw on 20 Dec 2018

👍3
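
The difference between the two patterns in the comment above can be sketched without a GPU. This is a minimal model, not the apex code: `CountingDist` is a hypothetical stand-in for `torch.distributed` that only counts calls, and its simplified `new_group` and `all_reduce` signatures are assumptions made for illustration.

```python
class CountingDist:
    """Hypothetical stand-in for torch.distributed; it only counts calls."""

    def __init__(self):
        self.groups_created = 0
        self.reductions = 0

    def new_group(self):
        # Each call hands out a fresh, opaque group handle.
        self.groups_created += 1
        return object()

    def all_reduce(self, tensor, group=None):
        # No real communication happens here; the call is just recorded.
        self.reductions += 1
        return tensor


def forward_new_group_each_call(dist, x):
    # Pattern described in the thread: a new group on every forward call.
    group = dist.new_group()
    dist.all_reduce(x, group=group)
    return x


def forward_default_group(dist, x):
    # Pattern after the fix: no group argument, so the default group is used.
    dist.all_reduce(x)
    return x


def groups_created_after(forward, steps):
    """Run `forward` for `steps` iterations and report how many groups exist."""
    dist = CountingDist()
    for _ in range(steps):
        forward(dist, [0.0])
    return dist.groups_created


if __name__ == "__main__":
    print(groups_created_after(forward_new_group_each_call, 100))  # 100
    print(groups_created_after(forward_default_group, 100))        # 0
```

In this model the first pattern creates one group per iteration per SyncBN layer, while the second creates none, which is consistent with the later report in this thread that models with more BN layers get stuck earlier.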

Good catch!
I remember the discussion we had with Teng & Carilli on the torch.distributed APIs last week, when they were updating the API. I haven't gone back and updated apex SyncBatchNorm yet.

Thanks for sharing your fix. I'll push a fix to apex soon.

jjsjann123 on 20 Dec 2018

> I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. However, I don't think that's a good idea. get_default_group is not available in PyTorch 1.0, so that creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway. We may not need that line.
>
> After I removed that line and process_group in both torch.distributed.all_reduce and torch.distributed.all_gather, the training now works, even with tqdm.

I still had the same problem with the latest version. I then used this method and it solved the problem.
Very good.

wangxiaodong1021 on 31 Dec 2018

I hit the same problem with the latest version of apex! How should I change torch.distributed.all_reduce and torch.distributed.all_gather to fix it? @heilaw @jjsjann123 @mcarilli Thanks for your reply!

donnyyou on 6 Sep 2019

@donnyyou I also got the same problem. Did you find a way to fix it?

kkjh0723 on 23 Sep 2019

Hmmm, the process group problem should have been fixed a while back. What issues are we having here? Could you elaborate?

jjsjann123 on 23 Sep 2019

> @donnyyou I also got the same problem. Did you find a way to fix it?

https://github.com/donnyyou/torchcv/blob/acbb8f68e5c6f63a0f30e41267481f523ea3a234/scripts/seg/ade20k/run_fs_res101_annn_ade20k_seg.sh#L30
Please refer to this script! @kkjh0723

donnyyou on 24 Sep 2019

@donnyyou Thank you for your help! I will try your way and check if it works for me.
@jjsjann123 It seems the problem is the same as @heilaw's, but without the warning since I don't use tqdm. The training gets stuck at exactly the same iteration if the network model is the same. If I change the model, it gets stuck at a different iteration. It seems that the more BN layers the model has, the earlier the training gets stuck.
Actually, the issue occurred after I started to use DDP, amp and convert_syncbn_model at the same time. I'm testing without syncbn now to make sure that the problem is caused by syncbn. It will take a few days...

kkjh0723 on 24 Sep 2019
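
When training stalls at a fixed iteration like this, one generic way to see where each rank is blocked is Python's standard `faulthandler` module, which can dump the stacks of all threads once a timeout expires. The sketch below is not from the thread: `run_with_hang_watchdog`, `step_fn`, and the 300-second default are hypothetical names and values chosen for illustration.

```python
import faulthandler
import sys


def run_with_hang_watchdog(step_fn, num_steps, timeout_s=300.0, log=None):
    """Call step_fn(i) for each step; dump all thread stacks if a step stalls.

    The dump shows Python-level frames only, so a rank blocked inside a
    collective appears at the Python line that issued the call.
    """
    log = sys.stderr if log is None else log
    for i in range(num_steps):
        # Arm the watchdog: if this step runs longer than timeout_s,
        # tracebacks are written to `log` and the process keeps running.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)
        try:
            step_fn(i)
        finally:
            # Disarm it so a finished step never triggers a dump.
            faulthandler.cancel_dump_traceback_later()
    return num_steps
```

Running this in every rank, each with its own log file, and comparing the dumps shows whether the ranks are waiting in the same call or in different ones.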

@jjsjann123 I found that training does not get stuck when I remove the convert_syncbn_model function.

kkjh0723 on 27 Sep 2019

> @jjsjann123 I found that training does not get stuck when I remove the convert_syncbn_model function.

Same as you. You could refer to my solution!

donnyyou on 27 Sep 2019

@kkjh0723 @donnyyou Maybe a deadlock with NCCL? I don't know how that could be deterministic. It's mysterious how disabling the low-latency NCCL algorithm solves that. I'll ask the NCCL guys about it.

We can also try setting delay_allreduce to True in apex::DDP (https://nvidia.github.io/apex/parallel.html). This delays the all-reduce of gradients until the end of the backward pass, which rules out the possibility of the SyncBN reduce call deadlocking with the gradient NCCL calls.

jjsjann123 on 28 Sep 2019
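
The suggestion above, combined with the earlier experiment of removing convert_syncbn_model, can be sketched as a small wrapper. `wrap_for_distributed` and its `use_syncbn` toggle are hypothetical helpers, not part of apex. The converter and the DDP class are passed in as arguments so the sketch runs without apex installed; on a real setup they would be `apex.parallel.convert_syncbn_model` and `apex.parallel.DistributedDataParallel`, whose `delay_allreduce` argument is the one mentioned in the comment.

```python
def wrap_for_distributed(model, convert_syncbn, ddp_cls,
                         use_syncbn=True, delay_allreduce=True):
    """Optionally convert BN layers to SyncBN, then wrap the model in DDP.

    use_syncbn=False reproduces the "remove convert_syncbn_model" experiment.
    delay_allreduce=True asks DDP to all-reduce gradients only at the end of
    the backward pass, as suggested in the thread.
    """
    if use_syncbn:
        model = convert_syncbn(model)
    return ddp_cls(model, delay_allreduce=delay_allreduce)


# Intended use (assumes apex, CUDA, and an initialized process group):
#   from apex.parallel import DistributedDataParallel, convert_syncbn_model
#   model = wrap_for_distributed(model, convert_syncbn_model,
#                                DistributedDataParallel)
```

Flipping one argument at a time (SyncBN on or off, delayed all-reduce on or off) helps narrow down which combination triggers the hang.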

Has this been fixed? I have the same issue.