Apex: Training gets stuck when using SyncBN

Created on 19 Dec 2018 · 18 Comments · Source: NVIDIA/apex

DistributedDataParallel works great for me. But when I use it together with the synchronized batch normalization, either the Python version or the optimized version, the training will get stuck after a few iterations and the code gives the following warning:

/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))

Any idea how I should debug it?

All 18 comments

https://github.com/pytorch/pytorch/issues/11727 matches your warnings, but shouldn't be related to sync batchnorm. Does the problem occur when you move from non-sync batchnorm to sync batchnorm, with no other changes?

Yes, this occurs when I change from non-sync BN to sync BN with no other changes.

@jjsjann123 have you ever observed this behavior?

No I haven't seen this before. @heilaw any chance that you can share a repro script?

I will see if I can share a script that reproduces this issue.

I just found that if I remove tqdm from my code, the code won't give the warning but the training still gets stuck after a few iterations.

I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. However, I don't think that's a good idea. get_default_group is not available in PyTorch 1.0, so that creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway. We may not need that line.

After I removed that line and process_group in both torch.distributed.all_reduce and torch.distributed.all_gather, the training now works, even with tqdm.
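The pattern described above can be illustrated with a small sketch. It is not the apex source: `FakeBackend`, `sync_bn_forward_buggy`, `sync_bn_forward_fixed`, and `count_groups` are hypothetical stand-ins that only count how many groups get created. In real `torch.distributed`, `new_group` is itself a collective call that every process has to enter, so creating one per forward pass is costly and is a plausible source of the hang.

```python
class FakeBackend:
    """Hypothetical single-process stand-in for a distributed backend."""

    def __init__(self):
        self.groups_created = 0

    def new_group(self):
        # Count every group creation so the two patterns can be compared.
        self.groups_created += 1
        return object()

    def all_reduce(self, value, group=None):
        # With a single rank, a reduction is just the identity.
        return value


def sync_bn_forward_buggy(backend, batch_mean):
    # Pattern from the report: a fresh group is created on every forward call.
    group = backend.new_group()
    return backend.all_reduce(batch_mean, group=group)


def sync_bn_forward_fixed(backend, batch_mean):
    # Pattern after the fix: no group argument, so the default group is used.
    return backend.all_reduce(batch_mean)


def count_groups(forward_fn, iterations):
    """Run `forward_fn` for `iterations` steps and return the groups created."""
    backend = FakeBackend()
    for _ in range(iterations):
        forward_fn(backend, 0.5)
    return backend.groups_created


if __name__ == "__main__":
    print("buggy:", count_groups(sync_bn_forward_buggy, 10))
    print("fixed:", count_groups(sync_bn_forward_fixed, 10))
```

With more SyncBN layers there are more forward calls per iteration, so in the buggy pattern groups would pile up faster.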

Good catch!
I remember the discussion between Teng & Carilli about the torch.distributed APIs last week, when they were updating the API. I haven't gotten back to updating apex SyncBatchNorm yet.

Thanks for sharing your fix. I'll push a fix to apex soon.

I still had the same problem with the latest version. I then applied @heilaw's fix from above (removing the process_group = group_creator() line and the process_group argument in both torch.distributed.all_reduce and torch.distributed.all_gather), and that solved it.
Very good

I ran into the same problem with the latest version of apex! How can I fix torch.distributed.all_reduce and torch.distributed.all_gather? @heilaw @jjsjann123 @mcarilli Thanks for your reply!

@donnyyou I also got the same problem. Did you find a way to fix it?

Hmmm, the process group problem should have been fixed a while back. What issues are we seeing here? Could you elaborate?

https://github.com/donnyyou/torchcv/blob/acbb8f68e5c6f63a0f30e41267481f523ea3a234/scripts/seg/ade20k/run_fs_res101_annn_ade20k_seg.sh#L30
Please refer to the script! @kkjh0723

@donnyyou thank you for your help! I will try your way and check if it works for me.
@jjsjann123 it seems to be the same problem as @heilaw's, but without the warning since I don't use tqdm. Training gets stuck at exactly the same iteration as long as the network model is the same. If I change the model, it gets stuck at a different iteration. It seems that the more BN layers the model has, the earlier training gets stuck.
Actually, the issue appeared after I started using DDP, amp, and convert_syncbn_model at the same time. I'm now testing without SyncBN to make sure the problem is caused by SyncBN. It will take a few days...

@jjsjann123 I found that training does not stop when I remove convert_syncbn_model function.

Same as you. You could refer to my solution!

@kkjh0723 @donnyyou maybe a deadlock with NCCL? I don't know how that could be deterministic. It's mysterious how disabling the low-latency NCCL algorithm solves that. I'll ask the NCCL team about it.
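For readers who want to try the workaround mentioned here from Python rather than a shell script, a minimal sketch follows. The thread does not name the exact setting, so this assumes it is NCCL's `NCCL_LL_THRESHOLD` environment variable, which is documented for older NCCL releases and may be ignored by newer ones. The helper name `disable_nccl_low_latency` is hypothetical.

```python
import os


def disable_nccl_low_latency(env=None):
    """Set NCCL_LL_THRESHOLD=0 in `env` (defaults to os.environ).

    Assumption: this is the "low latency" setting the thread refers to.
    It has to be set before the NCCL communicator is created, i.e. before
    the distributed process group is initialized.
    """
    env = os.environ if env is None else env
    env["NCCL_LL_THRESHOLD"] = "0"
    return env


if __name__ == "__main__":
    disable_nccl_low_latency()
    print(os.environ["NCCL_LL_THRESHOLD"])
```

Exporting the variable in the launch script before starting the training processes should have the same effect.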

We can also try setting delay_allreduce to True in apex DDP (https://nvidia.github.io/apex/parallel.html). This delays the all-reduce of gradients until the end of the backward pass, which rules out the possibility of the SyncBN reduce call deadlocking with other NCCL calls.
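A minimal sketch of that suggestion, assuming apex is installed, the model is already on its CUDA device, and the process group has been initialized elsewhere. The wrapper name `build_ddp_model` is hypothetical; `convert_syncbn_model` and the `delay_allreduce` argument come from `apex.parallel`.

```python
def build_ddp_model(model, delay_allreduce=True):
    """Convert BN layers to apex SyncBN, then wrap the model in apex DDP.

    Assumes torch.distributed is already initialized and `model` lives on
    the GPU assigned to this process.
    """
    # Imported lazily so this sketch can be imported on a machine without apex.
    from apex.parallel import DistributedDataParallel, convert_syncbn_model

    model = convert_syncbn_model(model)
    # delay_allreduce=True postpones gradient all-reduce until backward ends,
    # so it cannot overlap with SyncBN's own reduction calls.
    return DistributedDataParallel(model, delay_allreduce=delay_allreduce)
```

Delaying the all-reduce gives up the overlap of communication with the backward computation, so it may cost some throughput. Here it is mainly a way to test whether overlapping collectives cause the hang.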

Has this been fixed? I have the same issue.

Same issue for me, backward hangs when using sync batchnorm
