Vision: Segmentation reference train script

Created on 21 Jul 2019 · 4 comments · Source: pytorch/vision

I am training the deeplabv3 model from torchvision on the 91 COCO classes, using the training script from the references. I launch training with the following command:
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --dataset coco --model deeplabv3_resnet101 --output-dir output/ -b 2 -j 4

However, training gets stuck with no error or any other output:
Epoch: [6] [ 8160/14525] eta: 2:14:49 lr: 0.008007956757936691 loss: 0.3471 (0.5539) time: 1.2725 data: 0.0021 max mem: 4783

Labels: awaiting response, models, reference scripts, needs reproduction, semantic segmentation

Most helpful comment

@fmassa In https://github.com/pytorch/pytorch/pull/22907 @pritamdamania87 is working on NCCL failure detection. On top of it, we can also honor the timeouts set on the process group, and we'd raise a timeout error in a case like this. Then when someone sees that error, and none of the processes have crashed, they can use whatever will be built in https://github.com/pytorch/pytorch/issues/22071 to figure out if there is a mismatch in the collectives that get called. None of this is available today though.

All 4 comments

You might be facing an NCCL / cuDNN deadlock, which is unfortunately very hard to debug or reproduce.

@pietern maybe it would be a good thing to add some more utility functions / checks in PyTorch to simplify the identification of those problems?
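In the meantime, a minimal user-side sketch (not an existing PyTorch or torchvision utility) that can help locate such a hang: register a stack-dumping signal handler near the top of train.py, then send SIGUSR1 to each stuck worker to see which call it is blocked in.

```python
# Hypothetical debugging aid, not part of the reference script: dump the Python
# stack of every thread when the process receives SIGUSR1. When training hangs,
# run `kill -USR1 <pid>` on each worker to see where it is blocked (e.g. inside
# a collective call or a DataLoader fetch).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```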

@fmassa My concern is that it got stuck at the same epoch and batch twice, i.e. epoch 6, iteration 8160. Also, when I resume from the last checkpoint with the --resume argument, the script loads that checkpoint but starts again from epoch 0.
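As a side note on the resume behaviour, here is a minimal sketch of how the epoch counter could be restored together with the model and optimizer state. It assumes the checkpoint dict also stores an 'epoch' entry, and the variable names (args, model_without_ddp, optimizer) only mirror the reference script's conventions.

```python
# Hypothetical resume logic; assumes the checkpoint was saved with an 'epoch'
# key, e.g. torch.save({'model': ..., 'optimizer': ..., 'epoch': epoch}, path).
import torch

start_epoch = 0
if args.resume:
    checkpoint = torch.load(args.resume, map_location='cpu')
    model_without_ddp.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    # Continue after the epoch that was saved instead of restarting at 0.
    start_epoch = checkpoint['epoch'] + 1

for epoch in range(start_epoch, args.epochs):
    ...  # train_one_epoch / evaluate as in the reference script
```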

This probably has to do with the particular data you are feeding in epoch 6, iteration 8160. Try checking whether there is anything special about that batch. To make things faster, you can probably skip the model entirely and just run through the data (see the sketch below).
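A minimal sketch of such a data-only pass, assuming `data_loader` is the same training DataLoader that train.py builds; the goal is to see whether the batch around iteration 8160 hangs or raises even without the model:

```python
# Hypothetical data-only loop: iterate over the training DataLoader without
# touching the model, to check whether a specific batch (e.g. around iteration
# 8160 of epoch 6) hangs or produces a malformed sample.
for i, (image, target) in enumerate(data_loader):
    if i % 500 == 0:
        print(f"batch {i}: image {tuple(image.shape)}, target {tuple(target.shape)}")
```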

@fmassa In https://github.com/pytorch/pytorch/pull/22907 @pritamdamania87 is working on NCCL failure detection. On top of it, we can also honor the timeouts set on the process group, and we'd raise a timeout error in a case like this. Then when someone sees that error, and none of the processes have crashed, they can use whatever will be built in https://github.com/pytorch/pytorch/issues/22071 to figure out if there is a mismatch in the collectives that get called. None of this is available today though.
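For reference, a minimal sketch of giving the process group an explicit timeout. This is not what the reference launch command configures, and with the NCCL backend the timeout only surfaces as an error when blocking waits are enabled (e.g. NCCL_BLOCKING_WAIT=1 in the environment), so treat it as an assumption about optional behaviour rather than the script's current setup.

```python
# Hypothetical setup: give the process group an explicit timeout so a
# mismatched or hung collective can surface as an error instead of blocking
# forever. With NCCL this requires blocking waits (e.g. NCCL_BLOCKING_WAIT=1).
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=30),
)
```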

