MMDetection: the training process may get stuck

Created on 11 Dec 2018  ·  15 comments  ·  Source: open-mmlab/mmdetection

After training for some iterations, GPU utilization may increase from about 50% to 100%, and then training stalls completely: no further iterations run and no error is thrown.

Most helpful comment

In my opinion, it is sometimes related to CUDA/driver interactions. I have some suggestions for you.

  1. If you are using CUDA 9, you should upgrade the NVIDIA driver to 396.51, since many people reported that this resolves random hangs in multi-GPU training.
  2. If option 1 does not work, I recommend installing from a clean Docker image; nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 is a good starting point. See #159

All 15 comments

My system is: CUDA 9, Python 3.5, and PyTorch 0.4.1


Thanks, I will try building PyTorch from source.

@hellock
I built PyTorch with the following steps:
git clone -b v0.4.1 --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install

My system info: Python 3.5 + CUDA 9 + cuDNN 7.0.5
With this source build, training still gets stuck and cannot complete a single iteration.
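
A quick sanity check (not from the thread) to confirm the source build actually picked up the expected CUDA/cuDNN versions:

$ python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
$ nvidia-smi   # shows the driver version the CUDA runtime will talk to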

Can you show me how you built PyTorch? Thank you.

Although it is a known issue that PyTorch sometimes gets stuck on V100 GPUs, your case looks strange. I will take a look tomorrow.

In my opinion, it is sometimes related to CUDA/driver interactions. I have some suggestions for you.

  1. If you are using CUDA 9, you should upgrade the NVIDIA driver to 396.51, since many people reported that this resolves random hangs in multi-GPU training.
  2. If option 1 does not work, I recommend installing from a clean Docker image; nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 is a good starting point. See #159 (a sketch of this route follows below).
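
For option 2, a minimal sketch of the clean-Docker route (the mount path is a placeholder; --shm-size is raised because PyTorch DataLoader workers use shared memory):

$ docker pull nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
$ nvidia-docker run -it --shm-size=8g \
    -v /path/to/mmdetection:/mmdetection \
    nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 /bin/bash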

Thanks, I will try it.

@hellock @thangvubk Thanks, the training process works well after upgrading the NVIDIA driver to 410.xx.

@miracle-fmh I met the same problem.
For Faster-r50-fpn, it throws "UserWarning: semaphore_tracker: There appear to be 8 leaked semaphores to clean up at shutdown."
For Mask-r50-fpn, it stops without any error.
For Retinanet-r50, it runs properly.
I use 4 Tesla P40 GPUs, CUDA 9.0, and PyTorch 0.4.1 (installed via "pip install pytorch_0.4.1_xxxx.whl"), with driver 410 (I have also tried 390, with the same result).
Can you suggest anything else to try?
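
When a run hangs silently like this, one way (not suggested in the thread) to see where each worker is blocked is to dump its Python stack with the third-party py-spy tool:

$ pip install py-spy
# <pid> is a stuck training process; attaching may require sudo
$ py-spy dump --pid <pid>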

Hi, I am also running into similar problems. Have you solved them yet? @KimSoybean

After compiling PyTorch from source, I no longer get stuck.

Hi, my environment is: Titan X + 410 driver + CUDA 10 + PyTorch 1.1.
I encounter the problem too.

@thangvubk Yeah, you are right. I tried the new Docker image you recommended and it works!

The problem is still there: training gets stuck, with no progress after loading the dataset.

I installed the driver and toolkit for CUDA 10.1, built PyTorch 1.2 from the latest source, followed the NVIDIA guide and installed NCCL2, and installed gcc 7.4.0 via apt-get.
Other dependencies such as mmcv and cython were installed via conda.
OS: Ubuntu 18.04
I tried both ./tools/dist_train.sh and non-distributed training as below.

$ python ./tools/train.py --work_dir work_dirs/ --validate --gpus 4 configs/retinanet_r101_fpn_1x.py
2019-07-19 07:50:51,719 - INFO - Distributed training: False
2019-07-19 07:50:52,319 - INFO - load model from: modelzoo://resnet101
2019-07-19 07:50:53,156 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.19.bn2.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer3.7.bn3.num_batches_tracked, layer3.11.bn1.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer3.16.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer3.15.bn2.num_batches_tracked, layer3.14.bn2.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer3.14.bn1.num_batches_tracked, layer3.12.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer3.6.bn2.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.19.bn1.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer3.8.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer3.10.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer3.13.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.15.bn3.num_batches_tracked, layer3.18.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.16.bn1.num_batches_tracked, layer3.21.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.7.bn1.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.7.bn2.num_batches_tracked, layer1.2.bn2.num_batches_tracked, bn1.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer3.18.bn2.num_batches_tracked, layer3.22.bn2.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.22.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.20.bn3.num_batches_tracked, layer3.13.bn1.num_batches_tracked, layer3.22.bn1.num_batches_tracked, layer3.12.bn3.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.11.bn3.num_batches_tracked, layer3.12.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer3.20.bn2.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.6.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer3.9.bn2.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer3.8.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer3.21.bn1.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.17.bn3.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer3.17.bn2.num_batches_tracked, layer3.8.bn1.num_batches_tracked, layer3.20.bn1.num_batches_tracked, layer3.17.bn1.num_batches_tracked, layer3.14.bn3.num_batches_tracked, layer3.6.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer3.13.bn3.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.16.bn3.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.11.bn2.num_batches_tracked, layer3.10.bn2.num_batches_tracked, layer3.21.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.19.bn3.num_batches_tracked, layer3.18.bn3.num_batches_tracked, layer3.9.bn1.num_batches_tracked, layer3.10.bn3.num_batches_tracked, layer3.15.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, 
layer3.9.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer1.0.bn1.num_batches_tracked

loading annotations into memory...
Done (t=0.64s)
creating index...
index created!
2019-07-19 07:50:56,395 - INFO - Start running, host: adam@adam-train-1, work_dir: /home/adam/code/mmdetection/work_dirs
2019-07-19 07:50:56,397 - INFO - workflow: [('train', 1)], max: 60 epochs
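
For the multi-GPU case, enabling NCCL's logging before launching may show whether the hang happens inside a collective operation (a debugging suggestion, not from the original comment; the arguments follow the usual dist_train.sh signature of config file then GPU count):

$ export NCCL_DEBUG=INFO
$ ./tools/dist_train.sh configs/retinanet_r101_fpn_1x.py 4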

Have you solved the problem?
