Similar behavior as reported by others, e.g. #166.
Env: Ubuntu 16.04.5, nvidia driver 410.78, cuda 9.0/cudnn 7.2.1. Built from source (v0.4.1 branch).
Hardware is 8 Nvidia 1080 Tis.
Anyone saw similar issue? Any suggestion? Will try v1.0.0 next. Thx.
While 8 GPUs consistently get stuck, 2 to 5 GPUs work from time to time -- not always, but they get going if I retry enough times (the behavior also changes depending on whether training is launched with nohup or run directly).
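A sketch of how one might get more signal on where the multi-GPU run hangs, assuming the hang is inside NCCL and assuming the standard dist_train.sh launcher (the config path and GPU count are placeholders):

```bash
# Print NCCL init/transport logs so the point of the hang shows up in stdout.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./tools/dist_train.sh configs/cascade_mask_rcnn_r101_fpn_1x.py 8

# Quick sanity check: disable peer-to-peer GPU transfers, a common hang source on multi-GPU consumer boards.
NCCL_P2P_DISABLE=1 ./tools/dist_train.sh configs/cascade_mask_rcnn_r101_fpn_1x.py 8
```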
Try pip install torch==0.4.1?
I'm running into the same problem. My system: Ubuntu 14.04, CUDA 8.0.61, cuDNN 5.1.0, Python 3.6.7. How can I solve it?
2018-12-28 14:42:37,927 - INFO - Distributed training: True
2018-12-28 14:42:38,475 - INFO - load model from: modelzoo://resnet101
2018-12-28 14:42:38,957 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer4.1.bn1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer3.9.bn2.num_batches_tracked, layer3.12.bn3.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.11.bn3.num_batches_tracked, layer3.11.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.13.bn1.num_batches_tracked, layer3.22.bn1.num_batches_tracked, layer3.10.bn3.num_batches_tracked, layer3.7.bn3.num_batches_tracked, layer3.14.bn1.num_batches_tracked, layer3.20.bn3.num_batches_tracked, layer3.12.bn2.num_batches_tracked, layer3.15.bn2.num_batches_tracked, layer3.22.bn3.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.19.bn2.num_batches_tracked, layer3.11.bn2.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.13.bn3.num_batches_tracked, layer3.17.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer3.14.bn3.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.21.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.16.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.15.bn1.num_batches_tracked, layer3.15.bn3.num_batches_tracked, layer3.10.bn2.num_batches_tracked, layer3.21.bn1.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.7.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.6.bn3.num_batches_tracked, layer3.17.bn1.num_batches_tracked, layer3.18.bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer3.22.bn2.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer3.20.bn1.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer2.3.bn2.num_batches_tracked, bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.18.bn3.num_batches_tracked, layer3.9.bn1.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer3.8.bn2.num_batches_tracked, layer3.20.bn2.num_batches_tracked, layer3.9.bn3.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer3.17.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.6.bn1.num_batches_tracked, layer3.12.bn1.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.19.bn3.num_batches_tracked, layer3.19.bn1.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer3.10.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.8.bn3.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer3.7.bn1.num_batches_tracked, layer3.18.bn2.num_batches_tracked, layer3.13.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer3.6.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.14.bn2.num_batches_tracked, layer3.16.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer3.16.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer3.21.bn2.num_batches_tracked, 
layer1.0.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer3.8.bn1.num_batches_tracked
loading annotations into memory...
loading annotations into memory...
Done (t=10.53s)
creating index...
Done (t=10.57s)
creating index...
index created!
index created!
2018-12-28 14:42:53,295 - INFO - Start running, host: yan@2x1080Ti-36, work_dir: /home1/mmdetection/work_dirs/cascade_mask_rcnn_r101_fpn_1x
2018-12-28 14:42:53,295 - INFO - workflow: [('train', 1)], max: 12 epochs
@YanHengxu update nvidia-driver to 410.78
Thank you, I'll try it.
Hitting the same issue. Has anyone solved this problem?
me too.
I'm using a GTX 1080, NVIDIA driver 418.61, CUDA 10.0, cuDNN 7.1.4, PyTorch 1.1.0, and after loading annotations into memory it gets stuck in the same way. Maybe it's due to a similar issue.
I experience exactly the same problem.
By using a debugger I found that the hang occurs here, where it tries to read the image.
At that point, when execution works, the image is an mmcv DataContainer wrapping a torch tensor; when it doesn't work, execution hangs while accessing it.
I suspect the cause of the error is not there, but somewhere earlier, when the data is being moved to the devices.
Any lead will be welcomed.
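One way to see exactly where a stuck process is waiting, without stopping it: dump the Python stacks with the third-party py-spy tool (a sketch; <PID> is the hung training process, and ptrace permission may be needed, e.g. --cap-add=SYS_PTRACE inside Docker):

```bash
pip install py-spy
# Dump the Python stack of every thread in the hung process.
py-spy dump --pid <PID>
```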
To run this, I use a docker container that I created from nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04.
torch is installed using pip. mmdet is installed from source using python3 compile.sh and python3 setup.py install. I already tried installing other pytorch versions and also pytorch from source.
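One thing worth ruling out in a Docker setup like this (an assumption on my part, not a confirmed fix): DataLoader workers pass tensors through /dev/shm, and the small default shared-memory segment in a container is a known cause of silent hangs. Starting the container with a larger segment rules that out:

```bash
# Enlarge /dev/shm (or use --ipc=host) so DataLoader workers do not stall on shared memory.
docker run --runtime=nvidia --shm-size=8g -it nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04 /bin/bash
```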
My system:
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 418.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
$ python3 -m pip list
Package Version
addict 2.2.1
certifi 2019.6.16
chardet 3.0.4
cycler 0.10.0
Cython 0.29.10
idna 2.8
kiwisolver 1.1.0
matplotlib 3.0.3
mmcv 0.2.8
mmdet 0.6.0+53c647e
numpy 1.16.4
opencv-python 4.1.0.25
Pillow 6.0.0
pip 19.1.1
pycocotools 2.0.0
pyparsing 2.4.0
python-dateutil 2.8.0
PyYAML 5.1.1
requests 2.22.0
setuptools 20.7.0
six 1.12.0
terminaltables 3.1.0
torch 1.1.0
torchvision 0.3.0
urllib3 1.25.3
wheel 0.29.0
same problem here
same problem
python tools/train.py configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc0712.py
2019-09-30 22:47:05,823 - INFO - Distributed training: False
2019-09-30 22:47:06,037 - INFO - load model from: torchvision://resnet50
2019-09-30 22:47:06,154 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
2019-09-30 22:47:08,450 - INFO - Start running, host: lab405@lab-405, work_dir: /home/data/whj2/faster-rcnn-kitti/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_voc0712
2019-09-30 22:47:08,450 - INFO - workflow: [('train', 1)], max: 4 epochs
same problem
I solved this problem by changing the 'difficult' item in my own VOC format dataset to 0.
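For anyone who wants to apply the same workaround, a minimal sketch (the annotation path is a placeholder for your own VOC-style dataset) that rewrites every difficult flag to 0:

```bash
# Set <difficult>1</difficult> to <difficult>0</difficult> in all VOC XML annotation files.
sed -i 's#<difficult>1</difficult>#<difficult>0</difficult>#g' VOCdevkit/VOC2007/Annotations/*.xml
```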
I'm hitting the same problem, but only when I use multiple GPUs. Has anyone solved it?
At my company we never discovered the solution to this and ended up working with the "--gpus" flag of tools/train.py... :(
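For reference, the non-distributed fallback mentioned above looks roughly like this (a sketch; the config file is a placeholder, and --gpus is the flag tools/train.py exposed at the time):

```bash
# Single-process multi-GPU (DataParallel) training instead of the distributed launcher.
python tools/train.py configs/cascade_mask_rcnn_r101_fpn_1x.py --gpus 4
```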
Hi, can you share more details on how you worked around this problem?
thx
Having a similar issue:
The training process works with 1 GPU, but gets stuck with multiple GPUs.
NVIDIA RTX 2080 Ti, driver version 418.87.00, CUDA version 10.0
got the same problem, can I look forward to a fix? @hellock
same problem
I ran into the same problem with the v1.2.0 code when training with multiple GPUs on one machine, but after updating to v2.0.0 the problem seems to be solved.
I'm trying to train mask_rcnn with one class on a single machine with multiple GPUs.
It gets stuck. I tried v2.0.0 and v2.1.0.