Describe the bug
Error traceback
I ran the following command to train mask_rcnn_r50_fpn_fp16:
==============================================
NUM_GPUS=4
CONFIG=mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py
WORK_DIR=work_dirs/mask_rcnn_r50_fpn_fp16_1x
tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR
==============================================
Because the traceback is too long, I will paste it at the end.
Reproduction details
Environment
When I try to train the fp16 model, I get
CUDA error: device-side assert triggered
(insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
followed by many repetitions of this message:
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [80,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
When I comment out the fp16 configuration, the error does not occur.
https://github.com/open-mmlab/mmdetection/blob/master/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py#L2
Error message
/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
Directory exists
loading annotations into memory...
2019-07-02 00:56:29,056 - INFO - Distributed training: True
2019-07-02 00:56:29,549 - INFO - load model from: modelzoo://resnet50
loading annotations into memory...
2019-07-02 00:56:29,828 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer3.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer2.2.bn3.num_batches_tracked
loading annotations into memory...
loading annotations into memory...
Done (t=12.76s)
creating index...
Done (t=12.47s)
creating index...
index created!
Done (t=12.82s)
creating index...
index created!
index created!
Done (t=13.82s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=1.77s)
creating index...
index created!
Done (t=2.36s)
creating index...
Done (t=2.39s)
creating index...
index created!
index created!
Done (t=2.53s)
creating index...
index created!
2019-07-02 00:56:53,981 - INFO - Start running, host: root@b6940c72ef4f, work_dir: /home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
2019-07-02 00:56:53,981 - INFO - workflow: [('train', 1)], max: 12 epochs
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [96,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [97,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [98,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
... omitted...
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [126,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
    main()
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
    logger=logger)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
    losses = model(**data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
    scale_factor, cfg, rescale)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
    self.target_stds, img_shape)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
    means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string
frame #1:
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fe9d571b661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fe9d4d160ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4:
frame #5:
frame #6:
frame #7:
frame #8:
frame #9:
frame #10:
frame #11:
frame #12:
frame #13:
frame #14:
frame #15:
frame #16:
frame #17:
frame #18:
frame #19:
frame #20: PyDict_SetItem + 0x3da (0x5575e5387d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5575e539084f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x5575e53f6b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x5575e5461961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x5575e546beb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x5575e5333b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fea04b54830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
__main__, mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py', '--local_rank=0', '/home/user/Desktop/workspace_zacurr/mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py', '--launcher', 'pytorch', '--validate', '--work_dir', '/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x']' died with
I just ran the command
./tools/dist_train.sh configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py 4
But I did not get this error. Does this error appear every time?
Yes. There is only a trivial difference between the commands that you and I used.
Also, under the fp32 (default) setting, there is no error.
I will try this on another server (CUDA 10) when its GPUs are not busy, maybe in a week.
@zacurr, when I add a random_scale function to extra_aug.py, I encounter the same problem. I guess the reason is that my bboxes are out of range. Am I right? When I remove the random_scale function, the model trains normally.
@yhcao6, when will you release the data pipeline to master? I am waiting for that part. I am writing my own functions: the rotate function is finished, but the scale part hits the error above. What should I do?
The data pipeline will not add extra operations such as rotation. There may be some problems in your code. Could you give a minimal example that reproduces this error?
@yhcao6, the code is based on https://github.com/Paperspace/DataAugmentationForObjectDetection . I have checked it and found no problem after data augmentation, but when I use random_shift and random_scale it does not work.
@yhcao6, I found the same problem in a PyTorch issue, https://github.com/pytorch/pytorch/issues/21136 . What should I do to fix this bug?
Could you give me a minimal example to reproduce the bug, so that I can check whether there is something wrong in your code? Or maybe there is a bug in this repo.
I have sent the code to your Gmail and am waiting for your reply. Thanks.
@gittigxuy Have you fixed your problem? I hit the same one when training on my custom dataset.
No. Did you change any other code, or did you get this error just by training on your own data? I get this error after adding some data augmentation functions; if I do not change the code, I can train normally.
Yes, I changed the code to apply it to text detection. I converted the labels into COCO format and used the original CocoDataset, and no error occurred. But when I modified the code to add random scale and random crop, the error appeared.
Same problem. I am waiting for the author to deal with it; I have sent him the code.
Thanks. If I fix it, I will tell you.
Which data augmentation did you add? Could I add you on QQ or WeChat? I only added random_rotate and it works fine.
@gittigxuy Sorry for the late response! I have just solved the problem. I found that it was caused by a mismatch among the numbers of gt_bboxes, gt_labels and gt_masks: I filtered out some bboxes outside the cropping range when applying the crop operation, but forgot to filter the gt_labels and gt_masks.
So I guess your problem has the same cause?
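The fix described above can be sketched as follows: compute one boolean keep-mask while cropping and apply it to every per-instance array. This is only an illustration under assumptions. The helper name `filter_annotations`, the centre-inside-window rule and the array shapes are hypothetical and are not taken from mmdetection or from the code discussed in this thread. It only filters; it does not crop the masks or shift the boxes.

```python
import numpy as np


def filter_annotations(gt_bboxes, gt_labels, gt_masks, crop_window):
    """Drop instances whose box centre lies outside crop_window.

    Hypothetical helper for illustration only.
    gt_bboxes:   (N, 4) array of [x1, y1, x2, y2]
    gt_labels:   (N,) array of class indices
    gt_masks:    (N, H, W) array of instance masks
    crop_window: (x1, y1, x2, y2) of the crop region
    """
    cx = (gt_bboxes[:, 0] + gt_bboxes[:, 2]) / 2
    cy = (gt_bboxes[:, 1] + gt_bboxes[:, 3]) / 2
    wx1, wy1, wx2, wy2 = crop_window
    keep = (cx >= wx1) & (cx < wx2) & (cy >= wy1) & (cy < wy2)
    # The same mask indexes all three arrays, so their lengths stay equal.
    return gt_bboxes[keep], gt_labels[keep], gt_masks[keep]


bboxes = np.array([[10, 10, 50, 50], [200, 200, 260, 260]], dtype=np.float32)
labels = np.array([1, 3])
masks = np.zeros((2, 300, 300), dtype=np.uint8)

bboxes, labels, masks = filter_annotations(bboxes, labels, masks, (0, 0, 100, 100))
# A cheap sanity check worth running after every augmentation step.
assert len(bboxes) == len(labels) == len(masks)
```

Asserting equal lengths right after each augmentation fails on the CPU with a readable message, instead of surfacing later as a device-side assert.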
Thanks, maybe I have the same problem. Could you share your augmentation code with me? My email is [email protected]
No problem @gittigxuy
I have sent the code to your Gmail and am waiting for your reply. Thanks.
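import numpy as np

def clip_box(bbox, clip_box, alpha):
    # bbox_area is a helper defined elsewhere in the augmentation code (not shown here).
    ar_ = (bbox_area(bbox))
    # Clip every box to the clip_box region [x1, y1, x2, y2].
    x_min = np.maximum(bbox[:, 0], clip_box[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_box[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_box[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_box[3]).reshape(-1, 1)
    bbox = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))
    # Fraction of each box's area lost to clipping.
    delta_area = ((ar_ - bbox_area(bbox)) / ar_)
    mask = (delta_area < (1 - alpha)).astype(int)
    # Rows are dropped here, but the caller never sees which ones.
    bbox = bbox[mask == 1, :]
    return bbox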
This is the clip_box function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels.
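One possible way to address this, sketched under assumptions, is to return the keep-mask together with the clipped boxes so the caller can index the labels with it. The name `clip_box_with_mask` is hypothetical, and the `bbox_area` shown here is an assumed width-times-height definition, since the original helper is not shown in this thread.

```python
import numpy as np


def bbox_area(bbox):
    # Assumed definition: width * height for rows of [x1, y1, x2, y2, ...].
    return (bbox[:, 2] - bbox[:, 0]) * (bbox[:, 3] - bbox[:, 1])


def clip_box_with_mask(bbox, clip_window, alpha):
    """Clip boxes to clip_window and also return which rows were kept."""
    original_area = bbox_area(bbox)
    x_min = np.maximum(bbox[:, 0], clip_window[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_window[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_window[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_window[3]).reshape(-1, 1)
    clipped = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))
    # Same criterion as the original: keep boxes that lost less than
    # (1 - alpha) of their area. Zero-area input boxes are not handled.
    delta_area = (original_area - bbox_area(clipped)) / original_area
    keep = delta_area < (1 - alpha)
    return clipped[keep], keep


boxes = np.array([[0.0, 0.0, 10.0, 10.0], [90.0, 90.0, 130.0, 130.0]])
labels = np.array([2, 5])

clipped, keep = clip_box_with_mask(boxes, (0, 0, 100, 100), alpha=0.25)
labels = labels[keep]  # labels now line up with the surviving boxes
```

In this example the second box loses most of its area to clipping and is dropped, and its label is dropped with it.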
If I change my code to match yours but still get the same problem, what should I do?
I have the same problem. After I added some data to the dataset, I got this error:
`RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565287025495/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f5083808e37 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x12e14 (0x7f5083a40e14 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x165bf (0x7f5083a445bf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f50837f3fa4 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x140fc34 (0x7f50868b8c34 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: + 0x31a4bf0 (0x7f508864dbf0 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: + 0x3756d12 (0x7f5088bffd12 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f5088bffdbf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: + 0x37739b1 (0x7f5088c1c9b1 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f50837f3f50 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #10: + 0x1bb014 (0x7f50aece0014 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x40142b (0x7f50aef2642b in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x401461 (0x7f50aef26461 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #28: __libc_start_main + 0xf0 (0x7f50bdd19830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
`
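These are my annotations:
VOC20202020_000001.jpgThe VOC2020 DatabasePASCAL VOC2020flickr05003753person0012203680
I have already wasted 3 days on this and cannot solve the problem. Can anybody help me? Thank you very much.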
Maybe you can print the labels to make sure their maximum value is consistent with num_classes.
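A minimal sketch of that check follows. The helper `find_bad_labels` is hypothetical, and the valid range is passed in explicitly because it depends on the convention in use. The example assumes that num_classes counts the background class and that foreground labels run from 1 to num_classes - 1; adjust the bounds if your version uses a different convention.

```python
def find_bad_labels(labels, min_label, max_label):
    """Return (index, label) pairs that fall outside [min_label, max_label]."""
    return [
        (i, int(label))
        for i, label in enumerate(labels)
        if not (min_label <= int(label) <= max_label)
    ]


num_classes = 21  # assumed: 20 VOC classes + background
labels = [1, 15, 21, 7]  # made-up labels for illustration

bad = find_bad_labels(labels, 1, num_classes - 1)
print(bad)  # [(2, 21)]
```

Running such a check over the whole dataset on the CPU, before training, points at the offending annotation directly.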