MMDetection: train SSD300 on VOC

Created on 13 Mar 2019 · 5 Comments · Source: open-mmlab/mmdetection

python train.py --work_dir '/home/hs/hs/237014845/HuaWei/mmdetection-master/weights' --seed 100 '/home/hs/hs/237014845/HuaWei/mmdetection-master/configs/pascal_voc/ssd300_voc.py'
2019-03-13 09:43:47,761 - INFO - Distributed training: False
2019-03-13 09:43:47,761 - INFO - Set random seed to 100
2019-03-13 09:43:48,000 - INFO - load model from: open-mmlab://vgg16_caffe
2019-03-13 09:43:48,050 - WARNING - missing keys in source state_dict: extra.4.weight, extra.7.weight, extra.1.bias, extra.1.weight, l2_norm.weight, extra.2.bias, extra.7.bias, extra.4.bias, extra.0.bias, extra.3.bias, extra.0.weight, extra.5.bias, extra.2.weight, extra.6.weight, extra.3.weight, extra.5.weight, extra.6.bias

2019-03-13 09:43:50,310 - INFO - Start running, host: hs@hs-System-Product-Name, work_dir: /home/hs/hs/237014845/HuaWei/mmdetection-master/weights
2019-03-13 09:43:50,311 - INFO - workflow: [('train', 1)], max: 24 epochs
2019-03-13 09:44:16,016 - INFO - Epoch [1][50/41378] lr: 0.00100, eta: 5 days, 21:48:20, time: 0.514, data_time: 0.006, loss_cls: 19.5927, loss_reg: 3.8320, loss: 23.4247
/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCTensorScatterGather.cu:124: void THCudaTensor_scatterKernel(TensorInfo, TensorInfo, TensorInfo, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [0,0,0] Assertion indexValue >= 0 && indexValue < tensor.sizes[dim] failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
File "train.py", line 90, in <module>
main()
File "train.py", line 86, in main
logger=logger)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 59, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmdet-0.6rc0+unknown-py3.6.egg/mmdet/apis/train.py", line 121, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 355, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 268, in train
self.call_hook('after_train_iter')
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 228, in call_hook
getattr(hook, fn_name)(self)
File "/home/hs/anaconda3/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
runner.outputs['loss'].backward()
File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/generated/../THCReduceAll.cuh:317
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1549628766161/work/aten/src/THC/THCCachingAllocator.cpp:470)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fb752f6ccf5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x122a0d0 (0x7fb75723f0d0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::TensorImpl::release_resources() + 0x50 (0x7fb7536d8c30 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: + 0x2a836b (0x7fb750cea36b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x30eff0 (0x7fb750d50ff0 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: torch::autograd::deleteFunction(torch::autograd::Function*) + 0x2f0 (0x7fb750cecd70 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x45 (0x7fb7741887f5 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: torch::autograd::Variable::Impl::release_resources() + 0x4a (0x7fb750f5f1ba in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: + 0x12148b (0x7fb7741a048b in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x31a49f (0x7fb77439949f in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x31a4e1 (0x7fb7743994e1 in /home/hs/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #26: __libc_start_main + 0xf0 (0x7fb78ff0f830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
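
The assertion in the log above (`indexValue >= 0 && indexValue < tensor.sizes[dim]` inside a scatter kernel) is a bounds check on an index tensor. One possible cause, which the thread does not confirm, is a class label that falls outside the range the classification head expects. The following is a minimal sketch for checking a list of labels before training; the function name `out_of_range_labels` and the value `num_classes=21` (20 VOC categories plus background) are assumptions made for illustration and are not part of MMDetection.

```python
def out_of_range_labels(labels, num_classes):
    """Return (position, label) pairs whose label is outside [0, num_classes)."""
    return [
        (position, label)
        for position, label in enumerate(labels)
        if not 0 <= label < num_classes
    ]


# Hypothetical label list; 21 and -1 are deliberately invalid for 21 classes.
bad = out_of_range_labels([0, 5, 20, 21, -1], num_classes=21)
print(bad)
```

Because CUDA kernels run asynchronously, the Python traceback may point at a later call such as `backward()` rather than the operation that failed. Re-running with the environment variable `CUDA_LAUNCH_BLOCKING=1` set usually makes the error surface at the offending call.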

All 5 comments

You should decrease the learning rate (lr) if you train the model on a single GPU.
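
The advice above is usually applied as the linear scaling rule: the learning rate is scaled in proportion to the total batch size. A minimal sketch follows, assuming the config's lr of 0.001 (the value shown in the training log) was tuned for 8 GPUs with 8 images per GPU. That reference setup is an assumption, not something stated in the thread, and `scale_lr` is a hypothetical helper rather than an MMDetection function.

```python
def scale_lr(base_lr, base_batch_size, gpus, imgs_per_gpu):
    """Scale a learning rate linearly with the total batch size."""
    if base_batch_size <= 0:
        raise ValueError("base_batch_size must be positive")
    return base_lr * (gpus * imgs_per_gpu) / base_batch_size


# Assumed reference setup: 8 GPUs x 8 images per GPU = total batch size 64.
reference_lr = 1e-3  # lr reported in the training log
single_gpu_lr = scale_lr(reference_lr, base_batch_size=64, gpus=1, imgs_per_gpu=8)
print(single_gpu_lr)
```

The resulting value would then replace `lr` in the optimizer settings of the config file. As the later comments note, lowering the learning rate alone did not resolve the crash for everyone.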

@yhcao6 thanks

I decreased the lr, but it does not work.

How did you fix this? I have the same issue.

I have the same problem. How can it be solved?
