When training SSD with PyTorch 1.2, we get errors if we use only a single GPU. However, if we train in distributed mode, everything is fine. It is a little odd that the two modes behave differently.
Here are the commands I tried:
python tools/train.py /home/ubuntu/mmdetection/configs/ssd300_coco.py
./tools/dist_train.sh /home/ubuntu/mmdetection/configs/ssd300_coco.py 8
The CUDA version is 10.1.
Here is the error message from training with a single GPU:
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorScatterGather.cu:130: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 1]: block: [0,0,0], thread: [11,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
(the same assertion is repeated for the other threads, up to [31,0,0])
Traceback (most recent call last):
  File "tools/train.py", line 110, in <module>
    main()
  File "tools/train.py", line 106, in main
    logger=logger)
  File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 65, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/ubuntu/mmdetection/mmdet/apis/train.py", line 237, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 363, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 274, in train
    self.call_hook('after_train_iter')
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/runner.py", line 230, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/mmcv/runner/hooks/optimizer.py", line 17, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565272279342/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f55e61e6e37 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12e14 (0x7f55e641ee14 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x165bf (0x7f55e64225bf in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f55e61d1fa4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x141ece4 (0x7f55e92a5ce4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x31b3ca0 (0x7f55eb03aca0 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x3765dc2 (0x7f55eb5ecdc2 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f55eb5ece6f in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x3782a61 (0x7f55eb609a61 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f55e61d1f50 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x1ba9b4 (0x7f56172d19b4 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4000eb (0x7f56175170eb in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x400121 (0x7f5617517121 in /home/ubuntu/anaconda3/envs/pytorch-1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #28: __libc_start_main + 0xf0 (0x7f562628d830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
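For context, this assertion is the generic out-of-range index check in the CUDA scatter/gather kernels, and because CUDA ops run asynchronously, the Python traceback points at a later call (here loss.backward()) rather than the op that actually failed. A minimal snippet, not taken from the SSD code, that triggers the same class of device-side assert:

import torch

# An index outside [0, size) in a CUDA scatter fires the same
# `indexValue >= 0 && indexValue < tensor.sizes[dim]` assertion.
src = torch.zeros(4, device='cuda')
idx = torch.tensor([10], device='cuda')  # out of range for a dim of size 4
src.scatter_(0, idx, 1.0)                # launches the failing kernel
torch.cuda.synchronize()                 # the async error surfaces here

Running the script as CUDA_LAUNCH_BLOCKING=1 python tools/train.py ... makes kernel launches synchronous, so the traceback then points at the op that actually failed.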
I get the same error when using a single GPU! I am using PyTorch 1.2, and I have checked the data labels many times to make sure the data is OK.
Besides, sometimes training runs for a few iterations and then suddenly breaks down with the same error. It is truly weird.
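In case it helps others who want to rule out bad annotations, here is a sketch of a pre-flight label check (dataset is assumed to be an mmdetection dataset instance and num_classes is an assumption matching your config):

import numpy as np

# Hypothetical sanity check: every ground-truth label must be a valid
# class index, or the scatter inside the loss goes out of range.
for i in range(len(dataset)):
    labels = dataset.get_ann_info(i)['labels']
    bad = (labels < 0) | (labels >= num_classes)
    if np.any(bad):
        print(f'sample {i} has out-of-range labels: {labels[bad]}')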
When you are using a single GPU, you need to modify the learning rate accordingly.
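For reference, the usual rule is linear scaling: if the default learning rate in the config was tuned for 8 GPUs, divide it by 8 for a single GPU. A sketch of the change in ssd300_coco.py (the base value 2e-3 is an assumption; check the optimizer section of your own config):

# Assumed default, tuned for 8 GPUs: lr=2e-3.
# For a single-GPU run, scale the learning rate down linearly.
optimizer = dict(type='SGD', lr=2e-3 / 8, momentum=0.9, weight_decay=5e-4)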
Thanks for your reply. Can you give me some details about the suggestion? It seems I might be able to solve this tonight.
Please refer to the documentation; the relevant note is marked as Important.
Thanks, I will try it.