I'm sorry to bother you, but I hit this issue while training on my custom dataset.
CUDA: 10.0
I am using the latest version. Here is the error output:
Model Summary: 222 layers, 6.1626e+07 parameters, 6.1626e+07 gradients
Epoch Batch xy wh conf cls total nTargets time
0/272 0/1266 7.04 2.11 302 17.3 328 16 2.03
0/272 1/1266 7.24 2.27 302 17.5 329 16 0.713
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long * , Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "train.py", line 313, in <module>
multi_scale=opt.multi_scale,
File "train.py", line 190, in train
loss, loss_items = compute_loss(pred, targets, model)
File "/home/star/Wayne/transportation/ygc/yolov3/utils/utils.py", line 278, in compute_loss
lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i]) # xy loss
File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 443, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2257, in mse_loss
ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f2ff22a8441 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f2ff22a7d7a in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x13652 (0x7f2ff01e5652 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x50 (0x7f2ff2298ce0 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x30facb (0x7f2ff0b81acb in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #5: <unknown function> + 0x1423ab (0x7f30315b93ab in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6c0a41 (0x7f3031b37a41 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3031b37b82 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0xa2 (0x7f3031595d82 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x6b598b (0x7f3031b2c98b in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12fe67 (0x7f30315a6e67 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1300be (0x7f30315a70be in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf0 (0x7f30411be830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
I train on my dataset with the following command:
python train.py --cfg cfg/yolov3.cfg --data data/coco.data --multi-scale
I have modified yolov3.cfg and coco.data.
What confuses me is this: when I run python train.py without specifying a cfg file (coco.data is the default, and I have modified it), training works, but when I modify only the classes value in the cfg file, I hit this issue.
Hello, thank you for your interest in our work! It sounds like you have incorrectly configured your cfg file. Please note that most technical problems are due to:
git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
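Since the report says only the classes value in the cfg was changed, one thing worth checking is the usual Darknet YOLOv3 convention that the [convolutional] section directly before each [yolo] section has filters = (classes + 5) * (number of anchors in mask), which is 255 for 80 classes and 3 anchors. The sketch below is a hypothetical helper (check_yolo_cfg is not part of this repository) that parses cfg text and reports sections where the two values disagree. It assumes the plain Darknet cfg layout and does not show that this mismatch is what caused the assertion above.

```python
import re


def check_yolo_cfg(cfg_text):
    """Return (yolo_index, classes, expected_filters, actual_filters) per mismatch."""
    # Parse the cfg into a list of (section_name, {key: value}) blocks.
    blocks = []
    for raw in cfg_text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        header = re.fullmatch(r'\[(\w+)\]', line)
        if header:
            blocks.append((header.group(1), {}))
        elif '=' in line and blocks:
            key, value = line.split('=', 1)
            blocks[-1][1][key.strip()] = value.strip()

    problems = []
    yolo_index = 0
    for i, (name, opts) in enumerate(blocks):
        if name != 'yolo':
            continue
        classes = int(opts['classes'])
        n_anchors = len(opts['mask'].split(','))
        expected = (classes + 5) * n_anchors
        # Find the nearest [convolutional] section before this [yolo] section.
        prev = next((o for n, o in reversed(blocks[:i]) if n == 'convolutional'), None)
        actual = int(prev['filters']) if prev else None
        if actual != expected:
            problems.append((yolo_index, classes, expected, actual))
        yolo_index += 1
    return problems
```

A possible use is `check_yolo_cfg(open('cfg/yolov3.cfg').read())`, with the path taken from the training command above; an empty list means every [yolo] section is consistent with the convolution that feeds it.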
I'm facing a similar issue, but when computing mAP.
This seems to be a PyTorch error related to out-of-bounds values passed to loss functions, i.e. a value outside the 0-1 range passed to BCE.
https://github.com/pytorch/pytorch/issues/5560
https://github.com/pytorch/pytorch/issues/14519
It also appears multiple times in this repository. I will reopen the issue since it does not seem to be resolved.
The two PyTorch issues seem to stem from BCELoss inputs falling outside the required 0-1 range. However, this repo does not use BCELoss (it uses BCEWithLogitsLoss, which is not input-constrained), so I am quite mystified. I've also never encountered this error myself.
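The assertion in the original traceback, `t >= 0 && t < n_classes`, is raised by the class NLL criterion when a target class index falls outside [0, n_classes), so it may be worth ruling out labels whose class id is too large for the configured number of classes. The sketch below assumes Darknet-style label lines of the form `class x_center y_center width height`; find_bad_class_ids is a hypothetical helper, not something from this repository.

```python
def find_bad_class_ids(label_lines, n_classes):
    """Return (line_number, class_id) for every label outside [0, n_classes)."""
    bad = []
    for line_number, line in enumerate(label_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The first field is the class index; tolerate values written as "3.0".
        class_id = int(float(fields[0]))
        if not 0 <= class_id < n_classes:
            bad.append((line_number, class_id))
    return bad
```

Running this over each label file with n_classes set to the classes value from the cfg would flag, for example, a label with class 3 in a 3-class dataset, where the valid indices are 0, 1, and 2.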
I hit the same issue when computing mAP.
Run on CPU to see the underlying error message.
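As a rough illustration of why running on CPU helps: the same kind of out-of-range class target that triggers an opaque device-side assert on the GPU raises an ordinary Python exception with a readable message on the CPU. The sketch below is an assumption-laden stand-in, not the repository's loss code. It uses `torch.nn.functional.cross_entropy` on a made-up 3-class example, and the helper name cpu_error_for_bad_target is hypothetical. It returns None if PyTorch is not installed.

```python
def cpu_error_for_bad_target(n_classes=3, bad_target=3):
    """Return the CPU error message for an out-of-range class target."""
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return None  # PyTorch not available; nothing to demonstrate

    logits = torch.zeros(1, n_classes)      # one sample, n_classes scores, on CPU
    target = torch.tensor([bad_target])     # valid indices are 0 .. n_classes - 1
    try:
        F.cross_entropy(logits, target)
    except (IndexError, RuntimeError) as err:
        # The exception type and wording differ between PyTorch versions.
        return str(err)
    return ''  # no error was raised


if __name__ == '__main__':
    print(cpu_error_for_bad_target())
```

For a full training run, hiding the GPUs with the environment variable `CUDA_VISIBLE_DEVICES=""` is one common way to force CPU execution, though whether train.py then falls back to CPU depends on how it selects its device.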
Your dataset has problems, such as x1 > x2, y1 > y2, x2 > img_w, or y2 > img_h ...
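A minimal sketch of such a dataset check follows, assuming boxes are stored as pixel coordinates (x1, y1, x2, y2). The helper name find_bad_boxes is hypothetical, and the negative-coordinate test is an extra assumption that goes beyond the four conditions named above.

```python
def find_bad_boxes(boxes, img_w, img_h):
    """Return (index, reasons) for every box that fails a sanity check."""
    bad = []
    for index, (x1, y1, x2, y2) in enumerate(boxes):
        reasons = []
        if x1 > x2:
            reasons.append('x1 > x2')
        if y1 > y2:
            reasons.append('y1 > y2')
        if x2 > img_w:
            reasons.append('x2 > img_w')
        if y2 > img_h:
            reasons.append('y2 > img_h')
        if x1 < 0 or y1 < 0:
            # Assumption: negative corners are also treated as invalid.
            reasons.append('negative coordinate')
        if reasons:
            bad.append((index, reasons))
    return bad
```

Calling it once per image with that image's width and height gives a list of offending annotations to fix or drop before training.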