Hi,
I'm trying your code but when I run:
CUDA_VISIBLE_DEVICES=$GPU_ID python trainval_net.py --dataset pascal_voc --net vgg16 --cuda --bs $BATCH_SIZE
for the training I have this error:
from model.utils.cython_bbox import bbox_overlaps
ImportError: No module named cython_bbox
I have just installed Cython with:
sudo pip install cython
Thanks in advance
Did you cd into the lib directory and run the command sh make.sh?
Yes of course but it doesn't work.
Hi, @Capuz93 ,
after compiling, can you find "cython_bbox.so" in folder lib/model/utils?
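A quick way to run this check from Python is sketched below. The helper name find_compiled_extension is made up for illustration. The extra glob patterns are an assumption that covers Python 3 builds, where the compiled file usually carries a platform tag in its name (for example cython_bbox.cpython-36m-x86_64-linux-gnu.so).

```python
import glob
import os


def find_compiled_extension(utils_dir, name="cython_bbox"):
    """Return the compiled builds of `name` found in utils_dir (possibly empty)."""
    # Hypothetical helper: plain and platform-tagged file names are both tried.
    patterns = [name + ".so", name + ".*.so", name + ".pyd", name + ".*.pyd"]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(utils_dir, pattern)))
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the question above; run this from the repository root.
    matches = find_compiled_extension(os.path.join("lib", "model", "utils"))
    print(matches if matches else "cython_bbox was not built; check the make.sh output")
```

An empty result means the build step failed before producing the module, so the ImportError comes from the build and not from the Cython installation itself.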
Hi @jwyang .
No, there isn't a "cython_bbox.so". I noticed that when I run the command "sh make.sh" I get this error at the beginning:
Traceback (most recent call last):
File "setup.py", line 59, in <module>
CUDA = locate_cuda()
File "setup.py", line 54, in locate_cuda
raise EnvironmentError('The CUDA %s path could not be located in %s' % (k, v))
EnvironmentError: The CUDA lib64 path could not be located in /usr/lib64
After this error the rest seems to compile correctly.
Maybe this is the reason. How can I fix it?
Hi, @Capuz93 ,
it seems that setup.py did not find your cuda. where did you install your cuda?
@Capuz93
I have updated setup.py by commenting unused lines. Update it and try again. If you have CUDA installed on your machine, it should work.
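For readers who hit the same EnvironmentError, here is a minimal sketch of how a CUDA root could be located and sanity-checked before building. It is not the repository's setup.py. The function names are hypothetical, the CUDA_HOME / CUDAHOME variable names are a common convention assumed here, and shutil.which requires Python 3 (the thread itself uses Python 2.7).

```python
import os
from shutil import which  # Python 3.3+


def find_cuda_home(env=None):
    """Return a likely CUDA root directory, or None if nothing is found."""
    env = os.environ if env is None else env
    # 1. An explicitly set environment variable wins (assumed variable names).
    for var in ("CUDA_HOME", "CUDAHOME"):
        if env.get(var):
            return env[var]
    # 2. Otherwise derive the root from nvcc's location: <root>/bin/nvcc.
    nvcc = which("nvcc", path=env.get("PATH", ""))
    if nvcc:
        return os.path.dirname(os.path.dirname(os.path.realpath(nvcc)))
    return None


def missing_cuda_dirs(home, names=("bin", "include", "lib64")):
    """List the expected subdirectories that do not exist under `home`."""
    return [n for n in names if not os.path.isdir(os.path.join(home, n))]


if __name__ == "__main__":
    home = find_cuda_home()
    print("CUDA root:", home)
    if home is not None:
        print("missing:", missing_cuda_dirs(home))
```

If nvcc lives in /usr/bin, the derived root is /usr, and a check for /usr/lib64 can fail on distributions that keep libraries elsewhere. That would be consistent with the message above.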
@jwyang Thanks very much.
Now it works.
Does this code work on TitanX GPU only? Because I have NVIDIA GEFORCE but when I try to train the model I have this error:
cuda runtime error (38) : no CUDA-capable device is detected
N.B.: If you prefer I could open a new Issue and close this one.
It should not. I tried on Titan Xp, Titan X and 980 Ti as well. What is your exact command for training?
I forgot to set GPU_ID.
Then I've run this command:
CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset pascal_voc --net vgg16 --cuda --bs 1
But now I have another error:
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /pytorch/torch/lib/THC/THCTensorCopy.cu:204
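A note on the earlier error 38: when $GPU_ID is unset, the shell expands the first command to CUDA_VISIBLE_DEVICES= (an empty value), which hides every GPU from the process. The sketch below shows one way to tell the "unset" and "empty" cases apart before starting training. The helper name visible_gpu_ids is invented for illustration.

```python
import os


def visible_gpu_ids(env=None):
    """Interpret CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction applied),
    or a list of device id strings; an empty list means no GPU is visible.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: all GPUs are visible")
    elif not ids:
        print("CUDA_VISIBLE_DEVICES is empty: no GPU is visible to this process")
    else:
        print("Visible GPUs:", ", ".join(ids))
```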
It seems to be a PyTorch error. Have you ever run CUDA training with other code successfully? Which line of code triggers this error?
@jwyang Yes. I expected to find out an error like "out of memory" because I don't have enough memory on my computer, but this error is strange.
The complete error is this:
/faster-rcnn.pytorch-master/lib/model/rpn/rpn.py:66: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCTensorCopy.cu line=204 error=48 : no kernel image is available for execution on the device
Traceback (most recent call last):
File "trainval_net.py", line 316, in <module>
_, cls_prob, bbox_pred, rpn_loss, rcnn_loss = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/faster_rcnn/faster_rcnn_cascade.py", line 51, in forward
rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/rpn/rpn.py", line 76, in forward
im_info, cfg_key))
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/rpn/proposal_layer.py", line 148, in forward
keep_idx_i = keep_idx_i.long().view(-1)
File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 51, in long
return self.type(type(self).__module__ + '.LongTensor')
File "/usr/local/lib/python2.7/dist-packages/torch/cuda/__init__.py", line 370, in type
return super(_CudaBase, self).type(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/_utils.py", line 38, in _type
return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /pytorch/torch/lib/THC/THCTensorCopy.cu:204
Hi, @Capuz93 , could you set a breakpoint at line 148 to see whether there is something wrong with keep_idx_i? This error is weird.
Hi @jwyang , I tried to print keep_idx_i and I get this output:
0
1
2
⋮
11997
11998
11999
[torch.cuda.IntTensor of size 12000x1 (GPU 0)]
Do you have any idea about this error?
Hi, @Capuz93 , this looks good. Did you try debugging step by step?
Sorry for the delay, I have been busy these days.
Which version of CUDA do you use? Because maybe the error is due to the cuda version.
I am using CUDA 8.0. Pytorch 0.2.0. I also tried Pytorch 0.3.0. I should have posted these requirements on the readme.
Ok because I'm using CUDA 9.0 and maybe the error is due to this.
I'll try with CUDA 8.0.
Thanks
I also tried with CUDA 8.0 but it doesn't work. I get the same error (no kernel image is available for execution on the device).
Do you have any idea about this issue?
Could you reinstall your pytorch? And recompile all the libs.
I reinstalled my pytorch and recompiled all the libs and I had this error:
invalid device function
After some searching I understood that the problem could be the -arch value in make.sh, so I tried changing it from sm_52 to sm_20 (I'm not sure that is the correct value for my GPU), but now I have this new error:
/faster-rcnn.pytorch-master/lib/model/rpn/rpn.py:66: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
Traceback (most recent call last):
File "trainval_net.py", line 316, in <module>
_, cls_prob, bbox_pred, rpn_loss, rcnn_loss = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/faster_rcnn/faster_rcnn_cascade.py", line 51, in forward
rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/rpn/rpn.py", line 85, in forward
rpn_data = self.RPN_anchor_target((rpn_cls_score.data, gt_boxes, im_info, num_boxes))
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/faster-rcnn.pytorch-master/lib/model/rpn/anchor_target_layer.py", line 149, in forward
positive_weights = 1.0 / num_examples
ZeroDivisionError: float division by zero
Hi, @Capuz93 ,
This error is also weird; I have never encountered it. It indicates that all region proposals have small overlaps with the ground truth. One possible reason is that there is something wrong with the ground-truth bounding boxes. Did you check that the loaded data is correct?
I had that error because I had changed the RPN_BATCHSIZE and BATCH_SIZE values from 256 to 1 in the file "faster-rcnn.pytorch-master/cfgs/vgg16.yml" because I have a memory problem.
If I set RPN_BATCHSIZE to 256 and BATCH_SIZE to 1 or 2 I have this error:
ValueError: result of slicing is an empty tensor
If I leave 256 for both RPN_BATCHSIZE and BATCH_SIZE I have a memory error (out of memory).
How can I change those values to work around my memory problem?
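One plausible reading of these failures, offered only as an assumption about how such samplers are usually written: if the foreground quota is computed as a fixed fraction of the batch size and truncated to an integer, a batch size of 1 or 2 gives a quota of zero, and code that later slices or divides by that count breaks. The function name and the 0.25 fraction below are illustrative and are not taken from the repository's config.

```python
def sample_quotas(batch_size, fg_fraction=0.25):
    """Split a per-image sample budget into (foreground, background) quotas."""
    # Assumed scheme: the foreground share is truncated toward zero.
    fg = int(fg_fraction * batch_size)
    bg = batch_size - fg
    return fg, bg


if __name__ == "__main__":
    for bs in (1, 2, 4, 256):
        print(bs, sample_quotas(bs))
```

Under this assumption the foreground quota is 0 for batch sizes 1 and 2, becomes 1 at batch size 4, and is 64 at batch size 256.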
I see. I think you can change 256 to 64 for both batch sizes. I will try to reproduce this error with your batch size settings and look for a solution.
I have out of memory error if I set batch size bigger than 2.
I'll wait for your suggestions.
Thanks
OK, I will work on that. However, if your GPU cannot hold a batch size bigger than 2, I think it will be hard for you to train a good detection model :). What kind of GPU are you using?
I know. As I said, I'm trying your model to understand Faster R-CNN better, and I will then run it on a more powerful machine.
Right now I'm using an NVIDIA GeForce 920M and I'm not sure about the -arch value in make.sh in the lib folder. How can I find out which value I should set for my GPU? You have -arch=sm_52; could you explain why?
this would be a good guide:
http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
I will add this information to the readme.
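As a rough illustration of what the guide describes: the number in sm_XY is the GPU's compute capability X.Y, so sm_52 targets capability 5.2 (Maxwell cards such as the 980 Ti and Titan X mentioned earlier). The sketch below turns capability pairs into nvcc -gencode flags. The helper name gencode_flags is hypothetical, and the capability values must be looked up for your own card; recent PyTorch versions expose them through torch.cuda.get_device_capability().

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode flags from (major, minor) compute capabilities."""
    flags = []
    for major, minor in capabilities:
        cc = "{}{}".format(major, minor)
        flags.append("-gencode arch=compute_{0},code=sm_{0}".format(cc))
    return " ".join(flags)


if __name__ == "__main__":
    # Example values only; replace them with the capability of your own GPU.
    print(gencode_flags([(5, 2), (6, 1)]))
```

Listing several capabilities compiles the kernels for each of them, which avoids "invalid device function" errors when the binary is run on a different card. Every listed architecture must also be supported by the installed nvcc.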
Hi, @Capuz93, I have updated the make.sh in lib folder. You can try to recompile the cuda libraries and it is supposed to work well directly.
I will check what will happen if batch size is very small, e.g., 1 or 2
Ok thanks.
I've downloaded the new make file but I have the same error:
ValueError: result of slicing is an empty tensor
I'll wait for your news.
yeah, I have not yet solved this problem, please give me some time.
Hi, @Capuz93 , I have modified proposal_target_layer_cascade.py to adapt to very tiny batch training. Now you can set the batch size to 2. Give it a try.
I see now that when I run the new make file with:
sh make.sh
I have this error:
nvcc fatal : Unsupported gpu architecture 'sm_60'
but after this error the program continues compiling.
I show you the output of "sh make.sh" in the file attached.
output.txt
However I tried with the old make file setting -arch value to sm_52 according to the link you sent me yesterday but without success because I had this error:
cuda runtime error: invalid device function
You might need to modify make.sh: you should remove sm_60.
I changed the make file and I have a warning but I think it's ok:
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Then when I run the program setting BATCH_SIZE=2 and RPN_BATCHSIZE=2 I have "out of memory" error; when I run the program setting BATCH_SIZE=1 and RPN_BATCHSIZE=2 I have this error:
File "/home/Scrivania/TESI/FASTER_RCNN/2/faster-rcnn.pytorch-master/lib/model/rpn/proposal_target_layer_cascade.py", line 168, in _sample_rois_pytorch
rand_num = torch.from_numpy(rand_num).type_as(gt_boxes).long()
RuntimeError: the given numpy array has zero-sized dimensions. Zero-sized dimensions are not supported in PyTorch
The minimal batch size should be 2. If you want to make it work, you can reduce the image size and use resnet instead of vgg16.