Hi. Thank you for your great efforts.
I'm trying to use my own dataset which only has a single class.
I implemented my own dataset module (e.g. maskrcnn/data/datasets/mydata.py) following the README,
set ROI_BOX_HEAD.NUM_CLASSES=2, and updated the other relevant files accordingly.
(Detailed description is omitted since the error can be simply reproduced in a different way as described below)
```
Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch-nightly_1549566635986/work/aten/src/THC/THCBlas.cu:259
```
Build the Docker image & set up maskrcnn-benchmark:
Due to #167, I commented out https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/docker/Dockerfile#L51
and ran:
```
nvidia-docker build -t maskrcnn-benchmark docker/
```
Afterward, I ran `python setup.py build develop` inside the container.
Add this line
```
target.extra_fields['labels'].clamp_(0, 1)
```
above here:
https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/maskrcnn_benchmark/data/datasets/coco.py#L91-L96
This collapses the 80 COCO classes into a single foreground class.
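The effect of `clamp_(0, 1)` can be sketched in plain Python (the real code operates in-place on a torch `LongTensor` of category ids; `collapse_labels` below is a hypothetical stand-in for illustration):

```python
# Plain-Python sketch of what clamp_(0, 1) does to the label values:
# every COCO category id (1..80) becomes 1 (foreground); 0 stays background.
def collapse_labels(labels, lo=0, hi=1):
    """Clamp each label into [lo, hi], like tensor.clamp_(lo, hi)."""
    return [min(max(label, lo), hi) for label in labels]

print(collapse_labels([0, 3, 17, 80]))  # -> [0, 1, 1, 1]
```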
Place COCO dataset in /maskrcnn-benchmark/datasets/coco
Run (single GPU):
```
python tools/train_net.py \
    --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
    SOLVER.IMS_PER_BATCH 2
```
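For context on why the value is 2: in maskrcnn-benchmark, `NUM_CLASSES` counts the background class (index 0) in addition to the foreground categories. A minimal sketch of that accounting:

```python
# NUM_CLASSES = foreground categories + 1 for background (class 0),
# so a single-class dataset needs 2, and COCO's 80 categories give the
# config's default of 81.
def num_classes(num_foreground_categories):
    return num_foreground_categories + 1  # +1 for background

print(num_classes(1))   # -> 2  (this issue's single-class dataset)
print(num_classes(80))  # -> 81 (COCO default)
```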
I found that increasing NUM_CLASSES lets a few iterations run successfully before the error occurs.
What am I missing? Please help!
Hi,
I think you might have a buggy version of PyTorch.
Are you using a PyTorch nightly? If yes, which version?
Also, can you try installing either 1.0.0 or 1.0.1?
Yes, it turned out to be a problem with PyTorch-nightly.
I was using it because the Docker image installs it.
The previous PyTorch version was:
```
>>> torch.__version__
'1.0.0.dev20190207'
```
After I replaced this line
https://github.com/facebookresearch/maskrcnn-benchmark/blob/327bc29bcc4924e35bd61c59877d5a1d25bb75af/docker/Dockerfile#L35
with
```
RUN conda install -y pytorch -c pytorch
```
the problem was resolved.
Now the PyTorch version is:
```
>>> torch.__version__
'1.0.1.post2'
```
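As an aside, the two version strings above are easy to tell apart programmatically; a minimal sketch (the `.dev` segment is PEP 440's marker for development builds, not something specific to this thread):

```python
# Nightly builds carry a ".dev<date>" segment (PEP 440); release builds
# like "1.0.1.post2" do not.
def is_nightly_build(version):
    return ".dev" in version

print(is_nightly_build("1.0.0.dev20190207"))  # -> True
print(is_nightly_build("1.0.1.post2"))        # -> False
```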
Thanks a lot!
Cool, great! For your information, the latest PyTorch master has also fixed the aforementioned problem.
For anyone else running into this issue: I was able to solve it by installing the latest PyTorch release (pytorch-nightly still seemed to give this error, even recent versions, though I may have just been doing something wrong) and then reinstalling this library. I didn't reinstall the library at first, and that caused some headaches.
In short:
```
conda uninstall pytorch-nightly
conda install pytorch -c pytorch
cd path/to/maskrcnn-benchmark
rm -rf build                        # remove the previous build files
rm -rf maskrcnn_benchmark.egg-info  # remove metadata about the previous build
python setup.py build develop
```
Again, for me, it's important to run that last step. Seems kinda obvious in hindsight, but not at the time :)
Yes, whenever we update PyTorch we need to recompile maskrcnn-benchmark, which generally involves removing the build folder as well.
Great reminder! I did that reflexively but forgot to note it. I've just added it to the steps above, along with the .egg-info directory.