Maskrcnn-benchmark: CUDA RuntimeError during losses.backward()

Created on 14 Feb 2019 · 6 comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

Hi. Thank you for your great efforts.

I'm trying to use my own dataset which only has a single class.
I implemented maskrcnn_benchmark/data/datasets/mydata.py following the README,
set ROI_BOX_HEAD.NUM_CLASSES=2, and updated some other files accordingly.
(I omit the detailed description since the error can be reproduced more simply, as described below.)

Error message

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch-nightly_1549566635986/work/aten/src/THC/THCBlas.cu:259

How to reproduce

  1. Build the Docker image & set up maskrcnn-benchmark
    Because of #167, I commented out https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/docker/Dockerfile#L51
    Then I ran nvidia-docker build -t maskrcnn-benchmark docker/
    Afterward, I ran python setup.py build develop inside the container

  2. Add this line
    target.extra_fields['labels'].clamp_(0, 1)
    above here:
    https://github.com/facebookresearch/maskrcnn-benchmark/blob/13b4f82efd953276b24ce01f0fd1cd08f94fbaf8/maskrcnn_benchmark/data/datasets/coco.py#L91-L96
    This reduces the 80 COCO classes to a single foreground class

  3. Place the COCO dataset in /maskrcnn-benchmark/datasets/coco

  4. Run (single GPU)

python tools/train_net.py \
    --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" \
    MODEL.ROI_BOX_HEAD.NUM_CLASSES 2 \
    SOLVER.IMS_PER_BATCH 2
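The clamp in step 2 maps every positive COCO category id to a single foreground id while leaving 0 as background. The same mapping, written out in plain Python for clarity (collapse_to_foreground is a hypothetical name, not something from the repo; the real call is the in-place tensor method clamp_):

```python
# Illustration of what labels.clamp_(0, 1) does to COCO category ids.
# collapse_to_foreground is a hypothetical helper, not part of the repo.
def collapse_to_foreground(labels):
    # background (0) stays 0; every positive class id becomes 1
    return [min(max(label, 0), 1) for label in labels]

print(collapse_to_foreground([0, 1, 17, 80]))  # [0, 1, 1, 1]
```

min(max(x, lo), hi) is exactly what an element-wise clamp to [lo, hi] computes.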

I found that increasing NUM_CLASSES allows a few iterations to run successfully before the crash.
What am I missing? Please help!
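For context on why NUM_CLASSES 2 is the right setting for one foreground class: the box head sizes its classification and regression outputs from this value, and background counts as a class. A simplified sketch of that relationship (my own illustration, not the repo's actual predictor code), assuming the usual class-specific regression layout:

```python
# Simplified sketch (not the repo's code) of how NUM_CLASSES sizes the
# box head: one score per class (background included) and four
# box-regression values per class.
def box_head_output_sizes(num_classes):
    cls_logits = num_classes        # includes the background class
    bbox_deltas = num_classes * 4   # (dx, dy, dw, dh) per class
    return cls_logits, bbox_deltas

print(box_head_output_sizes(2))  # (2, 8)
```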



All 6 comments

Hi,

I think you might have a buggy version of PyTorch.
Are you using a PyTorch nightly? If yes, which version?

Also, can you try installing either 1.0.0 or 1.0.1?

Yes, it turned out to be a problem with pytorch-nightly.
I was using it because the Dockerfile installs it.

The previous version of PyTorch was

>>> torch.__version__
'1.0.0.dev20190207'

After I replaced this line
https://github.com/facebookresearch/maskrcnn-benchmark/blob/327bc29bcc4924e35bd61c59877d5a1d25bb75af/docker/Dockerfile#L35
with
RUN conda install -y pytorch -c pytorch
the problem was resolved.

Now the PyTorch version is

>>> torch.__version__
'1.0.1.post2'

Thanks a lot!
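As a side note for anyone comparing these two version strings: nightly builds embed a .devYYYYMMDD segment, while release builds use a plain version or a .postN suffix. A small, hypothetical helper to tell them apart:

```python
# Hypothetical helper: classify a torch version string as a nightly/dev
# build (the kind that produced the cublas error here) or a release build.
def is_nightly_build(version):
    return ".dev" in version

print(is_nightly_build("1.0.0.dev20190207"))  # True  (nightly from the Dockerfile)
print(is_nightly_build("1.0.1.post2"))        # False (release that resolved it)
```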

Cool, great! For your information, the latest PyTorch master has fixed the aforementioned problem.

For anyone else running into this issue, I was able to solve it by installing the latest pytorch (pytorch-nightly still seemed to give this error, even recent versions, but I may have just been screwing something up), and then reinstalling this library. I didn't reinstall the library at first, and it caused some headaches.

In short,

conda uninstall pytorch-nightly
conda install pytorch -c pytorch
cd path/to/maskrcnn-benchmark
rm -rf build # Remove the previous build files
rm -rf maskrcnn_benchmark.egg-info # Remove metadata about the previous build
python setup.py build develop

Again, for me, it's important to run that last step. Seems kinda obvious in hindsight, but not at the time :)

Yes, whenever we update PyTorch we need to recompile maskrcnn-benchmark, which generally involves removing the build folder as well.

Great reminder! I did that reflexively, but forgot to note it. Just added it to the steps above, along with the .egg-info file.

