Maskrcnn-benchmark: Training on a pre-trained model: RuntimeError: CUDA error: out of memory

Created on 30 Nov 2018 · 8 Comments · Source: facebookresearch/maskrcnn-benchmark

🐛 Bug

I am launching training from a pretrained model on a 2-class, COCO-like dataset.

To Reproduce

Steps to reproduce the behavior:

  1. Run training with this command line

python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

where myconfig.yaml points to mymodel.pth like this:
WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel"
and mymodel.pth is a pretrained model with the appropriate keys deleted, as suggested in #15.

Expected behavior

Training should start and complete.

Environment

PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch

Returned Error

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
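
The traceback above fails inside model.to(device), that is, while the weights are being copied to the GPU and before any training batch is loaded. When that happens it can help to check how much memory is already in use on each GPU. The following is a minimal sketch, assuming nvidia-smi is installed and on PATH; the helper names parse_gpu_memory and query_gpu_memory are hypothetical and not part of maskrcnn-benchmark.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpu_memory():
    """Return per-GPU memory usage, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    try:
        return parse_gpu_memory(result.stdout)
    except ValueError:
        # nvidia-smi printed something that is not a plain integer (e.g. "[N/A]").
        return None


if __name__ == "__main__":
    usage = query_gpu_memory()
    if usage is None:
        print("nvidia-smi unavailable or failed; the driver may be in a bad state")
    else:
        for gpu in usage:
            print("GPU {index}: {used_mib} / {total_mib} MiB in use".format(**gpu))
```

If a GPU is already nearly full before training starts, another process is probably holding the memory. If nvidia-smi itself fails, the problem is more likely in the driver than in the training configuration.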

All 8 comments

If you use a single GPU to train a model, you should set IMS_PER_BATCH to a small enough value (e.g. IMS_PER_BATCH=2).

As @zimenglan-sysu-512 pointed out, you are training on a single GPU with a batch size of 10, which is quite large in general. Try decreasing the batch size.
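
When the batch size is reduced, the learning rate and schedule are usually rescaled with it (the linear scaling rule). The sketch below assumes a reference schedule of 16 images per batch, a base learning rate of 0.02, 90000 iterations and steps at (60000, 80000). These reference values are an assumption about the project's multi-GPU defaults, and scale_schedule is a hypothetical helper, not part of maskrcnn-benchmark.

```python
def scale_schedule(ims_per_batch, ref_batch=16, ref_lr=0.02,
                   ref_max_iter=90000, ref_steps=(60000, 80000)):
    """Scale the LR linearly with batch size and stretch the schedule inversely.

    The reference values are assumed defaults; replace them with the values
    from the config file actually being used.
    """
    if ims_per_batch <= 0:
        raise ValueError("ims_per_batch must be positive")
    factor = ref_batch / ims_per_batch  # how many times smaller the new batch is
    return {
        "SOLVER.IMS_PER_BATCH": ims_per_batch,
        "SOLVER.BASE_LR": ref_lr / factor,
        "SOLVER.MAX_ITER": round(ref_max_iter * factor),
        "SOLVER.STEPS": tuple(round(step * factor) for step in ref_steps),
    }


if __name__ == "__main__":
    for key, value in scale_schedule(2).items():
        print(key, value)
```

Under these assumptions, scale_schedule(2) gives BASE_LR 0.0025, MAX_ITER 720000 and STEPS (480000, 640000), the same values used in the command line from the issue. In other words, that command's learning rate and schedule correspond to a batch of 2 images rather than 10.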

Actually, I also tried this command line (setting SOLVER.IMS_PER_BATCH to 1):
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
and I still get a similar error, which seems odd:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

Can you run

import torch
print(torch.rand(1, device="cuda"))

in your interpreter?

Hm, interesting, it returns

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51

Only

import torch
print(torch.rand(1, device="cpu"))

works

It looks like there is a problem with your setup or your GPU. Maybe a reboot would help?

That was it, thank you!
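
The check used in this thread can be wrapped in a small script and run before launching a long training job. This is a minimal sketch; cuda_smoke_test and its status strings are hypothetical names chosen for illustration.

```python
def cuda_smoke_test():
    """Return a short status string saying whether a tiny CUDA allocation works."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-unavailable"
    try:
        # Same one-element allocation as suggested in the thread.
        torch.rand(1, device="cuda")
    except RuntimeError as err:
        return "cuda-broken: {}".format(err)
    return "ok"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If allocating a single float on the GPU already fails, the batch size is not the cause. The problem lies in the driver or runtime state, which in this thread was resolved by a reboot.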
