I am launching training with a pretrained model on a 2-class COCO-like dataset.
Steps to reproduce the behavior:
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
where myconfig.yaml points to mymodel.pth like this:
WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel"
And mymodel.pth is a pretrained model with the right keys deleted, as suggested in #15.
Training should start and complete.
PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch
Traceback (most recent call last):
File "tools/train_net.py", line 170, in <module>
main()
File "tools/train_net.py", line 163, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 31, in train
model.to(device)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
param.data = fn(param.data)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
If you use a single GPU to train a model, you should change IMS_PER_BATCH to be small enough (e.g. IMS_PER_BATCH=2).
As @zimenglan-sysu-512 pointed out, you are training on a single GPU with a batch size of 10, which is quite large in general. Try decreasing the batch size.
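Since the traceback above fails inside `model.to(device)`, before any batch has been loaded, it may also be worth checking how much GPU memory is already taken by other processes. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of this repository:

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV rows (values in MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; return None if that is not possible."""
    if shutil.which("nvidia-smi") is None:
        return None
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    try:
        return parse_gpu_memory(result.stdout)
    except ValueError:
        # Some drivers report non-numeric fields such as "[N/A]".
        return None


if __name__ == "__main__":
    gpus = query_gpu_memory()
    if gpus is None:
        print("Could not query nvidia-smi; the driver may be in a bad state.")
    else:
        for gpu in gpus:
            free = gpu["total_mib"] - gpu["used_mib"]
            print("GPU {}: {} MiB free of {} MiB".format(gpu["index"], free, gpu["total_mib"]))
```

If a GPU shows almost no free memory before training starts, lowering `SOLVER.IMS_PER_BATCH` alone is unlikely to help, because the model weights cannot be moved to the device in the first place.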
Actually, I also tried this command line (setting SOLVER.IMS_PER_BATCH to 1):
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
and I still get a similar error, which seems weird indeed:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
File "tools/train_net.py", line 170, in <module>
main()
File "tools/train_net.py", line 163, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 31, in train
model.to(device)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
param.data = fn(param.data)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
Can you run
import torch
print(torch.rand(1, device="cuda"))
in your interpreter?
Hm, interesting, it returns
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp line=51 error=30 : unknown error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch-nightly_1542964575207/work/aten/src/THC/THCGeneral.cpp:51
only
import torch
print(torch.rand(1, device="cpu"))
works
It looks like there is a problem with your setup / gpu. Maybe a reboot would help?
That was it, thank you!
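For anyone hitting the same error, the one-line check suggested above can be wrapped so that it reports the failure instead of raising, which makes it easy to run before a long training job. A rough sketch, assuming PyTorch is installed in the active environment; `cuda_smoke_test` is a hypothetical name:

```python
def cuda_smoke_test():
    """Return (ok, message) after trying a one-element allocation on the GPU."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        return False, "could not import torch: {}".format(exc)
    try:
        if not torch.cuda.is_available():
            return False, "torch.cuda.is_available() returned False"
        # Same check as in the thread: the first CUDA call triggers lazy initialisation.
        value = torch.rand(1, device="cuda")
        name = torch.cuda.get_device_name(torch.cuda.current_device())
        return True, "CUDA OK on {} (sample value {:.4f})".format(name, value.item())
    except (RuntimeError, AssertionError) as exc:
        # "cuda runtime error (30) : unknown error" would surface here as a RuntimeError.
        return False, "CUDA initialisation or allocation failed: {}".format(exc)


if __name__ == "__main__":
    ok, message = cuda_smoke_test()
    print(("PASS: " if ok else "FAIL: ") + message)
```

If this fails while the CPU version of the check works, the problem is in the driver or GPU state rather than in the training config, and a reboot (as in this thread) is a reasonable first thing to try.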