```
(open-mmlab_ldh5) ➜ mmdetection git:(master) ✗ CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rpc/faster_rcnn_r50_fpn_1x.py 4 --validate
2019-05-24 20:08:24,708 - INFO - Distributed training: True
2019-05-24 20:08:25,313 - INFO - load model from: modelzoo://resnet50
2019-05-24 20:08:25,611 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer2.2.bn1.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer4.1.bn1.num_batches_tracked,
layer2.0.downsample.1.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer4.2.bn1.num_batches_tracked,
layer3.5.bn3.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked,
layer3.4.bn3.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.1.bn1.num_batches_tracked,
layer2.0.bn3.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked,
layer2.3.bn3.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.1.bn2.num_batches_tracked, bn1.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked,
layer3.5.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked,
layer3.2.bn2.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=202.67s)
creating index...
index created!
Done (t=254.98s)
creating index...
index created!
Done (t=278.15s)
creating index...
Done (t=279.31s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=1.17s)
creating index...
index created!
Done (t=1.26s)
creating index...
index created!
Done (t=1.36s)
creating index...
index created!
Done (t=1.82s)
creating index...
index created!
2019-05-24 20:13:14,064 - INFO - Start running, host: ices@ices-SYS-4028GR-TR, work_dir: /home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2019-05-24 20:13:14,065 - INFO - workflow: [('train', 1)], max: 12 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 95, in <module>
    main()
  File "./tools/train.py", line 91, in main
    logger=logger)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 59, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 258, in train
    for i, data_batch in enumerate(data_loader):
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/util.py", line 420, in spawnv_passfds
    False, False, None)
OSError: [Errno 12] Cannot allocate memory
```
My dataset is in COCO format and the JSON files include "segmentation" data. The training JSON file is 7.0 GB and there are 100,000 images (image size 1851×1851). When I train the model, it cannot load the dataset and the above error appears.
My server has 252 GB of RAM. The GPUs are GeForce GTX 1080 Ti with 11178 MiB of memory each.
I would like to ask: is all of the data loaded into memory at once during training?
If the data is too big, how should I train?
I hope someone can help me solve this problem, thanks.
For each dataloader worker process, the whole annotation (JSON) file is loaded into memory. You may reduce the number of workers (the default is 2 * GPU_NUM) and disable validation during training.
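If it helps, here is a minimal sketch of where those settings live, assuming the v1-style config layout; the dataset paths below are placeholders, so keep your own `train`/`val` entries from `./configs/rpc/faster_rcnn_r50_fpn_1x.py`:

```python
# excerpt of ./configs/rpc/faster_rcnn_r50_fpn_1x.py (sketch; dataset paths are placeholders)
data = dict(
    imgs_per_gpu=1,
    # Fewer dataloader workers means fewer processes that each hold the 7 GB
    # annotation file in memory. The mmdetection default is workers_per_gpu=2.
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='data/rpc/annotations/instances_train.json',  # placeholder path
        img_prefix='data/rpc/train/'),                          # placeholder path
    val=dict(
        type='CocoDataset',
        ann_file='data/rpc/annotations/instances_val.json',     # placeholder path
        img_prefix='data/rpc/val/'))
```

Then launch without the `--validate` flag, e.g. `CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rpc/faster_rcnn_r50_fpn_1x.py 4`, so the validation annotations are not loaded by every training process as well.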
OK, I'm going to try that.
I want to know how to modify the worker number, because I set imgs_per_gpu=1 and workers_per_gpu=1 in the config file, but the number of workers still seems unchanged. @hellock
I found that the problem occurred because the CPU memory usage was too high on my machine!
Kill some other processes!
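Before killing things blindly, it can help to see which processes are actually holding the memory. A small sketch, assuming the third-party `psutil` package is available (`pip install psutil`); `free -h` or `top` show the same information:

```python
# mem_top.py -- hypothetical helper: print the 10 processes with the largest resident memory
import psutil

procs = []
for p in psutil.process_iter(['pid', 'name', 'memory_info']):
    info = p.info
    if info['memory_info'] is None:  # access to this process was denied
        continue
    procs.append((info['memory_info'].rss, info['pid'], info['name']))

for rss, pid, name in sorted(procs, reverse=True)[:10]:
    print('{:8.2f} GiB  pid={:<7} {}'.format(rss / 1024 ** 3, pid, name))
```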
I have the same problem when loading the COCO dataset, but it used to work fine with mmdet v1.

I found the same problem and solved it by expanding the swap space.
Step-by-step solution:
1. Create a swap file of the desired size, e.g. 2 GB: `dd if=/dev/zero of=/var/swap bs=1024 count=2048000`
2. Set up the swap file: `mkswap /var/swap`
3. Activate the swap file: `swapon /var/swap`
Good luck!
@mzk665
Thanks for providing this solution.
However, I cannot activate the swap file. The error message is as follows:
`root@mmdetection20200628:/mmdetection# swapon /var/swap`
`swapon: /var/swap: swapon failed: Operation not permitted`
Perhaps the Docker settings are the reason I can't activate the file. Are you working in a Docker environment? Do you know how to solve this problem?