```
(open-mmlab_ldh5) ➜ mmdetection git:(master) ✗ CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rpc/faster_rcnn_r50_fpn_1x.py 4 --validate
2019-05-24 20:08:24,708 - INFO - Distributed training: True
2019-05-24 20:08:25,313 - INFO - load model from: modelzoo://resnet50
2019-05-24 20:08:25,611 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias
missing keys in source state_dict: layer2.2.bn1.num_batches_tracked, layer2.2.bn3.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.0.bn1.num_batches_tracked, layer4.1.bn1.num_batches_tracked,
layer2.0.downsample.1.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer4.2.bn1.num_batches_tracked,
layer3.5.bn3.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked,
layer3.4.bn3.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer3.1.bn1.num_batches_tracked,
layer2.0.bn3.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked,
layer2.3.bn3.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer3.1.bn2.num_batches_tracked, bn1.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked,
layer3.5.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked,
layer3.2.bn2.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=202.67s)
creating index...
index created!
Done (t=254.98s)
creating index...
index created!
Done (t=278.15s)
creating index...
Done (t=279.31s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=1.17s)
creating index...
index created!
Done (t=1.26s)
creating index...
index created!
Done (t=1.36s)
creating index...
index created!
Done (t=1.82s)
creating index...
index created!
2019-05-24 20:13:14,064 - INFO - Start running, host: ices@ices-SYS-4028GR-TR, work_dir: /home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x
2019-05-24 20:13:14,065 - INFO - workflow: [('train', 1)], max: 12 epochs
Traceback (most recent call last):
  File "./tools/train.py", line 95, in <module>
    main()
  File "./tools/train.py", line 91, in main
    logger=logger)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 59, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/ices/andrewjyz/Projects/detection/2019-5-23-18-56/mmdetection/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/mmcv/runner/runner.py", line 258, in train
    for i, data_batch in enumerate(data_loader):
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ices/andrewjyz/miniconda3/envs/open-mmlab_ldh5/lib/python3.7/multiprocessing/util.py", line 420, in spawnv_passfds
    False, False, None)
OSError: [Errno 12] Cannot allocate memory
```
My dataset is in COCO format and the JSON files include "segmentation" data. The training JSON file is 7.0 GB and there are 100,000 images (image size 1851×1851). When I train the model, it cannot load the dataset and the above error appears.
My server has 252 GB of RAM. The GPUs are GeForce GTX 1080 Ti with 11178 MiB of memory each.
I would like to ask: is all of the data loaded into memory at once during training?
If the data is too big, how should I train?
I hope someone can help me solve this problem, thanks.
For each dataloader worker process, the whole annotation (JSON) file is loaded into memory. You may reduce the number of workers (the default is 2 * GPU_NUM) and disable validation during training.
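If it helps, here is a minimal sketch of where those settings live, assuming the v1-style config layout; the dataset paths below are placeholders, so keep your own `train`/`val` entries from `./configs/rpc/faster_rcnn_r50_fpn_1x.py`:

```python
# excerpt of ./configs/rpc/faster_rcnn_r50_fpn_1x.py (sketch; dataset paths are placeholders)
data = dict(
    imgs_per_gpu=1,
    # Fewer dataloader workers means fewer processes that each hold the 7 GB
    # annotation file in memory. The mmdetection default is workers_per_gpu=2.
    workers_per_gpu=1,
    train=dict(
        type='CocoDataset',
        ann_file='data/rpc/annotations/instances_train.json',  # placeholder path
        img_prefix='data/rpc/train/'),                          # placeholder path
    val=dict(
        type='CocoDataset',
        ann_file='data/rpc/annotations/instances_val.json',     # placeholder path
        img_prefix='data/rpc/val/'))
```

Then launch without the `--validate` flag, e.g. `CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rpc/faster_rcnn_r50_fpn_1x.py 4`, so the validation annotations are not loaded by every training process as well.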
OK, I'm going to try that.
I want to know how to modify the worker number, because I set imgs_per_gpu=1 and workers_per_gpu=1 in the config file, but the number of workers still seems unchanged. @hellock
I found that the problem occurred because the CPU memory usage was too high on my machine!
Kill some other processes!
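Before killing things blindly, it can help to see which processes are actually holding the memory. A small sketch, assuming the third-party `psutil` package is available (`pip install psutil`); `free -h` or `top` show the same information:

```python
# mem_top.py -- hypothetical helper: print the 10 processes with the largest resident memory
import psutil

procs = []
for p in psutil.process_iter(['pid', 'name', 'memory_info']):
    info = p.info
    if info['memory_info'] is None:  # access to this process was denied
        continue
    procs.append((info['memory_info'].rss, info['pid'], info['name']))

for rss, pid, name in sorted(procs, reverse=True)[:10]:
    print('{:8.2f} GiB  pid={:<7} {}'.format(rss / 1024 ** 3, pid, name))
```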
I have the same problem when loading the COCO dataset, but it used to work fine with mmdet v1.

I found the same problem and solved it by expanding the swap space.
Step-by-step solution:
1. Create a swap file of the desired size, e.g. 2 GB: `dd if=/dev/zero of=/var/swap bs=1024 count=2048000`
2. Set up the swap file: `mkswap /var/swap`
3. Activate the swap file: `swapon /var/swap`
Good luck!
@mzk665
Thanks for providing this solution.
However, I cannot activate the swap file. The error message is as follows:
`root@mmdetection20200628:/mmdetection# swapon /var/swap`
`swapon: /var/swap: swapon failed: Operation not permitted`
Perhaps the Docker settings are the reason I can't activate the file. Are you working in a Docker environment? Do you know how to solve this problem?