Maskrcnn-benchmark: RuntimeError: unable to open shared memory object </torch_29919_1396182366> in read-write mode

Created on 3 Nov 2018 · 5 comments · Source: facebookresearch/maskrcnn-benchmark

🐛 Bug


Thanks for the maskrcnn-benchmark project, which is really awesome work! However, I ran into a problem while training on my own instance segmentation dataset, as described below.

To Reproduce

Steps to reproduce the behavior:

  1. I replaced the original (instance segmentation) training dataset (e.g. COCO) with my own dataset, which is organized in the same format as COCO, i.e. a JSON file for the annotations. I used "R-101-FPN" for instance segmentation training on a single TITAN X GPU.
  2. To make the configuration match my own dataset, I also modified the dataset configuration in ~/maskrcnn-benchmark/maskrcnn_benchmark/config/paths_catalog.py. The main changes were the paths pointing to my own dataset (a rough sketch of the kind of entry I added is shown right after this list), and I don't think these changes could cause the training failure.
  3. Of course, ~/maskrcnn-benchmark/maskrcnn_benchmark/config/defaults.py and ~/maskrcnn-benchmark/configs/e2e_mask_rcnn_R_101_FPN_1x.yaml were employed as the default configuration. I also set _C.SOLVER.IMS_PER_BATCH = 1 and _C.TEST.IMS_PER_BATCH = 1.
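To make step 2 concrete, here is an illustrative sketch of the kind of entry I added to paths_catalog.py. The dataset names and paths below are placeholders standing in for my own data, not the real values:

# Illustrative only: placeholder names/paths for my own COCO-style dataset.
# (The real DatasetCatalog in paths_catalog.py also has a get() method that
# turns these entries into dataset arguments; only the dictionary needed changing.)
class DatasetCatalog(object):
    DATA_DIR = "datasets"
    DATASETS = {
        "my_coco_style_train": {
            "img_dir": "my_dataset/train_images",
            "ann_file": "my_dataset/annotations/instances_train.json",
        },
        "my_coco_style_val": {
            "img_dir": "my_dataset/val_images",
            "ann_file": "my_dataset/annotations/instances_val.json",
        },
    }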

Everything was fine at the beginning of training. However, after several thousand iterations, training broke down. For brevity, I paste the final training output here:

2018-11-03 11:11:27,514 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:50 iter: 6840 loss: 0.8033 (1.0337) loss_classifier: 0.2025 (0.2768) loss_box_reg: 0.1245 (0.1395) loss_mask: 0.3138 (0.4098) loss_objectness: 0.0600 (0.1297) loss_rpn_box_reg: 0.0195 (0.0779) time: 0.3105 (0.3053) data: 0.0067 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:33,930 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:57 iter: 6860 loss: 0.9135 (1.0337) loss_classifier: 0.1846 (0.2767) loss_box_reg: 0.0630 (0.1395) loss_mask: 0.3499 (0.4097) loss_objectness: 0.0861 (0.1298) loss_rpn_box_reg: 0.0168 (0.0780) time: 0.2981 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:40,246 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:52:00 iter: 6880 loss: 0.7548 (1.0331) loss_classifier: 0.1516 (0.2764) loss_box_reg: 0.0880 (0.1395) loss_mask: 0.3342 (0.4095) loss_objectness: 0.0588 (0.1298) loss_rpn_box_reg: 0.0457 (0.0780) time: 0.3046 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:46,088 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:43 iter: 6900 loss: 0.5536 (1.0324) loss_classifier: 0.1185 (0.2762) loss_box_reg: 0.0669 (0.1394) loss_mask: 0.2970 (0.4092) loss_objectness: 0.0445 (0.1297) loss_rpn_box_reg: 0.0095 (0.0779) time: 0.2823 (0.3054) data: 0.0048 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:52,392 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:45 iter: 6920 loss: 0.7813 (1.0319) loss_classifier: 0.1759 (0.2761) loss_box_reg: 0.0824 (0.1394) loss_mask: 0.3130 (0.4090) loss_objectness: 0.0393 (0.1295) loss_rpn_box_reg: 0.0133 (0.0779) time: 0.3052 (0.3054) data: 0.0061 (0.0129) lr: 0.002500 max mem: 4887
Traceback (most recent call last):
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_29919_1396182366> in read-write mode
Traceback (most recent call last):
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 179, in send_handle
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/home/ly/sfw/anaconda3/lib/python3.7/socket.py", line 463, in fromfd
nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "tools/train_net.py", line 172, in
main()
File "tools/train_net.py", line 165, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 74, in train
arguments,
File "/home/ly/projects/MaskRCNN/maskrcnn/maskrcnn_benchmark/engine/trainer.py", line 56, in do_train
for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
idx, batch = self._get_batch()
File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
return self.data_queue.get()
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
fd = df.detach()
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
return recvfds(s, 1)[0]
File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
raise EOFError
EOFError

Environment

  • PyTorch Version (e.g., 1.0): 1.0
  • OS (e.g., Linux): Ubuntu 16.04
  • How you installed PyTorch (conda, pip, source): conda install pytorch-nightly -c pytorch
  • Build command you used (if compiling from source):
  • Python version: python 3.7
  • CUDA/cuDNN version: cuda 9.0 / cuDNN 7.1.2
  • GPU models and configuration: a single TITAN X, selected with torch.cuda.set_device(3)
  • Any other relevant information:

What should I do to solve this problem? Thanks for your help!

Labels: bug, dependency


All 5 comments

The problem seems to be caused by the num_workers parameter of torch.utils.data.DataLoader(...), which has been discussed intensively at https://github.com/pytorch/pytorch/issues/1355. In my investigation, setting _C.DATALOADER.NUM_WORKERS > 0 may lead to the errors mentioned above. I therefore set _C.DATALOADER.NUM_WORKERS = 0, and training has kept running for tens of thousands of iterations without anything unusual happening. However, fewer workers means more training time is needed. A minimal sketch of what this setting controls is shown below.
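For anyone wondering what that setting does under the hood, here is a minimal illustration (my own sketch, not code from the repo) of how the worker count maps onto torch.utils.data.DataLoader. With num_workers=0, batches are loaded in the main process, so no tensors are handed between worker processes through shared memory and the error above cannot be triggered:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a COCO-style dataset; shapes are arbitrary.
dataset = TensorDataset(torch.randn(8, 3, 224, 224))
# num_workers=0 loads batches in the main process (no shared-memory hand-off).
loader = DataLoader(dataset, batch_size=1, num_workers=0)
for (images,) in loader:
    pass  # a training step would go here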

Yes, it looks like you are running out of shared memory. Could you try increasing it?

Same issue.
Looks like this issue is related to this one.
Any clue?

@hyichao the problem is probably that you are running out of shared memory; increasing it should fix the issue.

Check https://github.com/pytorch/pytorch/issues/1355#issuecomment-297184037 for more details
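As a quick sanity check before changing anything (an illustrative snippet, not something from that thread), you can inspect the current open-file limit and the size of /dev/shm from Python:

import os
import resource

# Per-process open-file limit; the "Too many open files" (Errno 24) above hits the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# Size and free space of the shared-memory filesystem used by the workers.
stats = os.statvfs("/dev/shm")
gib = 1024 ** 3
print("/dev/shm: %.1f GiB total, %.1f GiB free"
      % (stats.f_blocks * stats.f_frsize / gib, stats.f_bavail * stats.f_frsize / gib))

# The soft limit can usually be raised up to the hard limit without root privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))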

I have the same issue; the following code solves it:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

This code is from #11201
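For context (the placement here is an assumption, not something stated in #11201): the call has to run before any DataLoader is created, e.g. at the top of tools/train_net.py, so every worker process inherits it. The 'file_system' strategy shares tensors via file names rather than file descriptors, which sidesteps the "Too many open files" (Errno 24) failure shown in the traceback above.

import torch.multiprocessing as mp

# Must run before the data loaders are built so the workers inherit the setting.
mp.set_sharing_strategy("file_system")
assert mp.get_sharing_strategy() == "file_system"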
