After merging the commit "fix maskrnn typo" (#154), when I run the training procedure it always fails with the error below:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_30997_2076642173> in read-write mode
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/usr/lib/python3.6/socket.py", line 460, in fromfd
nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
File "tools/train_net.py", line 170, in <module>
main()
File "tools/train_net.py", line 163, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 73, in train
arguments,
File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 60, in do_train
for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
idx, batch = self._get_batch()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
return self.data_queue.get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
raise EOFError
EOFError
Does anyone know how to fix it?
Thanks.
I followed "OSError: Too many open files" #396 and added these two lines to /etc/security/limits.conf:
* soft nofile 65535
* hard nofile 65535
Then I rebooted, which solved it.
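If editing /etc/security/limits.conf (and rebooting or re-logging-in) is inconvenient, the soft limit can also be raised per-process at the top of the training script with the stdlib resource module. This is a minimal sketch, Linux-only, and assumes an unprivileged process, which can only raise the soft limit up to the existing hard limit:

```python
import resource

# Current per-process file-descriptor limits (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit as close to 65535 as the hard limit allows.
target = 65535 if hard == resource.RLIM_INFINITY else min(65535, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

print("nofile soft limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

PyTorch also documents an alternative, torch.multiprocessing.set_sharing_strategy('file_system'), which shares storages through the filesystem instead of file descriptors and so sidesteps the nofile limit entirely.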
Do we really need to open so many files?
@yaohuaxin This is due to how the DataLoader works with multiple worker processes: each tensor a worker sends back to the main process is shared through a file descriptor, so some particular combinations of settings (many workers, many tensors per batch) can exhaust the per-process descriptor limit.
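"[Errno 24] Too many open files" is just the kernel's EMFILE: once the soft RLIMIT_NOFILE is reached, any further open()/dup() in the process fails, exactly like the `nfd = dup(fd)` at the bottom of the first traceback. A stdlib-only sketch (using pipe descriptors as a stand-in for the shared-memory descriptors the workers create) reproduces it:

```python
import errno
import os
import resource

# Temporarily lower the soft limit so the demo hits it quickly.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

fds = []
try:
    while True:
        # Each descriptor counts against RLIMIT_NOFILE, just like each
        # shared tensor a DataLoader worker passes to the main process.
        fds.extend(os.pipe())
except OSError as e:
    print(f"after {len(fds)} extra descriptors: {e}")
    assert e.errno == errno.EMFILE  # [Errno 24] Too many open files
finally:
    for fd in fds:
        os.close(fd)
    # Restore the original limits.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

So the question is less "do we need 65535 files" and more "how many workers times how many tensors per batch are alive at once" — raising the limit just gives that product enough headroom.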