Maskrcnn-benchmark: OSError: [Errno 24] Too many open files

Created on 14 Nov 2018 · 3 comments · Source: facebookresearch/maskrcnn-benchmark

โ“ Questions and Help

After merging the commit fix maskrnn typo (#154), whenever I run the training procedure it fails with the error below:

 Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_30997_2076642173> in read-write mode
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/usr/lib/python3.6/socket.py", line 460, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 60, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError

Does anyone know how to fix this?
Thanks.

question

Most helpful comment

Following OSError: Too many open files #396, I added these two lines to /etc/security/limits.conf:

*               soft    nofile         65535
*               hard    nofile         65535

Then I rebooted, which solved it.
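For reference, the current per-process limit can be inspected and raised for the running shell session without a reboot (a session-only alternative; the limits.conf edit above is what makes it permanent across logins):

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -n

# Raise the soft limit for this session only; it cannot exceed the
# hard limit, and raising the hard limit itself requires root
ulimit -n 65535
```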

All 3 comments


Do we really need to open so many files?

@yaohuaxin This is due to how the DataLoader works with multiple worker processes: each tensor shared between workers and the main process can hold an open file descriptor, so certain combinations of settings (many workers, many small tensors per batch) exhaust the default limit.
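A PyTorch-side workaround (rather than raising the OS limit) is to switch the tensor-sharing strategy from file descriptors to the filesystem, so shared tensors no longer each consume an fd. `set_sharing_strategy` is a real `torch.multiprocessing` API; whether this trade-off suits a given setup is a judgment call, since the `file_system` strategy can leave shared-memory files behind if a process dies uncleanly:

```python
import torch.multiprocessing as mp

# Share tensors between DataLoader worker processes via files in shared
# memory (/dev/shm) instead of via file descriptors, sidestepping the
# per-process open-file limit. Call this once, before creating DataLoaders.
mp.set_sharing_strategy("file_system")
```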

