Incubator-mxnet: "socket.error: [Errno 111] Connection refused" while training with multiple workers

Created on 24 Jul 2018 · 12 Comments · Source: apache/incubator-mxnet

Hi,
I am getting the following error after a few data iterations (at 551/22210):
File "train.py", line 201, in <module>
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in __iter__
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.__next__()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in __next__
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I am using the latest nightly build of MXNet along with the newly added SyncBatchNorm layer. The error occurs both with and without the SyncBatchNorm layer.

I am using the MXNet Docker image.

Any help is much appreciated.

https://github.com/dmlc/gluon-cv/issues/215

@zhreshold would you be able to comment on this?

Data-loading

All 12 comments

This is related to a recent change in which we switched from shared memory to file descriptors on Linux for inter-process communication. We are still investigating solutions.
Of course, we can add an option to enable either approach as a fallback.

Temporary solutions:

  1. Increase shared memory if it is too small. Use df -h /dev/shm to check the shared memory size and usage. To raise the limit, edit /etc/sysctl.conf and add or edit the line kernel.shmmax = 4294967296, for example, to allow a maximum of 4 GB of shared memory.
  2. Reduce num_workers. If you set num_workers = 0, no multiprocessing workers are used, but data loading is slower.
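
For anyone who wants to do the df -h /dev/shm check from inside the training script, here is a minimal sketch. It is not part of MXNet; the helper names shm_stats and format_gib are made up for illustration, and it assumes a POSIX system where os.statvfs is available.

```python
import os


def shm_stats(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the filesystem mounted at `path`.

    Hypothetical helper: it only wraps os.statvfs and does not call MXNet.
    """
    st = os.statvfs(path)
    total = st.f_frsize * st.f_blocks   # size of the whole filesystem
    free = st.f_frsize * st.f_bavail    # space available to unprivileged users
    return total, free


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return "%.2f GiB" % (num_bytes / float(1024 ** 3))


if __name__ == "__main__":
    # /dev/shm exists on most Linux systems; skip quietly elsewhere.
    if os.path.isdir("/dev/shm"):
        total, free = shm_stats()
        print("shared memory: %s total, %s free"
              % (format_gib(total), format_gib(free)))
```

Running this before creating the DataLoader gives a quick hint whether the shared memory mount is suspiciously small for the chosen num_workers.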

I have figured out that the prefetch strategy of the data loader is too aggressive, which might cause the shared-memory issue.
The fix is included in https://github.com/apache/incubator-mxnet/pull/11908

Thanks @zhreshold, I will follow this PR.

Yes, even with few (0/1) workers, resource usage was quite high and required more shared memory than usual.

With #11908 merged, I am closing this for now. Feel free to ping me if the issue persists.

I am using the latest master and the issue still persists in Docker. Even num_workers = 1 causes a hang in the dataloader's while True loop.

@ifeherva Use docker run --shm-size xxx. If it is not specified, Docker gives the container only a small default amount of shared memory (64 MB for /dev/shm).

@zhreshold Good point. How much shared memory is recommended for mxnet?

That depends on the input batch size, the data shape, and the number of workers. Several GB is usually recommended for multi-GPU training.
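
To make the dependence on batch size, data shape, and worker count concrete, here is a rough back-of-the-envelope sketch. It is a heuristic and not how MXNet accounts for memory internally: the function name estimate_shm_bytes is hypothetical, the prefetch_per_worker=2 default is an assumption about how many batches each worker keeps in flight, and it counts only the data tensors (labels and pickling overhead are ignored).

```python
def estimate_shm_bytes(batch_size, sample_shape, num_workers,
                       dtype_bytes=4, prefetch_per_worker=2):
    """Rough lower bound on shared memory used by in-flight batches.

    Assumptions (not taken from MXNet): each worker keeps
    `prefetch_per_worker` batches in flight, and every element takes
    `dtype_bytes` bytes (4 for float32).
    """
    elements = batch_size
    for dim in sample_shape:
        elements *= dim
    batch_bytes = elements * dtype_bytes
    # With num_workers = 0 there is still one batch alive at a time.
    in_flight = max(1, num_workers * prefetch_per_worker)
    return batch_bytes * in_flight


# Example: 16 float32 images of shape (3, 480, 480) with 8 workers.
needed = estimate_shm_bytes(16, (3, 480, 480), num_workers=8)
print("approx. %.2f GiB" % (needed / float(1024 ** 3)))
```

Treat the result as a floor when choosing a --shm-size value and leave generous headroom on top of it.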

@zhreshold Adding shared memory to docker solved the problem. Thanks!

With num_workers = 30, I am consistently getting connection timeouts. I will try reducing the number of workers. Setup: mxnet-cu101mkl 1.6.0b20191207 on a p3.16xlarge SageMaker notebook instance.
