Note: DataLoader error when using multiprocessing
Minimal reproducible code:
import multiprocessing as mp
from mxnet import gluon

def bug_train():
    train_data = gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=True),
        batch_size=128, shuffle=True, last_batch='discard', num_workers=2)
    val_data = gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=False),
        batch_size=128, shuffle=False, num_workers=2)

if __name__ == '__main__':
    p = mp.Process(target=bug_train)
    p.start()
    p.join()
Terminal output:
Segmentation fault: 11
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3f935a) [0x7fb43436535a]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3513b36) [0x7fb43747fb36]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fb503a0f4b0]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libiomp5.so(+0xa9617) [0x7fb4ffe8e617]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libiomp5.so(GOMP_parallel_start+0x115) [0x7fb4ffe7d105]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3046d3b) [0x7fb436fb2d3b]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x523) [0x7fb436f27203]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x2b) [0x7fb436c8efab]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fb502b26ec0]
[bt] (9) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fb502b2687d]
mxnet 1.4.1 or 1.5.1
ubuntu
cuda 10.0
I couldn't reproduce the bug on ArchLinux 5.2.0, MXNet 1.5.0
This error is not consistent. Some of my AWS EC2 instances have this issue, but others do not.
I am closing this for now.
This may be because of Python version 3.7.3.
The version of Python I used is 3.7.3 (default, Jun 24 2019, 04:54:02) [GCC 9.1.0] on Linux too.
I think mxnet::NDArray::SyncCopyFromCPU doesn't support multiprocessing on some machines.
Reopening. I found a setting that reproduces the error consistently:
AWS EC2 P3.16-Xlarge
ubuntu 16.04
python: anaconda 3.7.3
mxnet: mxnet-cu100mkl
The deep learning AMI using anaconda 3.6.5 does not have this problem.
@zhanghang1989 Could this be related to : https://github.com/apache/incubator-mxnet/issues/15690 ?
Thanks @piyushghai
I think this issue is independent of the other one. My error only happens when starting a new process.
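Since the crash only shows up when the DataLoader is built inside a process started with mp.Process, one thing that may be worth trying is the "spawn" start method, which launches a fresh interpreter instead of forking the parent (fork is the default on Linux in Python 3.7). The sketch below is an untested assumption, not a confirmed fix for this segfault; run_spawned is a hypothetical helper name, and train mirrors the repro above.

```python
import multiprocessing as mp


def train():
    # MXNet is imported here, inside the child, so the helper module itself
    # can be loaded without initializing libmxnet at import time.
    from mxnet import gluon

    train_data = gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=True),
        batch_size=128, shuffle=True, last_batch='discard', num_workers=2)
    # Pull one batch so the worker processes actually do some work.
    for data, label in train_data:
        break


def run_spawned(target, *args):
    """Hypothetical helper: run `target` in a spawned (not forked) process."""
    ctx = mp.get_context('spawn')  # fresh interpreter, no inherited state
    p = ctx.Process(target=target, args=args)
    p.start()
    p.join()
    return p.exitcode
```

Usage would be run_spawned(train) under an if __name__ == '__main__': guard, which the spawn start method requires. A negative exit code means the child was killed by a signal (for example -11 for SIGSEGV).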
In addition to what is discussed above: it doesn't matter if it is an EC2 P3.16 or P3.2.
In fact the same thing happened to me on P3.2 after I tried to upgrade Python to 3.7 using anaconda. Later on I tried to get rid of these bugs by going back to Python 3.6, but the issues seem to persist from then on.
So frustrating:
Segmentation fault: 11
Stack trace:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2e6b160) [0x7f07f023f160]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f082d42a4b0]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so(+0xa82a8) [0x7f082b9eb2a8]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so(GOMP_parallel_start+0x115) [0x7f082b9d9a35]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2817760) [0x7f07efbeb760]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x281cd8d) [0x7f07efbf0d8d]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x27c) [0x7f07efb7c5ac]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x2b) [0x7f07ef8f890b]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f082c4a5ec0]
Worst of all, the message contains no useful information except "Segmentation fault". I hope this problem can be solved. For now I'm moving on by launching a new server instance.
Hi. Forking a process and using MXNet inside it is not supported. We should probably detect this and throw an error.
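To illustrate what "detect this and throw an error" could look like, here is a minimal sketch in plain Python. It remembers the PID of the process that initialized a resource and raises if the resource is later used from a different process, such as a forked child. ForkGuard is a hypothetical name used for illustration only; it is not part of MXNet, and an actual check would presumably live in the native library.

```python
import os


class ForkGuard:
    """Hypothetical guard: flags use from a process other than the creator."""

    def __init__(self):
        # PID of the process that performed the initialization.
        self._owner_pid = os.getpid()

    def check(self):
        # After a fork, the child has a different PID than the recorded one.
        current = os.getpid()
        if current != self._owner_pid:
            raise RuntimeError(
                "initialized in process %d but used in process %d; "
                "use after fork is not supported"
                % (self._owner_pid, current))
```

A library would create one guard at import time and call guard.check() at the top of each entry point, turning a bare segfault into a readable error.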
It isn't always like this. Normally it doesn't happen; it only started after I tried to upgrade Python to 3.7 on the server. My laptop also has Python 3.7 installed locally and nothing goes wrong there; it is simply slow (no GPU). I can no longer reproduce the error, since I terminated the buggy environment to make my life easier. But the code above definitely does not cause any error in my local Mac environment, nor on the other instances I've launched.
I can confirm that this can be reproduced with anaconda environment + pip installed mxnet.
However,
So I suggest transferring this issue to the PyPI package building pipeline to see whether the statically linked libs are causing the error.
@szha
Seems like a duplicate of #14979, and there's more context there, closing to avoid duplicate investigations.