Incubator-mxnet: DataLoader Error using Multi processing

Created on 23 Jul 2019 · 13 Comments · Source: apache/incubator-mxnet

Description

Minimal reproducible code:

import multiprocessing as mp
from mxnet import gluon

def bug_train():
    # Build two multi-worker DataLoaders over CIFAR10; nothing is iterated.
    train_data = gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=True),
        batch_size=128, shuffle=True, last_batch='discard', num_workers=2)

    val_data = gluon.data.DataLoader(
        gluon.data.vision.CIFAR10(train=False),
        batch_size=128, shuffle=False, num_workers=2)


if __name__ == '__main__':
    # Run bug_train in a child process instead of the main process.
    p = mp.Process(target=bug_train)
    p.start()
    p.join()

Terminal output:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3f935a) [0x7fb43436535a]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3513b36) [0x7fb43747fb36]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fb503a0f4b0]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libiomp5.so(+0xa9617) [0x7fb4ffe8e617]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libiomp5.so(GOMP_parallel_start+0x115) [0x7fb4ffe7d105]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3046d3b) [0x7fb436fb2d3b]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x523) [0x7fb436f27203]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x2b) [0x7fb436c8efab]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fb502b26ec0]
[bt] (9) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fb502b2687d]

Environment info (Required)

mxnet 1.4.1 or 1.5.1
ubuntu
cuda 10.0

Bug Data-loading

Source

zhanghang1989

All 13 comments

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

mxnet-label-bot on 23 Jul 2019

I couldn't reproduce the bug on ArchLinux 5.2.0, MXNet 1.5.0

wkcn on 23 Jul 2019

This error is not consistent. Some of my AWS EC2 instances have this issue, but others do not.
I am closing this for now.

zhanghang1989 on 23 Jul 2019

This may be caused by Python version 3.7.3.

zhanghang1989 on 23 Jul 2019

The version of Python I used is also 3.7.3 (default, Jun 24 2019, 04:54:02) [GCC 9.1.0] on Linux.
I think mxnet::NDArray::SyncCopyFromCPU doesn't support multiprocessing on some machines.

wkcn on 24 Jul 2019

Reopening this. I found a setup that reproduces the error consistently:

AWS EC2 P3.16-Xlarge
ubuntu 16.04
python: anaconda 3.7.3
mxnet: mxnet-cu100mkl

The deep learning AMI using anaconda 3.6.5 does not have this problem.

zhanghang1989 on 30 Jul 2019

@zhanghang1989 Could this be related to https://github.com/apache/incubator-mxnet/issues/15690 ?

piyushghai on 30 Jul 2019

Thanks @piyushghai
I think this issue is independent of the other one. My error only happens when starting a process.

zhanghang1989 on 30 Jul 2019
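
For context on "starting a process": the reproducer calls mp.Process, which in Python 3.7 on Linux uses the fork start method by default. Below is a minimal standard-library sketch of how to inspect the start method and request a spawn context instead. The helper names start_method_report and spawn_context are hypothetical, and nothing in this thread confirms that spawn avoids the segfault.

```python
import multiprocessing as mp


def start_method_report():
    """Report the start method in effect and the ones this platform offers."""
    return {
        # Note: calling get_start_method() fixes the default if none was set yet.
        "default": mp.get_start_method(),
        "available": mp.get_all_start_methods(),
    }


def spawn_context():
    """Return a context whose Process objects start a fresh interpreter."""
    return mp.get_context("spawn")
```

With such a context, spawn_context().Process(target=bug_train) would take the place of mp.Process(target=bug_train) in the reproducer; the target must then be importable from the main module.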

In addition to what is discussed above: it doesn't matter if it is an EC2 P3.16 or P3.2.

In fact the same thing happened to me on a P3.2 after I tried to upgrade Python to 3.7 using Anaconda. Later on I tried to get rid of these errors by going back to Python 3.6, but the problem seems to have persisted ever since.

So frustrating:

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2e6b160) [0x7f07f023f160]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f082d42a4b0]
  [bt] (2) /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so(+0xa82a8) [0x7f082b9eb2a8]
  [bt] (3) /home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/../../../libiomp5.so(GOMP_parallel_start+0x115) [0x7f082b9d9a35]
  [bt] (4) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2817760) [0x7f07efbeb760]
  [bt] (5) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x281cd8d) [0x7f07efbf0d8d]
  [bt] (6) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x27c) [0x7f07efb7c5ac]
  [bt] (7) /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x2b) [0x7f07ef8f890b]
  [bt] (8) /home/ubuntu/anaconda3/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f082c4a5ec0]

Worst of all, the message contains no useful information beyond "Segmentation fault". I hope this problem can be solved, but for now I'm moving on by launching a new server instance.

PatriciaXiao on 31 Jul 2019
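
Both stack traces pass through libiomp5.so (the Intel OpenMP runtime), loaded from the mxnet package directory in the first trace and from the Anaconda lib directory (numpy/../../../) in the second. As a diagnostic aid, here is a minimal Linux-only sketch that lists which OpenMP runtime files a process has mapped. The function names are hypothetical, and the thread does not establish that multiple runtimes are the cause of the crash.

```python
import os

# Library name prefixes treated as OpenMP runtimes (Intel, GNU, LLVM).
OPENMP_PREFIXES = ("libiomp5", "libgomp", "libomp")


def openmp_runtimes(maps_text):
    """Return sorted unique OpenMP library paths found in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if os.path.basename(path).startswith(OPENMP_PREFIXES):
            found.add(path)
    return sorted(found)


def current_process_openmp_runtimes():
    """Inspect the calling process; requires a Linux /proc filesystem."""
    with open("/proc/self/maps") as f:
        return openmp_runtimes(f.read())
```

Calling current_process_openmp_runtimes() after importing numpy and mxnet would show whether more than one copy is loaded in that environment.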

Hi. Forking a process and then using MXNet inside it is not supported. We should probably detect this and throw an error.

larroy on 14 Aug 2019
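
One generic way to "detect this and throw an error" is to remember the PID that initialised a module and compare it on later use. The sketch below uses only the standard library; it is not MXNet's mechanism, and ForkedProcessError, in_forked_child and ensure_not_forked are hypothetical names.

```python
import os

# PID of the process that first imported this module (the "owner").
_OWNER_PID = os.getpid()


class ForkedProcessError(RuntimeError):
    """Raised when guarded code runs in a forked child of the owner."""


def in_forked_child():
    """True if the current PID differs from the PID seen at import time."""
    return os.getpid() != _OWNER_PID


def ensure_not_forked(what="this library"):
    """Raise instead of continuing when called from a forked child."""
    if in_forked_child():
        raise ForkedProcessError(
            f"{what} was initialised in PID {_OWNER_PID} "
            f"but is being used in forked PID {os.getpid()}"
        )
```

A forked child inherits _OWNER_PID from its parent, so the check fails there. A spawned child re-imports the module, records its own PID, and passes the check.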

It isn't always like this. Normally it doesn't happen; it only started after I tried to upgrade Python to 3.7 on the server. My laptop also has Python 3.7 installed locally and nothing goes wrong there; it is simply slow (no GPU). I can no longer reproduce the error, since I terminated the buggy environment to make my life easier. But I am sure that the code above does not cause any error on my local Mac environment, nor on the other instances I've launched.

PatriciaXiao on 20 Aug 2019

I can confirm that this can be reproduced with an Anaconda environment and pip-installed MXNet.
However,