Incubator-mxnet: Windows segmentation faults in GPU tests

Created on 20 Feb 2020  ·  4 Comments  ·  Source: apache/incubator-mxnet

Description

Windows GPU tests from the updated environment in https://github.com/aiengines/ci fail with the following:

======================================================================
ERROR: Failure: OSError (exception: access violation writing 0x0000000000000000)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\nose\failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "C:\Python37\lib\site-packages\nose\loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Python37\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Python37\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\Python37\lib\imp.py", line 235, in load_module
    return load_source(name, filename, file)
  File "C:\Python37\lib\imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\Administrator\mxnet\tests\python\unittest\test_test_utils.py", line 21, in <module>
    import mxnet as mx
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\__init__.py", line 33, in <module>
    from . import contrib
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\contrib\__init__.py", line 27, in <module>
    from . import autograd
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\contrib\autograd.py", line 27, in <module>
    from ..ndarray import NDArray, zeros_like, _GRAD_REQ_MAP
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\__init__.py", line 20, in <module>
    from . import _internal, contrib, linalg, op, random, sparse, utils, image, ndarray, numpy
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\numpy\__init__.py", line 23, in <module>
    from . import _register
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\numpy\_register.py", line 21, in <module>
    from ..register import _make_ndarray_function
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\register.py", line 277, in <module>
    _init_op_module('mxnet', 'ndarray', _make_ndarray_function)
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\base.py", line 682, in _init_op_module
    ctypes.byref(plist)))
OSError: exception: access violation writing 0x0000000000000000

======================================================================
ERROR: Failure: OSError (exception: access violation writing 0x0000000000000000)
----------------------------------------------------------------------
Traceback (most recent call last):

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)

To Reproduce

Create an AMI with the provided scripts, compile and run GPU tests.

Steps to reproduce

(Paste the commands you ran that produced the error.)

1.
2.

What have you tried to solve it?

1.
2.

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
Bug

All 4 comments

I was able to run the script ci/windows/test_py3_gpu.ps1 properly and did not run into this error with the setup script provided on a local Windows machine.

Ok, that's surprising. What build did you do before running the tests?

> Ok, that's surprising. What build did you do before running the tests?

I simply ran pre_setup.ps1 and setup.ps1. Then I cloned MXNet, ran py -3 ci/build_windows.py -f WIN_GPU, and went into the python folder to install with python setup.py install.

I was able to import mxnet properly and the test script worked fine.
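The import check described above can be sketched as a small script that imports the package in a fresh interpreter, so that a failing native library shows up as a non-zero exit code rather than taking down the test runner. This is only a sketch: the helper name can_import and its stack_depth parameter are hypothetical, and the only thing taken from the issue is the DMLC_LOG_STACK_TRACE_DEPTH variable mentioned in the template.

```python
import os
import subprocess
import sys


def can_import(module_name, stack_depth=10):
    """Return True if `module_name` imports cleanly in a fresh interpreter.

    `can_import` and `stack_depth` are hypothetical names used for this sketch.
    """
    # Copy the current environment and ask for deeper stack traces,
    # as the issue template suggests.
    env = dict(os.environ, DMLC_LOG_STACK_TRACE_DEPTH=str(stack_depth))
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        # Show what the child process wrote so the failure is visible in logs.
        print(result.stderr.decode(errors="replace"))
    return result.returncode == 0
```

Calling can_import("mxnet") right after python setup.py install would then report whether the freshly built package loads at all, before the full GPU test suite is started.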

It turns out that the CUDA architecture detection probably failed. If you go with the 5.2 in the script, it does give you mxnet_70.dll on a P3 instance, but the binary might be wrong. When I set -DMXNET_CUDA_ARCH="7.0", things work properly.
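To make the mismatch concrete: P3 instances carry V100 GPUs, which have compute capability 7.0, while the script value quoted above is 5.2. The sketch below compares the two and builds the define string used in the workaround. The helper names parse_arch and cuda_arch_flag are hypothetical, and whether this mismatch is the actual cause of the access violation is the hypothesis of this comment, not something the code verifies.

```python
def parse_arch(text):
    """Turn a compute-capability string such as "5.2" into (5, 2)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def cuda_arch_flag(capability):
    """Build the CMake define for a (major, minor) compute capability."""
    major, minor = capability
    return '-DMXNET_CUDA_ARCH="{}.{}"'.format(major, minor)


# Compute capability of the V100 GPU found on P3 instances.
device_arch = (7, 0)
# Value quoted from the build script in this thread.
script_arch = parse_arch("5.2")

if script_arch != device_arch:
    # Print the define that matches the GPU the tests will run on.
    print("arch mismatch, consider:", cuda_arch_flag(device_arch))
```

How the define reaches CMake depends on the build script in use, so the sketch only produces the string and does not invoke the build.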

Still pretty weird, since yesterday when I was trying this with Clang it gave me the same segmentation fault. I'll probably dig deeper into this if it happens again on a fresh P3 instance when I try to integrate TVM ops.
