Hi all!
_In short:_
import mxnet returns:
ImportError: No module named mxnet
I have mxnet-cu90 version 1.1.0 installed. Not sure why it isn't being found.
_In long:_
I'm having a problem using GPU support in Ubuntu 16.04 with Python 2.7 (with Anaconda). I followed the CUDA and cuDNN instructions, and have all the verifications successfully passing. Here's the passing result of the last catch-all test:
```
./mnistCUDNN
cudnnGetVersion() : 7301 , CUDNN_VERSION from cudnn.h : 7301 (7.3.1)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28 Capabilities 6.1, SmClock 1582.0 Mhz, MemSize (Mb) 11164, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.021504 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027648 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.034816 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.069632 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.070656 time requiring 207360 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.024480 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027488 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.047104 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.068608 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.069632 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
```
So I think I have CUDA running properly. I've also installed TuriCreate as per [here](https://github.com/apple/turicreate#installation) (using a conda virtual env) and followed the Linux GPU information [here](https://github.com/apple/turicreate/blob/master/LinuxGPU.md), uninstalling mxnet as specified but installing -cu90 instead of -cu80 because I'm using CUDA 9.0. Below are the commands I used:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
(venv) pip uninstall -y mxnet
(venv) pip install mxnet-cu90==1.1.0
I also double-checked that `cuda` is symlinked properly (`cuda -> cuda-9.0`) for CUDA 9.0.
The problem is what I get when I run the training script (the 'Introductory Example', [here](https://apple.github.io/turicreate/docs/userguide/object_detection/)):
python training.py
Traceback (most recent call last):
File "training.py", line 10, in <module>
model = tc.object_detector.create(train_data)
File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/object_detector.py", line 181, in create
from ._mx_detector import YOLOLoss as _YOLOLoss
File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/_mx_detector.py", line 9, in <module>
import mxnet as _mx
ImportError: No module named mxnet
Notice that it says it can't find mxnet. Even with `turicreate.config.set_num_gpus(1)` at the top, it seems to want to use mxnet instead of mxnet-cu90.
So I'm thinking I'm missing something that remaps/links mxnet to the mxnet-cu90 (version 1.1.0) that I have installed.
But that's just my current best guess, and I don't know where to go from here.
Thoughts?
Thanks in advance,
Brandon
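On the question above of whether something has to map `mxnet` to `mxnet-cu90`: the pip distribution name and the importable package name are separate things, and a distribution records which top-level packages it provides in its `top_level.txt`. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`, so not the Python 2.7 used in this thread); `distributions_providing` is a hypothetical helper name, not part of any library.

```python
# Sketch only: assumes Python 3.8+ (importlib.metadata is not available on 2.7).
from importlib import metadata


def distributions_providing(top_level_name, search_path=None):
    """Return sorted names of installed distributions whose top_level.txt
    lists `top_level_name` (hypothetical helper for illustration)."""
    # With no search_path, importlib.metadata scans sys.path.
    kwargs = {} if search_path is None else {"path": list(search_path)}
    owners = set()
    for dist in metadata.distributions(**kwargs):
        # read_text returns None when the distribution has no top_level.txt.
        listing = dist.read_text("top_level.txt") or ""
        if top_level_name in listing.split():
            name = dist.metadata.get("Name")
            if name:
                owners.add(name)
    return sorted(owners)


if __name__ == "__main__":
    # An empty list here while mxnet-cuXX shows up in `pip list` would point
    # at a broken install rather than at a missing mapping.
    print(distributions_providing("mxnet"))
```

If this prints a distribution name but `import mxnet` still fails, the files on disk are probably gone even though the metadata remains.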
A quick thought: Should I be using Python 3.x instead? Apple seemed to recommend Python 2.x, so I went with that on this fresh install/effort.
I've found this so far (here). Going to see if it's indeed that sudo issue for pip:
I figured out what the problem was:
mxnet wasn't installed correctly due to lack of permissions. In step 5 you need to type: sudo pip install mxnet-cu80 instead of just
"pip install mxnet-cu80"
Nope, that has no change for me.
Maybe I found something:
nvcc --version returns:
The program 'nvcc' is currently not installed. You can install it by typing:
sudo apt install nvidia-cuda-toolkit
However, when I cd to /usr/local/cuda-9.0/ and cat version.txt, I see I have CUDA Version 9.0.176
Is nvcc not being recognized the issue?
Added it to the path manually... not sure why it wasn't in there to start.
export PATH="/usr/local/cuda-9.0/bin:$PATH"
(venv) gilles@learner:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
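The PATH check above can also be done from Python, which helps confirm that the interpreter running the training script sees the same tools as the shell. This is a rough sketch; `find_cuda_tool` is a made-up helper name and `/usr/local/cuda/bin` is only the default location assumed from the paths mentioned in this thread.

```python
import os
import shutil


def find_cuda_tool(tool="nvcc", cuda_bin_dirs=("/usr/local/cuda/bin",)):
    """Return the path to `tool` found on PATH, else in one of
    `cuda_bin_dirs`, else None (hypothetical helper)."""
    found = shutil.which(tool)
    if found:
        return found
    # Fall back to the candidate CUDA bin directories, which may not be on PATH.
    fallback_path = os.pathsep.join(cuda_bin_dirs)
    return shutil.which(tool, path=fallback_path)


if __name__ == "__main__":
    location = find_cuda_tool()
    print(location if location else "nvcc not found on PATH or in the fallback dirs")
```

A hit only in the fallback directory means the toolkit is installed but its `bin` directory is missing from PATH, which matches what happened here.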
And still the same error:
(venv) gilles@learner:~/Downloads$ python training.py
Traceback (most recent call last):
File "training.py", line 12, in <module>
model = tc.object_detector.create(train_data)
File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/object_detector.py", line 181, in create
from ._mx_detector import YOLOLoss as _YOLOLoss
File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/_mx_detector.py", line 9, in <module>
import mxnet as _mx
ImportError: No module named mxnet
As a sanity check, what packages appear alongside turicreate in /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/? Is it possible that something is misconfigured in your environment/path, and which pip is not using the one from your virtual environment?
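One way to run that sanity check from inside Python is to compare where the running interpreter and the `pip` on PATH live. A small sketch, with `same_environment` as a hypothetical helper; it deliberately does not resolve symlinks, since the `python` inside a virtual environment is often a symlink to a base interpreter elsewhere.

```python
import os
import shutil
import sys


def env_bin_dir(executable_path):
    """Directory that holds the executable, without resolving symlinks."""
    return os.path.dirname(os.path.abspath(executable_path))


def same_environment(python_path, pip_path):
    """True when both executables sit in the same bin directory."""
    return env_bin_dir(python_path) == env_bin_dir(pip_path)


if __name__ == "__main__":
    pip_path = shutil.which("pip")
    if pip_path is None:
        print("no pip found on PATH")
    else:
        print("python:", sys.executable)
        print("pip:   ", pip_path)
        print("same environment:", same_environment(sys.executable, pip_path))
```

Running `python -m pip install ...` instead of bare `pip install ...` sidesteps the question, because it always uses the pip belonging to that interpreter.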
Thanks for the help nickjong!
So I'm glad you asked this as I was actually looking at it as well (which means maybe I'm on the right track!).
which pip
/home/gilles/.conda/envs/venv/bin/pip
Here are the mx listings in there:
```
(venv) gilles@learner:~/.conda/envs/venv/lib/python2.7/site-packages$ ls mx*
mxnet:
attribute.py context.py executor_manager.pyc io.py libmxnet.so misc.pyc ndarray_doc.py random.py symbol visualization.pyc
attribute.pyc context.pyc executor.py io.pyc libquadmath.so.0 model.py ndarray_doc.pyc random.pyc symbol_doc.py
autograd.py contrib executor.pyc kvstore.py log.py model.pyc notebook recordio.py symbol_doc.pyc
autograd.pyc _ctypes gluon kvstore.pyc log.pyc module operator.py recordio.pyc test_utils.py
base.py _cy2 image kvstore_server.py lr_scheduler.py monitor.py operator.pyc registry.py test_utils.pyc
base.pyc _cy3 initializer.py kvstore_server.pyc lr_scheduler.pyc monitor.pyc optimizer.py registry.pyc tools
callback.py engine.py initializer.pyc libgfortran.so.3 metric.py name.py optimizer.pyc rnn torch.py
callback.pyc engine.pyc __init__.py libinfo.py metric.pyc name.pyc profiler.py rtc.py torch.pyc
COMMIT_HASH executor_manager.py __init__.pyc libinfo.pyc misc.py ndarray profiler.pyc rtc.pyc visualization.py
mxnet_cu90-1.1.0.dist-info:
DESCRIPTION.rst INSTALLER METADATA metadata.json RECORD top_level.txt WHEEL
```
So it actually appears that it is indeed properly there.
So funnily enough, after a restart (and perhaps, unfortunately, other undocumented prodding), it is actually now running, but crashing. I think you're right that there was something in the path or otherwise in the virtual environment that just hadn't gotten updated.
Here's what I get now, when running the mxtest.py script (reproduced further below):
```
python mxtest.py
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
[0]
terminate called after throwing an instance of 'dmlc::Error'
what(): [10:31:28] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:196: Check failed: e == cudaSuccess CUDA: unknown error
Stack trace returned 9 entries:
[bt] (0) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f93b9898e78]
[bt] (1) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f93b9899288]
[bt] (2) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x24318ab) [0x7f93bba208ab]
[bt] (3) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2442c27) [0x7f93bba31c27]
[bt] (4) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2442db6) [0x7f93bba31db6]
[bt] (5) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x243f68b) [0x7f93bba2e68b]
[bt] (6) /home/gilles/.conda/envs/venv/bin/../lib/libstdc++.so.6(+0xb8678) [0x7f93a5530678]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f93f189b6ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f93f0ec141d]
Aborted (core dumped)
```
And that's with numpy-1.13.3. Weird that it's incompatible, as mxnet-cu90-1.1.0 says 'has requirement numpy<=1.13.3'. Installing the latest (1.15.4) version of numpy throws the same error... so not sure there.
And I'm not sure if the numpy error is relevant to the crash but I don't think it is.
So in the mxtest.py script, I had it print the number of GPUs available (print(mx.test_utils.list_gpus())), and it gives an answer of [0].
So I think the crash is because mxnet is not finding a GPU to run on, strangely.
Looking into this a bit more, the only real hit I could find is that mxnet + turicreate + CUDA 9.+ is broken:
https://medium.com/@nickzamosenchuk/training-the-model-for-ios-coreml-in-google-colab-60-times-faster-6b3d1669fc46
So at this point I'm going to try CUDA 8 and cuDNN 5, instead of the CUDA 9 and cuDNN 7 that I have installed now.
Thoughts?
Thanks again!
And here's the basic little mxtest script:
import mxnet as mx
print(mx.test_utils.list_gpus())
a = mx.nd.ones((2, 3), mx.gpu())
b = a * 2 + 1
print(b.asnumpy())
print('Done')
Yep, that fixed it. I installed CUDA 8, cuDNN 5 for CUDA 8, set all the paths again (removing CUDA 9 specific stuff, and actually leveraging symbolic links this time, FWIW), and it runs now:
(venv) gilles@learner:~/Downloads$ python mxtest.py
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
[0]
hello
Funnily enough, I still have the numpy issue. Going to see about solving that, but it doesn't seem to be a related problem.
I tried an older version (1.12.1) of numpy (mxnet-cu80 1.1.0 says to use numpy <= 1.13.3), but still get similar:
(venv) gilles@learner:~/Downloads$ python mxtest.py
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa
[0]
[[ 3. 3. 3.]
[ 3. 3. 3.]]
Done
Anyways, at least the output is what it should be from the GPU (see here).
Oh and curiously, it still returns 0 for the number of GPUs... maybe I'm calling that function wrong.
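A note on the `[0]` output: as far as the MXNet docstring describes it, `mx.test_utils.list_gpus()` returns a list of device indices (`[0, 1, ..., n-1]` for n GPUs, `[]` for none), not a count. Under that reading, `[0]` means one GPU with index 0. The sketch below avoids importing mxnet so it runs anywhere; `gpu_count` and `summarize_gpus` are hypothetical helpers that take the returned list as input.

```python
def gpu_count(gpu_indices):
    """Number of GPUs implied by a list of device indices."""
    return len(list(gpu_indices))


def summarize_gpus(gpu_indices):
    """Human-readable reading of a list_gpus()-style result."""
    indices = list(gpu_indices)
    if not indices:
        return "no GPUs detected"
    return "{} GPU(s) detected, device indices {}".format(len(indices), indices)


if __name__ == "__main__":
    # The value printed by mxtest.py in this thread.
    print(summarize_gpus([0]))
```

With mxnet installed, the equivalent call would be `gpu_count(mx.test_utils.list_gpus())`.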
The numpy issue was solved by installing the latest version, as below:
(venv) gilles@learner:~/Downloads$ pip install --no-cache-dir -U numpy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/de/37/fe7db552f4507f379d81dcb78e58e05030a8941757b1f664517d581b5553/numpy-1.15.4-cp27-cp27mu-manylinux1_x86_64.whl (13.8MB)
100% |████████████████████████████████| 13.8MB 10.4MB/s
turicreate 5.1 requires mxnet<1.2.0,>=1.1.0, which is not installed.
mxnet-cu80 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.15.4 which is incompatible.
mxnet-cu90 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.15.4 which is incompatible.
Installing collected packages: numpy
Found existing installation: numpy 1.12.1
Uninstalling numpy-1.12.1:
Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.15.4
Note however that this is counter-intuitive given the error above, that this numpy is incompatible w/ mxnet-cu80 1.1.0.
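A possible explanation for why upgrading helped: the `RuntimeError` compares numpy C-API versions, not release numbers. It says some compiled extension was built against API `0xc`, while the installed numpy offered only `0xb` (with 1.13.3) or `0xa` (with 1.12.1), so the installed numpy was too old for that extension. Which extension that was is not identified in this thread. The sketch below just parses the message and reports the direction of the mismatch; the helper names are made up for illustration.

```python
import re

# Matches the message numpy emits when an extension was built against a
# different C-API version than the numpy that is installed.
_MISMATCH = re.compile(
    r"compiled against API version (0x[0-9a-fA-F]+) "
    r"but this version of numpy is (0x[0-9a-fA-F]+)"
)


def parse_api_mismatch(message):
    """Return (compiled_api, runtime_api) as ints, or None if not found."""
    match = _MISMATCH.search(message)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def suggest_action(message):
    """Very rough hint based on which side of the mismatch is newer."""
    parsed = parse_api_mismatch(message)
    if parsed is None:
        return "no API mismatch in message"
    compiled_api, runtime_api = parsed
    if compiled_api > runtime_api:
        return "installed numpy is older than the extension expects: upgrade numpy"
    return "installed numpy is not older than the extension expects"


if __name__ == "__main__":
    msg = ("RuntimeError: module compiled against API version 0xc "
           "but this version of numpy is 0xb")
    print(parse_api_mismatch(msg))
    print(suggest_action(msg))
```

This would also explain why pinning numpy to `<=1.13.3` to satisfy the mxnet requirement kept the error around.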
I also just noticed that mxnet-cu90 is still in there... removing that now.
Well, apparently that was a bad idea. Removing mxnet-cu90 killed it.
Sweet. I figured that out, which actually is a super-set solution to the previous problem of Python not finding mxnet when mxnet-cuxx is installed.
What's the problem?
If you uninstall any of the mxnet-cuxx versions, Python now thinks there is no mxnet installed, even if there's another mxnet-cuxx installed (or even reinstalled).
The way to fix it is to uninstall all the mxnet-cuxx versions, reinstall standard mxnet (which convinces Python that there is an mxnet, and sets up some dependencies, I bet). Then uninstall it and install mxnet-cuxx.
That fixed it for me at least!
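The sequence described above can be written down as an explicit plan. A likely reason it works is that `mxnet` and every `mxnet-cuXX` wheel install into the same `mxnet/` directory (the `ls` listing earlier shows `mxnet/` next to `mxnet_cu90-1.1.0.dist-info`), so uninstalling one distribution can delete files the other still needs; that is an inference, not something confirmed in this thread. The sketch only builds the pip commands and does not run them; `reinstall_plan` and its default variant list are assumptions for illustration.

```python
import sys


def reinstall_plan(gpu_package, version,
                   known_variants=("mxnet-cu80", "mxnet-cu90")):
    """Build (but do not run) the pip commands for the fix described above.

    `gpu_package` is the variant to end up with, e.g. "mxnet-cu80".
    `known_variants` is an assumed list of variants that may be installed.
    """
    # Use the current interpreter's pip so the right environment is modified.
    pip = [sys.executable, "-m", "pip"]
    plan = [pip + ["uninstall", "-y", name] for name in known_variants]
    plan.append(pip + ["install", "mxnet==" + version])
    plan.append(pip + ["uninstall", "-y", "mxnet"])
    plan.append(pip + ["install", gpu_package + "==" + version])
    return plan


if __name__ == "__main__":
    for command in reinstall_plan("mxnet-cu80", "1.1.0"):
        print(" ".join(command))
```

Each printed line can be pasted into the activated environment, or passed to `subprocess.check_call` once the plan looks right.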