Turicreate: GPU Support | Ubuntu 16.04 | GEFORCE GTX 1080 Ti

Created on 7 Nov 2018 · 12 Comments · Source: apple/turicreate

Hi all!

_In short:_
import mxnet returns:
ImportError: No module named mxnet
I have mxnet-cu90 version 1.1.0 installed. Not sure why it isn't being found.

_In long:_
I'm having a problem using GPU support in Ubuntu 16.04 with Python 2.7 (with Anaconda). I followed the CUDA and cuDNN instructions, and have all the verifications successfully passing. Here's the passing result of the last catch-all test:
```
./mnistCUDNN
cudnnGetVersion() : 7301 , CUDNN_VERSION from cudnn.h : 7301 (7.3.1)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28 Capabilities 6.1, SmClock 1582.0 Mhz, MemSize (Mb) 11164, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.021504 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027648 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.034816 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.069632 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.070656 time requiring 207360 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.024480 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.027488 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.047104 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.068608 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.069632 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

```

So I think I have CUDA running properly. I've also installed TuriCreate as per [here](https://github.com/apple/turicreate#installation) (using a conda virtual env) and followed the Linux GPU instructions [here](https://github.com/apple/turicreate/blob/master/LinuxGPU.md), except that I used -cu90 instead of -cu80 because I'm on CUDA 9.0. I uninstalled mxnet first, as specified. Below are the commands I used:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
(venv) pip uninstall -y mxnet
(venv) pip install mxnet-cu90==1.1.0

I also double-checked that `cuda` is symlinked properly (`cuda -> cuda-9.0`) for CUDA 9.0.
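
For anyone who wants to repeat that symlink check from Python, here is a minimal sketch. It assumes the toolkit lives under /usr/local/cuda as in this post, and `resolve_cuda_link` is just a made-up helper name, not part of any library.

```python
import os


def resolve_cuda_link(link_path="/usr/local/cuda"):
    """Return the real directory link_path points to, or None if it is missing."""
    if not os.path.exists(link_path):
        return None
    # realpath follows symlinks, so cuda -> cuda-9.0 shows up as .../cuda-9.0
    return os.path.realpath(link_path)


# Prints the resolved directory, or None when /usr/local/cuda does not exist.
print(resolve_cuda_link())
```

If the printed directory is not the CUDA version that the installed mxnet-cuXX wheel was built for, that mismatch is worth ruling out first.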

The problem is what I get when I run the training script (the 'Introductory Example', [here](https://apple.github.io/turicreate/docs/userguide/object_detection/)):

python training.py
Traceback (most recent call last):
  File "training.py", line 10, in <module>
    model = tc.object_detector.create(train_data)
  File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/object_detector.py", line 181, in create
    from ._mx_detector import YOLOLoss as _YOLOLoss
  File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/_mx_detector.py", line 9, in <module>
    import mxnet as _mx
ImportError: No module named mxnet
Notice that it says it can't find mxnet. Even with `turicreate.config.set_num_gpus(1)` at the top, it seems to want to use mxnet instead of mxnet-cu90.

So I'm thinking I'm missing something that remaps/links mxnet to the mxnet-cu90 (version 1.1.0) that I have installed.

But that's just my current best guess, and don't know where to go from here.

Thoughts?

Thanks in advance,
Brandon

question toolkits

Luxonis-Brandon

Most helpful comment

Sweet. I figured that out, which actually is a super-set solution to the previous problem of Python not finding mxnet when mxnet-cuxx is installed.

What's the problem?

If you uninstall any of the mxnet-cuxx versions, Python now thinks there is no mxnet installed, even if there's another mxnet-cuxx installed (or even reinstalled).

The way to fix it is to uninstall all the mxnet-cuxx versions, reinstall standard mxnet (which convinces Python that there is an mxnet, and sets up some dependencies, I bet). Then uninstall it and install mxnet-cuxx.

That fixed it for me at least!
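
The reinstall order described above can be sketched as a small Python 3 script. This is only an illustration under assumptions: the helper names `reinstall_commands` and `run` are made up, the list of stale packages and the 1.1.0 pin are taken from what shows up elsewhere in this thread, and by default nothing is executed (dry run), the commands are only printed.

```python
import subprocess
import sys


def reinstall_commands(gpu_package="mxnet-cu80", version="1.1.0",
                       stale=("mxnet-cu80", "mxnet-cu90")):
    """Build the pip commands for: drop all GPU builds, install plain mxnet,
    drop it again, then install the one GPU build that should remain."""
    pip = [sys.executable, "-m", "pip"]  # pip of the interpreter running this
    commands = [pip + ["uninstall", "-y", name] for name in stale]
    commands.append(pip + ["install", "mxnet==" + version])
    commands.append(pip + ["uninstall", "-y", "mxnet"])
    commands.append(pip + ["install", gpu_package + "==" + version])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)


run(reinstall_commands())  # dry run: prints the five commands, changes nothing
```

Calling `run(reinstall_commands(), dry_run=False)` inside the activated environment would actually perform the steps. Using `sys.executable -m pip` is a deliberate choice so that the pip in use belongs to the same interpreter that will later import mxnet.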

Luxonis-Brandon on 12 Nov 2018

👍2

All 12 comments

A quick thought: Should I be using Python 3.x instead? Apple seemed to recommend Python 2.x, so I went with that on this fresh install/effort.

Luxonis-Brandon on 7 Nov 2018

I've found this so far (here). Going to see if it's indeed that sudo issue for pip:

I figured out what the problem was:
mxnet wasn't installed correctly due to lack of permissions.

In step 5 you need to type "sudo pip install mxnet-cu80" instead of just
"pip install mxnet-cu80".

Luxonis-Brandon on 7 Nov 2018

Nope, that made no difference for me.

Luxonis-Brandon on 7 Nov 2018

Maybe I found something:
nvcc --version returns:
The program 'nvcc' is currently not installed. You can install it by typing:
sudo apt install nvidia-cuda-toolkit

However, when I cd to /usr/local/cuda-9.0/ and cat version.txt, I see I have CUDA Version 9.0.176

Is nvcc not being recognized the issue?
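
A missing nvcc on PATH does not by itself mean CUDA is absent, as the version.txt check shows. Here is a small Python 3 sketch for checking where (or whether) a tool resolves. `locate_tool` is a hypothetical helper, and the /usr/local/cuda/bin default is an assumption based on the paths in this thread.

```python
import os
import shutil


def locate_tool(name, extra_dirs=("/usr/local/cuda/bin",)):
    """Look for an executable on PATH first, then in extra_dirs.

    Returns the full path, or None if it is not found anywhere.
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = shutil.which(name, path=directory)
        if candidate:
            return candidate
    return None


print("nvcc:", locate_tool("nvcc"))
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
```

If the tool only turns up through `extra_dirs`, it is installed but its directory is not exported in PATH for the current shell.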

Luxonis-Brandon on 7 Nov 2018

Added it to the path manually... not sure why it wasn't in there to start.

export PATH="/usr/local/cuda-9.0/bin:$PATH"
(venv) gilles@learner:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

And still the same error:

(venv) gilles@learner:~/Downloads$ python training.py
Traceback (most recent call last):
  File "training.py", line 12, in <module>
    model = tc.object_detector.create(train_data)
  File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/object_detector.py", line 181, in create
    from ._mx_detector import YOLOLoss as _YOLOLoss
  File "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages/turicreate/toolkits/object_detector/_mx_detector.py", line 9, in <module>
    import mxnet as _mx
ImportError: No module named mxnet

Luxonis-Brandon on 7 Nov 2018

As a sanity check, what packages appear alongside turicreate in /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/? Is it possible that something is misconfigured in your environment/path, and which pip is not using the one from your virtual environment?
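
One way to answer this from inside Python itself is sketched below. It reports which interpreter is running and which file a module would be imported from. `where_is` is a made-up helper name, and catching `OSError` next to `ImportError` is an assumption, meant to cover native libraries that fail to load.

```python
import importlib
import sys


def where_is(module_name):
    """Return the file a module is imported from, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        return None
    # Built-in modules have no __file__, so fall back to None for those too.
    return getattr(module, "__file__", None)


print("interpreter:", sys.executable)
print("prefix:", sys.prefix)
print("mxnet:", where_is("mxnet"))
```

If `interpreter` and `prefix` point outside the conda environment, the script is being run by a different Python than the one pip installed into.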

nickjong on 9 Nov 2018

Thanks for the help nickjong!

So I'm glad you asked this as I was actually looking at it as well (which means maybe I'm on the right track!).

which pip
/home/gilles/.conda/envs/venv/bin/pip

Here's the mx listings in there:
```
(venv) gilles@learner:~/.conda/envs/venv/lib/python2.7/site-packages$ ls mx*
mxnet:
attribute.py context.py executor_manager.pyc io.py libmxnet.so misc.pyc ndarray_doc.py random.py symbol visualization.pyc
attribute.pyc context.pyc executor.py io.pyc libquadmath.so.0 model.py ndarray_doc.pyc random.pyc symbol_doc.py
autograd.py contrib executor.pyc kvstore.py log.py model.pyc notebook recordio.py symbol_doc.pyc
autograd.pyc _ctypes gluon kvstore.pyc log.pyc module operator.py recordio.pyc test_utils.py
base.py _cy2 image kvstore_server.py lr_scheduler.py monitor.py operator.pyc registry.py test_utils.pyc
base.pyc _cy3 initializer.py kvstore_server.pyc lr_scheduler.pyc monitor.pyc optimizer.py registry.pyc tools
callback.py engine.py initializer.pyc libgfortran.so.3 metric.py name.py optimizer.pyc rnn torch.py
callback.pyc engine.pyc __init__.py libinfo.py metric.pyc name.pyc profiler.py rtc.py torch.pyc
COMMIT_HASH executor_manager.py __init__.pyc libinfo.pyc misc.py ndarray profiler.pyc rtc.pyc visualization.py

mxnet_cu90-1.1.0.dist-info:
DESCRIPTION.rst INSTALLER METADATA metadata.json RECORD top_level.txt WHEEL
```

So it actually appears that it is indeed properly there.
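
The listing also hints at why the names differ: the pip distribution is called mxnet-cu90, but the importable package directory is plain `mxnet`. A rough sketch for finding which installed distributions claim a given import name is to read each `top_level.txt`, as below. `distributions_providing` is a hypothetical helper, and it assumes the wheel's `top_level.txt` lists `mxnet`, which this thread does not show.

```python
import os


def distributions_providing(module_name, site_packages):
    """List *.dist-info directories whose top_level.txt names module_name."""
    matches = []
    for entry in sorted(os.listdir(site_packages)):
        if not entry.endswith(".dist-info"):
            continue
        top_level = os.path.join(site_packages, entry, "top_level.txt")
        if not os.path.isfile(top_level):
            continue
        with open(top_level) as handle:
            names = [line.strip() for line in handle]
        if module_name in names:
            matches.append(entry)
    return matches


# Example call, using the site-packages path from this thread:
# distributions_providing(
#     "mxnet", "/home/gilles/.conda/envs/venv/lib/python2.7/site-packages")
```

More than one match would mean several distributions share the same `mxnet` directory, so uninstalling one of them can remove files the other still needs.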

So funnily enough, after a restart (and perhaps, unfortunately, other undocumented prodding), it is actually now running, but crashing. I think you're right that there was something in the path or otherwise in the virtual environment that just hadn't gotten updated.

Here's what I get now, when running the mxtest.py script (reproduced further below):
```
python mxtest.py
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
[0]
terminate called after throwing an instance of 'dmlc::Error'
  what():  [10:31:28] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:196: Check failed: e == cudaSuccess CUDA: unknown error

Stack trace returned 9 entries:
[bt] (0) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f93b9898e78]
[bt] (1) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f93b9899288]
[bt] (2) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x24318ab) [0x7f93bba208ab]
[bt] (3) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2442c27) [0x7f93bba31c27]
[bt] (4) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2442db6) [0x7f93bba31db6]
[bt] (5) /home/gilles/.conda/envs/venv/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x243f68b) [0x7f93bba2e68b]
[bt] (6) /home/gilles/.conda/envs/venv/bin/../lib/libstdc++.so.6(+0xb8678) [0x7f93a5530678]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f93f189b6ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f93f0ec141d]


Aborted (core dumped)
```

And that's with numpy-1.13.3. Weird that it's incompatible, as mxnet-cu90-1.1.0 says 'has requirement numpy<=1.13.3'. Installing the latest (1.15.4) version of numpy throws the same error... so not sure there.

And I'm not sure if the numpy error is relevant to the crash but I don't think it is.

So in the mxtest.py script, I had it print the number of GPUs available (print(mx.test_utils.list_gpus())), and it gives an answer of [0].

So I think the crash is because mxnet is not finding a GPU to run on, strangely.

Looking into this a bit more, the only real hit I could find is that mxnet + turicreate + CUDA 9.+ is broken:
https://medium.com/@nickzamosenchuk/training-the-model-for-ios-coreml-in-google-colab-60-times-faster-6b3d1669fc46

So at this point I'm going to try CUDA 8 and cuDNN 5, instead of the CUDA 9 and cuDNN 7 that I have installed now.

Thoughts?

Thanks again!

And here's the basic little mxtest script:

import mxnet as mx

print(mx.test_utils.list_gpus())

a = mx.nd.ones((2, 3), mx.gpu())
b = a * 2 + 1

print(b.asnumpy())

print('Done')

Luxonis-Brandon on 12 Nov 2018

Yep, that fixed it. I installed CUDA 8, cuDNN 5 for CUDA 8, set all the paths again (removing CUDA 9 specific stuff, and actually leveraging symbolic links this time, FWIW), and it runs now:

(venv) gilles@learner:~/Downloads$ python mxtest.py 
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb
[0]
hello

Funnily enough, I still have the numpy issue. Going to see about solving that, but it doesn't seem to be a related problem.

I tried an older version (1.12.1) of numpy (mxnet-cu80 1.1.0 says to use numpy <= 1.13.3), but still get similar:

(venv) gilles@learner:~/Downloads$ python mxtest.py 
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa
[0]
[[ 3.  3.  3.]
 [ 3.  3.  3.]]
Done

Anyways, at least the output is what it should be from the GPU (see here).

Luxonis-Brandon on 12 Nov 2018

Oh and curiously, it still returns 0 for the number of GPUs... maybe I'm calling that function wrong.
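
A possible explanation, offered as an assumption rather than something confirmed in this thread: `mx.test_utils.list_gpus()` appears to return the list of GPU device indices, not a count, so `[0]` would mean one GPU with index 0. The sketch below shows that reading with a made-up helper, `describe_gpus`. It does not import mxnet, so it runs anywhere.

```python
def describe_gpus(gpu_ids):
    """Summarise a list of device indices such as the one printed above."""
    gpu_ids = list(gpu_ids)
    return "{} GPU(s), device ids {}".format(len(gpu_ids), gpu_ids)


print(describe_gpus([0]))  # 1 GPU(s), device ids [0]
print(describe_gpus([]))   # 0 GPU(s), device ids []
```

Under that reading, an empty list would be the sign of no GPU being detected, and it is worth checking against the docstring of the installed mxnet version.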

Luxonis-Brandon on 12 Nov 2018

The numpy issue was solved by installing the latest version, as below:

(venv) gilles@learner:~/Downloads$ pip install --no-cache-dir -U numpy
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/de/37/fe7db552f4507f379d81dcb78e58e05030a8941757b1f664517d581b5553/numpy-1.15.4-cp27-cp27mu-manylinux1_x86_64.whl (13.8MB)
    100% |████████████████████████████████| 13.8MB 10.4MB/s 
turicreate 5.1 requires mxnet<1.2.0,>=1.1.0, which is not installed.
mxnet-cu80 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.15.4 which is incompatible.
mxnet-cu90 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.15.4 which is incompatible.
Installing collected packages: numpy
  Found existing installation: numpy 1.12.1
    Uninstalling numpy-1.12.1:
      Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.15.4

Note however that this is counter-intuitive given the error above, that this numpy is incompatible w/ mxnet-cu80 1.1.0.
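
One reading that fits the outputs in this thread: the message compares the NumPy C API version some compiled module was built against (0xc) with the one provided by the installed numpy (0xa with 1.12.1, 0xb with 1.13.3), and it goes away once numpy is new enough (1.15.4 here). Which module triggers it is not shown in the thread. A minimal sketch for pulling the two numbers out of the message follows, with hypothetical helper names:

```python
import re

_PATTERN = re.compile(
    r"API version (0x[0-9a-fA-F]+) but this version of numpy is (0x[0-9a-fA-F]+)"
)


def parse_api_mismatch(message):
    """Return (compiled_against, installed) as integers, or None if no match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def numpy_too_old(message):
    """True when the module was built against a newer C API than the one installed."""
    parsed = parse_api_mismatch(message)
    return parsed is not None and parsed[0] > parsed[1]


msg = "RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb"
print(parse_api_mismatch(msg), numpy_too_old(msg))
```

When `numpy_too_old` is true, downgrading numpy (as tried with 1.12.1) moves in the wrong direction, which matches what happened above.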

I also just noticed that mxnet-cu90 is still in there... removing that now.

Luxonis-Brandon on 12 Nov 2018

Well, apparently that was a bad idea. Removing mxnet-cu90 killed it.

Luxonis-Brandon on 12 Nov 2018

Sweet. I figured that out, which actually is a super-set solution to the previous problem of Python not finding mxnet when mxnet-cuxx is installed.

What's the problem?

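If you uninstall any of the mxnet-cuxx versions, Python now thinks there is no mxnet installed, even if there's another mxnet-cuxx installed (or even reinstalled).

The way to fix it is to uninstall all the mxnet-cuxx versions, reinstall standard mxnet (which convinces Python that there is an mxnet, and sets up some dependencies, I bet). Then uninstall it and install mxnet-cuxx.

That fixed it for me at least!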

Luxonis-Brandon on 12 Nov 2018

👍2
