Dlib: Error while calling cudaGetDevice(&the_device_id) in file /home/ubuntu/src/dlib-19.4/dlib/dnn/gpu_data.cpp:178

Created on 31 May 2017 · 16 Comments · Source: davisking/dlib

The error happens only when I try to use compute_face_descriptor. All other Python bindings are working fine as far as I can tell.

The initialization is:
RESNET_MODEL = '/mnt/d1/faces_1/models/dlib_face_recognition_resnet_model_v1.dat'
facerec = dlib.face_recognition_model_v1(RESNET_MODEL)

Here is the result

````
File "alignmentCheck.py", line 26, in avgEuclideanDistanceCalculation
fps.append(facerec.compute_face_descriptor(img, shape, 10))
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/ubuntu/src/dlib-19.4/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "alignmentCheck.py", line 51, in
for x in p.imap_unordered(avgEuclideanDistanceCalculation, df['starId'].unique()):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 695, in next
raise value
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/ubuntu/src/dlib-19.4/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error
````

This is an EC2 p2.xlarge instance with a Tesla K80 GPU.

````
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ cat /usr/local/cuda/version.txt
CUDA Version 8.0.61

$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 5
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 10

$ uname -a
Linux ip-172-31-23-151 4.4.0-1017-aws #26-Ubuntu SMP Fri Apr 28 19:48:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
````

AMilkov

👍1

Most helpful comment

@AMilkov @terencezl @shang-vikas @OptimusPrimeCao @axmadjon
If using Python 3.4+ on unix-like platforms, you can use 'spawn' instead of 'fork' to start a process. That will not cause the problem.
Example code from https://docs.python.org/3/library/multiprocessing.html:

import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()

czfhhh on 24 Feb 2018

👍3

All 16 comments

That all looks fine, except that nvidia-smi is showing 99% GPU utilization when nothing is running. That sounds bad. Probably something is messed up with the system.

But in any case, it should work. If you are getting initialization errors, usually something is broken in your system's CUDA install or GPU hardware.

davisking on 31 May 2017

Here is how TF is managing on the same instance:

````
$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2017-05-31 17:26:02.464158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464197: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464212: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464226: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464241: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:05.418360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-05-31 17:26:05.418872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-05-31 17:26:05.418902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-05-31 17:26:05.418918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-05-31 17:26:05.418943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0
2017-05-31 17:26:05.506712: I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0
````

````
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2485    C   python3                                      10869MiB |
+-----------------------------------------------------------------------------+
````

It looks like TF forcefully sets the NUMA node to zero. Is it possible that this has something to do with the error?

AMilkov on 31 May 2017

I'm not familiar with whatever amazon is running on this machine so I have no idea.

davisking on 31 May 2017

I managed to resolve this.

It turns out that compute_face_descriptor can only be used from a single process here; the GPU cannot be shared between multiple compute_face_descriptor processes.
As soon as I removed the multiprocessing, everything started working as expected.
I guess the first process locks the GPU and does not allow simultaneous GPU use.

````
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2198    C   python3                                        189MiB |
+-----------------------------------------------------------------------------+
````

Thanks for the support and for the great library!
If this is the expected behavior - feel free to close the issue.

AMilkov on 1 Jun 2017

Normally, you can have multiple processes accessing the same GPU. But maybe your system is configured otherwise. In any case, it looks like you found the issue.

davisking on 1 Jun 2017

I ran into the same issue with Python's multiprocessing on Linux. I can't test on macOS because I don't have a GPU there. When running in the main process, the GPU can be acquired, but running the new cnn_face_detector in a new process behaves differently in two scenarios:

1. If I run the detector in the main process before starting the subprocess, the main-process call goes through, but the new process hangs at cnn_face_detector(frame), with noticeable GPU usage.
2. If I run it in the subprocess without running it in the main process first, it raises the error:

RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/user/dlib/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error

The new process can be a single process or one from a Pool.

terencezl on 27 Aug 2017

Just confirmed that even in CPU mode, scenario 1 still happens. So that should be a separate problem.

terencezl on 27 Aug 2017

I just tested this kind of usage from a simple multi-threaded C++ program and it works fine. The dlib dnn code is also threadsafe in general and tested for that regularly. So I doubt this has anything to do with dlib. You guys are probably running into limitations of python's threading support. Like the global interpreter lock: https://wiki.python.org/moin/GlobalInterpreterLock.

davisking on 27 Aug 2017

Just tested again. Using threading.Thread, everything runs fine. This is where the GIL could come into play, but as usual the C++ calls seem to get around it. Other Python calls cannot, so true parallel CPU utilization elsewhere is not possible. This might be my temporary workaround.

The problems described above occur only with multiprocessing, when new processes are created. I tried to load a new detector in the subprocess by calling the serialization initialization function. That bypasses the hanging scenario, but then the second scenario occurs: the device cannot be acquired, right at the line of initialization. So for some reason, cudaGetDevice(&the_device_id) does not work in a different Python process.

terencezl on 27 Aug 2017
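
A minimal sketch of the thread-based workaround described above, assuming a single process that loads the model once and shares it across threads. The names `load_model`, `describe`, and `run_threaded` are hypothetical stand-ins so the sketch runs without dlib or a GPU; in the thread's setup the loader would be a call such as `dlib.face_recognition_model_v1(path)`. Whether the native calls actually run in parallel depends on how the bindings handle the GIL, which this sketch does not verify.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def load_model(path):
    # Hypothetical stand-in for the real loader; returns a dummy handle.
    return {"path": path}


def describe(model, item):
    # Placeholder for the per-item work (for example a face descriptor call).
    # Returns the item and the name of the thread that handled it.
    return item, threading.current_thread().name


def run_threaded(items, path, workers=4):
    # The model is loaded once, in the only process, so no GPU state
    # ever has to cross a process boundary.
    model = load_model(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves the order of the input items.
        return list(pool.map(lambda it: describe(model, it), items))
```

Calling `run_threaded(range(5), "model.dat")` returns one `(item, thread_name)` pair per input, in input order.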

Interesting. Tested again. This time, if I load the serialized detector directly inside the subprocesses (there can be several of them) without loading it in the main process beforehand, all of them work.

I had problems like this before on macOS: if I ran some C++ calls in the main process, some other C++ calls in the subprocesses would cause a segfault. It was due to macOS's forking behavior, rather than spawning. But the same code works fine on Linux. That is not exactly this case, but it could be an indication that Python's multiprocessing support requires a lot of care.

terencezl on 27 Aug 2017
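
The finding above (load the detector inside each subprocess, never in the parent) can be sketched as a lazy, per-process loader. This is only an illustration under stated assumptions: `load_detector` and `get_detector` are hypothetical names, and the stand-in loader returns a dummy object so the code runs without dlib. In the thread's setup the loader would be `dlib.cnn_face_detection_model_v1(path)`. The sketch only helps if the parent process has not touched CUDA before forking.

```python
import os

_detector = None    # cached handle for the current process
_owner_pid = None   # PID of the process that created the cached handle


def load_detector(path):
    # Hypothetical stand-in for the real loader; records where it was loaded.
    return {"path": path, "loaded_in": os.getpid()}


def get_detector(path="mmod_human_face_detector.dat"):
    # Load on first use, and reload if the cached handle was inherited
    # from a different process (for example across a fork).
    global _detector, _owner_pid
    pid = os.getpid()
    if _detector is None or _owner_pid != pid:
        _detector = load_detector(path)
        _owner_pid = pid
    return _detector
```

Worker code then calls `get_detector()` at the point of use instead of importing a module-level detector created by the parent.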

That's not python, that's just how subprocesses and forking work in general. You can't allocate arbitrary resources in a process and then fork and assume those resources are still valid. This is especially true for hardware resources like a GPU context.

Anyway, this isn't a dlib problem. You just have to read the manual carefully for the Python multiprocessing features you are using and understand their limitations and how to use them.

davisking on 27 Aug 2017

👍2

I had the same problem when using multiprocessing with the cnn_face_detection_model_v1 module. It looks like the method uses 881 MB of GPU memory, and I have a 1070 Ti (8 GB). I just copied the code file 8 times and divided the data between the copies. It worked like a charm.

shang-vikas on 17 Jan 2018
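
The same idea can be sketched without duplicating the file: one script, launched several times as independent processes, each given a shard index. This is a sketch under assumptions, not the commenter's code; `shard`, `main`, and the file list are hypothetical.

```python
import sys


def shard(items, index, count):
    """Return the slice of items that worker `index` out of `count` should process."""
    if not 0 <= index < count:
        raise ValueError("index must be in the range [0, count)")
    # Strided slicing gives disjoint shards that together cover every item.
    return items[index::count]


def main(argv):
    # Hypothetical command line: python detect.py <index> <count>
    index, count = int(argv[1]), int(argv[2])
    paths = ["img_%03d.jpg" % i for i in range(20)]  # hypothetical input list
    for path in shard(paths, index, count):
        print(path)  # replace with the actual detection call
```

Each launch (for example indices 0 through 7 with a count of 8, started from a shell and calling `main(sys.argv)`) is a fresh process that loads its own model, so no GPU state is shared through a fork.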

@terencezl Hi, how did you finally get dlib to work in Python subprocesses?

OptimusPrimeCao on 2 Feb 2018

Hello, I have the same issue, and I don't know why. The program runs fine when I start it with python3 index.py, but it fails when I deploy it behind nginx following this tutorial: https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-uswgi-and-nginx-on-ubuntu-18-04

The error log looks like this:

cnn_face_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /root/dlib-master/dlib/cuda/gpu_data.cpp:178. code: 3, reason: initialization error

I think it happens when the program calls cnn_face_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat").

Any clue?

blinkbink on 21 Nov 2018

First, you'd better use 'spawn' instead of 'fork' to start the process.
Load the cnn_face_detector first (outside of any process).
Hope that works.
https://github.com/davisking/dlib/issues/1013#issuecomment-351728761