The error happens only when I try to use compute_face_descriptor; all other Python bindings work fine as far as I can tell.
The initialization is:
RESNET_MODEL = '/mnt/d1/faces_1/models/dlib_face_recognition_resnet_model_v1.dat'
facerec = dlib.face_recognition_model_v1(RESNET_MODEL)
Here is the result:
````
File "alignmentCheck.py", line 26, in avgEuclideanDistanceCalculation
fps.append(facerec.compute_face_descriptor(img, shape, 10))
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/ubuntu/src/dlib-19.4/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "alignmentCheck.py", line 51, in
for x in p.imap_unordered(avgEuclideanDistanceCalculation, df['starId'].unique()):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 695, in next
raise value
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/ubuntu/src/dlib-19.4/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error
````
This is EC2 p2.xlarge instance with Tesla K80 GPU
````
$ nvidia-smi
Wed May 31 14:55:17 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 37C P0 55W / 149W | 0MiB / 11439MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ cat /usr/local/cuda/version.txt
CUDA Version 8.0.61
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
$ uname -a
Linux ip-172-31-23-151 4.4.0-1017-aws #26-Ubuntu SMP Fri Apr 28 19:48:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
````
That all looks fine, except that nvidia-smi is showing 99% GPU utilization when nothing is running. That sounds bad. Probably something is messed up with the system.
But in any case, it should work. If you are getting initialization errors, usually something is broken in your system's CUDA install or GPU hardware.
Here is how TF is managing on the same instance:
````
$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2017-05-31 17:26:02.464158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464197: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464212: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464226: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:02.464241: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-31 17:26:05.418360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-05-31 17:26:05.418872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-05-31 17:26:05.418902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-05-31 17:26:05.418918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-05-31 17:26:05.418943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0
2017-05-31 17:26:05.506712: I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0
````
Here is the nvidia-smi output at the same time:
````
$ nvidia-smi
Wed May 31 17:27:26 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 40C P0 53W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2485 C python3 10869MiB |
+-----------------------------------------------------------------------------+
````
It looks like TF forcibly sets the NUMA node to zero. Is it possible that this has something to do with the error?
I'm not familiar with whatever Amazon is running on this machine, so I have no idea.
I managed to resolve this.
It turns out that compute_face_descriptor can only be used from a single process; the GPU cannot be shared by multiple processes calling compute_face_descriptor.
As soon as I removed the multiprocessing, everything started working as expected.
I guess the first process locks the GPU and does not allow simultaneous GPU use.
Here is what nvidia-smi looks like when a single process is running:
````
$ nvidia-smi
Thu Jun 1 01:50:58 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 42C P0 62W / 149W | 191MiB / 11439MiB | 33% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2198 C python3 189MiB |
+-----------------------------------------------------------------------------+
````
Thanks for the support and for the great library!
If this is the expected behavior - feel free to close the issue.
Normally, you can have multiple processes accessing the same GPU. But maybe your system is configured otherwise. In any case, it looks like you found the issue.
I hit the same issue with Python's multiprocessing on Linux. (I can't test on macOS because I don't have a GPU there.) In the main process the GPU can be acquired, but what happens when running the new cnn_face_detector in a subprocess depends on the scenario:
1. If I run the detector in the main process before starting the subprocess, the main process call goes through, but the new process hangs at cnn_face_detector(frame), with noticeable GPU usage.
2. If I run it in the subprocess without running it in the main process first, it raises this error:
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /home/user/dlib/dlib/dnn/gpu_data.cpp:178. code: 3, reason: initialization error
The new process can be a single Process or come from a Pool.
Just confirmed that scenario 1 still happens even in CPU mode, so that should be a separate problem.
I just tested this kind of usage from a simple multi-threaded C++ program and it works fine. The dlib dnn code is also threadsafe in general and tested for that regularly. So I doubt this has anything to do with dlib. You guys are probably running into limitations of python's threading support. Like the global interpreter lock: https://wiki.python.org/moin/GlobalInterpreterLock.
Just tested again. With threading.Thread, everything runs fine. This is where the GIL could get in the way, but as usual the C++ calls get around it. Other calls cannot, so there is no true parallel CPU utilization elsewhere. This might be my temporary workaround.
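A minimal sketch of the thread-based workaround described above, assuming the work can be expressed as one function call per item. It uses concurrent.futures.ThreadPoolExecutor as a convenience in place of raw threading.Thread, and describe is a hypothetical stand-in for the real detector or descriptor call so that the sketch runs without dlib or a GPU. All threads live in one process, so whatever was initialized in that process stays valid for every call.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def describe(item):
    # Hypothetical stand-in for a call into a native library
    # (for example a detector or descriptor call).
    return (item, os.getpid())


def run_threaded(items, workers=4):
    # Every thread runs inside the same process, so nothing is forked.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(describe, items))


if __name__ == "__main__":
    for item, pid in run_threaded(range(4)):
        print(item, "processed in process", pid)
```

Every result carries the same process id, which is the point: no second process ever touches the resource.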
The problems described above occur only with multiprocessing, when new processes are created. I tried loading a new detector in the subprocess by calling the initialization function that deserializes the model. That bypasses the hanging scenario, but then the second scenario occurs: the device cannot be acquired, right at the initialization line. So for some reason cudaGetDevice(&the_device_id) does not work in a different Python process.
Interesting. Tested again. This time, if I load the detector directly inside the subprocesses (there can be several) without loading it in the main process beforehand, all of them work.
I had problems like this before on macOS: if I ran some C++ calls in the main process, some other C++ calls in the subprocesses would segfault. That was due to macOS's forking behavior, as opposed to spawning, while the same code worked fine on Linux. It's not exactly the same case, but it could be an indication that Python's multiprocessing support requires a lot of care.
That's not Python, that's just how subprocesses and forking work in general. You can't allocate arbitrary resources in a process and then fork and assume those resources are still valid. This is especially true for hardware resources like a GPU context.
Anyway, this isn't a dlib problem. You just have to read the manual carefully for the Python multiprocessing features you are using and understand their limitations and how to use them.
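One way to follow that advice is to create the resource inside each worker, after the process has started, instead of in the parent. The sketch below does this with the initializer argument of multiprocessing.Pool. FakeModel, init_worker and work are hypothetical names for illustration; in real code the initializer would construct the actual model (for instance the dlib.face_recognition_model_v1(RESNET_MODEL) call from the original report), which is left out here so the sketch runs without a GPU.

```python
import multiprocessing as mp
import os

_model = None  # one instance per worker process, never created in the parent


class FakeModel:
    """Hypothetical stand-in for a GPU-backed model; it only records
    which process constructed it."""

    def __init__(self):
        self.owner_pid = os.getpid()

    def describe(self, item):
        return (item, self.owner_pid)


def init_worker():
    # Runs once in each worker after it has started, so the resource is
    # created by the process that will actually use it.
    global _model
    _model = FakeModel()


def work(item):
    return _model.describe(item)


def run(items, processes=2):
    with mp.Pool(processes=processes, initializer=init_worker) as pool:
        return list(pool.imap_unordered(work, items))


if __name__ == "__main__":
    parent = os.getpid()
    for item, pid in run(range(4)):
        print(item, "handled by worker", pid, "while the parent is", parent)
```

The parent never builds a FakeModel, so there is nothing for a forked child to inherit in a broken state.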
I had the same problem when using multiprocessing with the cnn_face_detection_v1 module. It looks like the method uses 881 MB of GPU memory. I had a 1070 Ti (8 GB), so I just copied the code file 8 times and divided the data between the copies. Worked like a charm.
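The same split could presumably be done without copying the file, by launching one script several times with a different shard index. This is only a sketch: shard is a hypothetical helper, the command-line arguments and the file names are placeholders, and the detector call itself is left out.

```python
import sys


def shard(items, index, total):
    """Return the items that belong to shard `index` out of `total` shards."""
    if not 0 <= index < total:
        raise ValueError("index must be in range(total)")
    return list(items)[index::total]


if __name__ == "__main__":
    # Hypothetical usage: python worker.py <index> <total>
    # With no arguments, the script processes everything as a single shard.
    args = sys.argv[1:]
    index, total = (int(args[0]), int(args[1])) if len(args) == 2 else (0, 1)
    data = [f"image_{n}.jpg" for n in range(20)]  # placeholder file names
    for name in shard(data, index, total):
        print(name)  # real code would run the detector on this file
```

Each launch is an independent process started from scratch, so none of them inherits GPU state from a parent.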
@terencezl Hi, how did you finally get dlib in Python to work in subprocesses?
@AMilkov @terencezl @shang-vikas @OptimusPrimeCao @axmadjon
If using Python 3.4+ on unix-like platforms, you can use 'spawn' instead of 'fork' to start a process. That will not cause the problem.
Example code from https://docs.python.org/3/library/multiprocessing.html:
````
import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()
````
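The original report used a Pool with imap_unordered rather than a single Process, so here is the same idea sketched for a pool. square is a hypothetical placeholder for the per-item work; the only relevant part is that the pool is created from the 'spawn' context, so each worker is a fresh interpreter rather than a fork of the parent.

```python
import multiprocessing as mp


def square(x):
    # Hypothetical placeholder for the real per-item work.
    return x * x


def run_with_spawn(items, processes=2):
    # Workers created from this context are started with 'spawn'.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return sorted(pool.imap_unordered(square, items))


if __name__ == "__main__":
    print(run_with_spawn([1, 2, 3, 4]))
```

With 'spawn' the `if __name__ == "__main__":` guard is required, because each worker re-imports the main module on startup.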
Hello, I have the same issue and I don't know why. In my case the program runs fine when I start it with python3 index.py, but when I deploy it with nginx following https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-uswgi-and-nginx-on-ubuntu-18-04
the error log looks like this:
cnn_face_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
RuntimeError: Error while calling cudaGetDevice(&the_device_id) in file /root/dlib-master/dlib/cuda/gpu_data.cpp:178. code: 3, reason: initialization error
I think it happens when the program calls cnn_face_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat").
Any clue?
First, you'd better use 'spawn' instead of 'fork' to start processes.
Also, load the cnn_face_detector first (outside of any process).
Hope that works.
https://github.com/davisking/dlib/issues/1013#issuecomment-351728761