dlib face descriptor generates NaNs when run on Jetson Nano

Created on 5 Apr 2019  ·  19 Comments  ·  Source: davisking/dlib

Expected Behavior

Running example/face_recognition.py prints values of descriptors that are not NaN when running on Nvidia Jetson Nano.

Current Behavior

In reality, when run on the Nvidia Jetson Nano, the descriptors come back as NaNs or very large (~e+15) numbers. Note that the same process works as expected (i.e. descriptor values between 0 and 1) when run on a Jetson TX2.

Steps to Reproduce

Install dlib on Jetson Nano: pip3 install dlib
Run the example code to get face descriptors from the example images:
python3 face_recognition.py

  • Version: 19.17.0
  • Where did you get dlib: pip3 install dlib
  • Platform: Nvidia Jetson Nano, Ubuntu 18.04.2 LTS
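
A quick way to confirm the symptom is to check each descriptor for non-finite or implausibly large components. The following is a minimal sketch, not part of dlib: the helper name `descriptor_looks_valid` and the bound `max_abs=10.0` are assumptions chosen for illustration, and the sample values are made up. It accepts any sequence of floats, for example a descriptor that has been converted to a plain list.

```python
import math

def descriptor_looks_valid(values, max_abs=10.0):
    """Return True if every component is finite and small in magnitude.

    max_abs is an assumed bound for illustration; the report only says
    healthy values lie between 0 and 1 while broken ones are NaN or ~1e+15.
    """
    return all(math.isfinite(v) and abs(v) <= max_abs for v in values)

# Illustrative values only, not real output from dlib.
healthy = [0.12, 0.05, 0.33]
broken = [float("nan"), 2.5e15, 0.1]

print(descriptor_looks_valid(healthy))
print(descriptor_looks_valid(broken))
```

A check like this can be dropped into the example script right after the descriptor is computed, so a bad result is flagged instead of silently printed.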

All 19 comments

That's weird. You don't modify anything?

I don't have a jetson nano to test on so someone else will have to debug this.

I have the same problem. I have been testing the C++ code and get NaN or very large numbers.

That's weird. You don't modify anything?

I don't have a jetson nano to test on so someone else will have to debug this.

Nope, I did not change anything; I used the Python example file face_recognition.py as is. I also compiled the C++ example and got the same result.
I asked @e-fominov, and he suggested debugging the implementation layer by layer. It seems there could be a bug at the input or output end of the CUDA implementation.
Testing is still pending; I will update if I reach a resolution.

Apparently, if you run with cuda-memcheck you get this result:

cuda-memcheck python myApp.py

========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuDevicePrimaryCtxRetain + 0x154) [0x1fd7d4]

========= Host Frame:/usr/local/lib/python3.6/dist-packages/dlib.cpython-36m-aarch64-linux-gnu.so [0x8389c4]

Hello, I also encountered the same NaN problem. Have you found a solution? Please tell me, thank you very much!

I found the same issue on the NVIDIA forum. It seems to be a problem only with the Jetson Nano. Interestingly, face location works with the CNN model, but face encoding does not.

NVidia Forum - issues with dlib library

It appears that NVIDIA is currently looking into the cuDNN libraries on their end, based on their last update in the thread listed above.
FWIW, the memcheck error appears to come from not running the utility as root. I was able to reproduce it by running the utility as a regular user and to resolve it by running the utility as root.

For others reading the thread, NVIDIA's suggested temporary fix is to apply this diff to dlib source and re-compile dlib python extensions:

diff --git a/dlib/cuda/cudnn_dlibapi.cpp b/dlib/cuda/cudnn_dlibapi.cpp
index a32fcf6..6952584 100644
--- a/dlib/cuda/cudnn_dlibapi.cpp
+++ b/dlib/cuda/cudnn_dlibapi.cpp
@@ -851,7 +851,7 @@ namespace dlib
                         dnn_prefer_fastest_algorithms()?CUDNN_CONVOLUTION_FWD_PREFER_FASTEST:CUDNN_CONVOLUTION_FWD_NO_WORKSPACE,
                         std::numeric_limits<size_t>::max(),
                         &forward_best_algo));
-                forward_algo = forward_best_algo;
+                //forward_algo = forward_best_algo;
                 CHECK_CUDNN(cudnnGetConvolutionForwardWorkspaceSize( 
                         context(),
                         descriptor(data),

See https://devtalk.nvidia.com/default/topic/1049660/jetson-nano/issues-with-dlib-library/2

Dang, so it's a bug in cudnn? Is there a preprocessor macro that can be used to identify this platform and toggle this change?

I haven't found one yet. I'll dig around again. But I did confirm that the patch avoids the bug and the output is correct with it.
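
For anyone who wants to gate a workaround from the Python side while no compile-time macro is known, one option is a runtime check of the board name. This is only a sketch and was not proposed in the thread: it is not a preprocessor macro, the path `/proc/device-tree/model` is an assumption about where Jetson boards expose their model string, and the helper names and sample strings are hypothetical.

```python
from pathlib import Path

# Assumed location of the board name on Jetson boards; not confirmed in this thread.
MODEL_PATH = Path("/proc/device-tree/model")

def is_jetson_nano(model_text):
    """Return True if the board model string mentions a Jetson Nano."""
    return "jetson nano" in model_text.lower()

def read_board_model(path=MODEL_PATH):
    """Return the board model string, or an empty string if unavailable."""
    try:
        # Device-tree strings are NUL-terminated, so strip the terminator.
        return path.read_text(errors="replace").rstrip("\x00\n ")
    except OSError:
        return ""

if __name__ == "__main__":
    model = read_board_model()
    print(model or "unknown board", "->", is_jetson_nano(model))
```

On a machine without that file the check simply reports an unknown board, so it is safe to run anywhere.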

Warning: this issue has been inactive for 36 days and will be automatically closed on 2019-06-28 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

Is this fixed now that cuDNN 7.6.1 is out? The devtalk thread claims that release resolved the issue but it has to be manually compiled for now.

It's fixed in 7.6.1, but AFAIK that version isn't available for the Jetson architecture yet.
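
Given the 7.6.1 threshold mentioned above, a version string can be compared against it as follows. This is a minimal sketch: the helper names are hypothetical, the version string must be obtained separately (nothing here queries cuDNN), and it assumes a purely numeric dotted version.

```python
def parse_version(text):
    """Turn a dotted version string such as '7.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Version in which the thread reports the bug as fixed.
FIXED_IN = (7, 6, 1)

def cudnn_has_reported_fix(version_text):
    """Return True if the given version is at or above the reported fix."""
    return parse_version(version_text) >= FIXED_IN

print(cudnn_has_reported_fix("7.5.0"))
print(cudnn_has_reported_fix("7.6.1"))
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "7.10.0" would sort before "7.6.1".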

Warning: this issue has been inactive for 35 days and will be automatically closed on 2019-09-01 if there is no further activity.

Warning: this issue has been inactive for 43 days and will be automatically closed on 2019-09-01 if there is no further activity.

Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error.

Has the problem been solved? I am facing a similar issue. The funny thing is that I am getting NaN, but not always.

Any updates? Would be glad to hear something here...

Great news. The problem has been solved in JetPack SDK 4.4. I have tested it.
