Dlib: Training gets stuck on GTX 1070... following #1514 fix

Created on 22 Jul 2019 · 11 Comments · Source: davisking/dlib

Hi,

Since fix #1514 was applied, training a model no longer works on my configuration, which nevertheless seems to be a standard one. The behavior is exactly the one initially reported by @reunanen: a thread blocks indefinitely while waiting during training.
I tried updating the compiler, the Nvidia driver and the cuDNN library, but without any improvement. However, this problem did not appear before the patches discussed in previous threads were applied.

Perhaps there should be an option not to apply the patch in question, if only to check whether this really is a regression. @davisking, which version should I test to see whether there is indeed a regression? Thanks in advance.

Expected Behavior

Training a model should use all of the machine's cores and should not block waiting on any kind of synchronization.

Current Behavior

The process consumes 100% of one CPU core while it sits in the cudaStreamSynchronize() function.

Steps to Reproduce

Using, for example, one of the samples that come with Dlib: .

  • Version: 19.16
  • Where did you get dlib: https://github.com/davisking/dlib
  • Platform: latest version of Windows 10 Pro 64-bit

  • Compiler: Microsoft Visual C++ 2017 14.21, cuDNN 7.6, CUDA Toolkit 10.1 Update 1

All 11 comments

Hmm. Looks like your CUDA version (10.1) is too recent compared to your dlib version (19.16).

Can you try to use the latest dlib master? Or perhaps manually pick at least #1596 and #1704?

I also ran a test with version 19.17, which seems to be the latest stable version of Dlib, but I see exactly the same behavior. I'll run a new test with the current version on GitHub to make sure I have all the latest fixes. I'll keep you informed of the result.

Looking at dates only, it doesn't look like 19.17 has #1704. (That's why I suggested latest master instead.)

Thank you for your advice. I'll take the latest master and rerun the tests.

@cydral, did you verify that using a version prior to https://github.com/davisking/dlib/pull/1514 makes the issue go away?

Hi Davis, according to the change log, the fix provided by Juha was applied in January (if I'm not mistaken). Training became unstable for me after version 19.9, and with roughly the last three versions I have consistently hit the CUDA blocking problem as reported.
I'm going to run some additional tests with the latest master to see whether the problem might instead come from the CUDA Toolkit version.

You need to check if using the older dlib code for this actually fixes this issue though. So try running without the patch and see what happens. We expect it will still hang. If that is not what happens then that is very significant.

Indeed. I will check my configuration again and run tests, first with the latest master version (I stopped at 19.7), and then I will gradually downgrade to find out from which version it no longer blocks. I wrote a new program using a ResNet network so that it stays compatible with the old versions, which lack the recently introduced definitions for upsampling. I'll keep you informed (it'll take me a little while).

So, I ran a lot of tests again to understand what may have changed in the meantime.
All tools and libraries were updated to the latest published versions (cuDNN, CUDA Toolkit, Visual Studio, ...), and I ran iterative tests starting from version 19.17 and gradually downgrading to 19.11. For all these versions the behavior is basically the same, i.e. a freeze at some point in the CUDA stub (a synchronization issue).

Because I couldn't simply compile and integrate the latest master into the test program (a classifier based on a ResNet network), I did one last test in which I replaced the "cuda" and "dnn" folders of version 19.17 with those from the current master, so as to pick up all the patches contributed by the developer community. The test program then ran perfectly for several hours, whereas the hang usually occurs after about ten minutes.
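
A hang that only shows up after about ten minutes is easy to miss during a multi-hour run. The following is a minimal sketch of a stall detector that could make such a test easier to monitor, assuming the training loop can call a function after each completed step. `StallWatchdog`, `tick()` and `stalled()` are hypothetical names introduced here for illustration; they are not part of dlib or CUDA.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper (not part of dlib): reports a stall when the training
// loop has not called tick() for longer than the configured timeout.
class StallWatchdog {
public:
    explicit StallWatchdog(std::chrono::milliseconds timeout)
        : timeout_(timeout), last_tick_(now_ms()) {}

    // Call after every completed training step.
    void tick() { last_tick_.store(now_ms(), std::memory_order_relaxed); }

    // True if no tick() has been recorded within the timeout.
    bool stalled() const {
        return now_ms() - last_tick_.load(std::memory_order_relaxed)
               > timeout_.count();
    }

private:
    // Monotonic clock in milliseconds, unaffected by wall-clock changes.
    static std::int64_t now_ms() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(
                   steady_clock::now().time_since_epoch()).count();
    }

    std::chrono::milliseconds timeout_;
    std::atomic<std::int64_t> last_tick_;
};
```

In a real test, `tick()` would be called after each training step (for instance after each `train_one_step` call on dlib's `dnn_trainer`), while a separate monitoring thread polls `stalled()` and logs a timestamp when it first returns true. This only detects the hang; it cannot release a thread that is blocked inside cudaStreamSynchronize().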

I also found an old program, compiled against version 19.11 but with an earlier version of cuDNN, that still trains without any problem. We can therefore reasonably suspect a regression at the cuDNN level. The practical conclusion, though, is that on a fully updated Windows configuration only the latest version of Dlib gets through a complete training run. Simply running an already trained model, on the other hand, does not seem to show any functional problem (I tested this point intensively as well).

I hope that this feedback will help in understanding this specific issue.

Cool. I guess I should push out dlib 19.18 soon then so these fixes become the official stable :)

Sure, thanks a lot.
