Dlib: Example code freeze while training DNN

Created on 13 Dec 2019 · 8 Comments · Source: davisking/dlib

Expected Behavior


I am trying to run the _dnn_mmod_train_find_cars_ex example code provided with dlib. I expect the program to run until training is complete.

Current Behavior


The program freezes during training. Sometimes it runs for several hours, sometimes for a few minutes. It seems to freeze after saving the state to file. I added a printout at the bottom of the training while loop:

    cout << "Still running: " << cnt << endl;

and the output looks like this after the freeze:

    Still running: 3989
    Still running: 3990
    Still running: 3991
    Saved state to mmod_cars_sync
    Still running: 3992

The only thing I have changed in the original sample is reducing the batch size from 87 to 8. Otherwise I get an "out of memory" crash somewhere in the CUDA malloc call.

    int batchSize = 8; // was 87 - not enough GPU memory
    while(trainer.get_learning_rate() >= 1e-4)
    {
        // Every 30 mini-batches we do a testing mini-batch.  
        if (cnt%30 != 0 || images_test.size() == 0)
        {
            cropper(batchSize, images_train, boxes_train, mini_batch_samples, mini_batch_labels);

Steps to Reproduce


If you can, run the sample on Win10 with an NVIDIA Quadro P500, using VS2019 and a batch size of 8.

  • Version:
    dlib version: 19.18
  • Where did you get dlib:
    I downloaded dlib from dlib.net
  • Platform:
    Win10 pro, build 18362, 64bit
  • Compiler:
    Compiler: Visual Studio 2019 (v. 16.4.1)

All 8 comments

I'm not sure there is much I can do about this. There are known bugs in
cuDNN on windows with some graphics cards. You should report this to
NVIDIA and maybe it will help them debug it.

@EinarBjorn What version of CUDA do you have? If 10.2 or so, could you please try to remove the upper limit from this condition here?

@davisking If doing so helps, then we might consider removing the upper limit until there's at least some evidence suggesting that NVidia have actually fixed their drivers?

Yeah, removing the upper limit seems like a good idea if this turns out to be the case.

Thanks @reunanen.
I am indeed using CUDA 10.2, and this seems to have done the trick: the training has been running for about 12 hours without freezing.

I wonder though: after about 62,000 steps, the learning rate is still at 0.1. Is that normal?

I just changed the code to assume cudaStreamSynchronize is broken until CUDA V11. I thought about removing the upper limit entirely, but without one it would be easy to forget about this and never switch back.
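For readers following along, here is a minimal sketch of the kind of workaround discussed above: instead of a single blocking wait, poll the stream status in a loop, and enable that path only for CUDA versions below 11. It uses stand-in types so it compiles without CUDA. `QueryStatus`, `wait_by_polling` and `use_polling_workaround` are hypothetical names for illustration, not dlib's actual code. In real CUDA code the query would be `cudaStreamQuery`, which returns `cudaSuccess` or `cudaErrorNotReady`. The lower bound of dlib's real version condition is not shown in this thread, so it is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>

// Stand-ins for cudaSuccess, cudaErrorNotReady and any other error code.
enum class QueryStatus { Done, NotReady, Failed };

// Busy-wait until query() reports completion, instead of making one
// blocking "synchronize" call. Returns the number of polls performed.
inline std::size_t wait_by_polling(const std::function<QueryStatus()>& query)
{
    assert(static_cast<bool>(query));  // an empty callable cannot be polled
    std::size_t polls = 0;
    while (true)
    {
        ++polls;
        switch (query())
        {
            case QueryStatus::Done:     return polls;  // stream has finished
            case QueryStatus::NotReady: break;         // leave the switch, poll again
            case QueryStatus::Failed:   throw std::runtime_error("stream query failed");
        }
    }
}

// Version gate in the CUDA_VERSION encoding (major*1000 + minor*10),
// so CUDA 10.2 is 10020 and CUDA 11.0 is 11000.
// Assumption: only the "until CUDA V11" upper limit from the thread is modeled.
constexpr bool use_polling_workaround(int cuda_version)
{
    return cuda_version < 11000;
}
```

With this gate, a CUDA 10.2 build takes the polling path, and a CUDA 11 build falls back to the normal blocking call. The fixed upper limit is the design choice described in the comment above: the workaround switches itself off at V11 rather than staying in place indefinitely.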

@EinarBjorn That example has the iterations without progress threshold set to 50,000. So that sounds normal.
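To see why an unchanged learning rate after 62,000 steps is plausible, here is a toy model of a schedule that shrinks the learning rate only after a long run of steps without progress. It is a simplified sketch and not dlib's trainer: dlib decides "progress" from the loss history, while this model takes a plain boolean per step. `ToySchedule` and its members are hypothetical names, and the shrink factor of 0.1 is an assumption. The starting rate of 0.1 and the threshold of 50,000 come from the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for a trainer's learning-rate schedule.
struct ToySchedule
{
    double learning_rate = 0.1;               // starting rate reported in the thread
    double shrink_factor = 0.1;               // assumption for illustration
    std::size_t threshold = 50000;            // iterations without progress threshold
    std::size_t steps_without_progress = 0;

    // Record one training step. The rate shrinks only once `threshold`
    // consecutive steps have passed without progress.
    void step(bool made_progress)
    {
        assert(threshold > 0);
        if (made_progress)
        {
            steps_without_progress = 0;       // any progress restarts the count
            return;
        }
        if (++steps_without_progress >= threshold)
        {
            learning_rate *= shrink_factor;
            steps_without_progress = 0;
        }
    }
};
```

In this model the earliest possible shrink is at step 50,000, and only if no step before it counted as progress. A single improving step at, say, step 20,000 restarts the count, so the rate is still 0.1 at step 62,000.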

Warning: this issue has been inactive for 35 days and will be automatically closed on 2020-02-19 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error.
