Dlib: Training using data loaders and multiple GPUs

Created on 4 Mar 2019 · 16 Comments · Source: davisking/dlib

Expected Behavior

I was trying the dnn_semantic_segmentation_train_ex example. It works perfectly with 1 GPU, but I can't make it work on multiple GPUs.
According to dnn_introduction2_ex, it should be as straightforward as adding a std::vector<int> to the constructor of the dnn_trainer with the ids of the GPUs to run the training on.
So I changed:
https://github.com/davisking/dlib/blob/master/examples/dnn_semantic_segmentation_train_ex.cpp#L290
to look like:

dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum), {0,1});

Current Behavior

I run the training with the modified line above and I get this error:

$ ./build/dnn_semantic_segmentation_train_ex VOCdevkit/VOC2012/



SCANNING PASCAL VOC2012 DATASET

images in dataset: 1464
mini-batch size: 23

dnn_trainer details: 
  net_type::num_layers:  239
  net size: 0.0165014MB
  net architecture hash: ca957805a9e35b40ff4481384dc5d66e
  loss: loss_multiclass_log_per_pixel
  synchronization file:                       pascal_voc2012_trainer_state_file.dat
  trainer.get_solvers()[0]:                   sgd: weight_decay=0.0001, momentum=0.9
  learning rate:                              0.1
  learning rate shrink factor:                0.1
  min learning rate:                          1e-05
  iterations without progress threshold:      5000
  test iterations without progress threshold: 500

step#: 0     learning rate: 0.1   average loss: 0            steps without apparent progress: 0
An unhandled exception was inside a dlib::thread_pool when it was destructed.
It's what string is: 
Error while calling cudaMemset(loss_cuda_work_buffer, 0, sizeof(float)) in file /home/arrufat/dlib/dlib/cuda/cuda_dlib.cu:1676. code: 11, reason: invalid argument
Aborted (core dumped)

I tried the training on multiple GPUs from the dnn_introduction2_ex and it works, so I wonder if I had to handle the threads of the data loaders somehow.

Steps to Reproduce

I just modified that line and ran the training.
I also noticed the following behaviour:

if the ids are: {1,2}, GPUs 0, 1, 2 are loaded.
if the ids are: {0,1}, but I launch prepending CUDA_VISIBLE_DEVICES=1,2, only GPUs 1 and 2 are loaded.

It's not a big issue, but is that the expected behavior? Maybe I am supposed to index always from 0 and use the CUDA_VISIBLE_DEVICES environment variable to set the cards, but I found that a bit odd, nonetheless.

Version: latest commit from master: 02ed083c4c8f2d9c5edb5df52523326d13b9d7ae
Where did you get dlib: I built it from this repo
Platform: x86_64 GNU/Linux
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
Compiler: GCC-5.4
CUDA: 9.0

Thanks in advance for all your hard work.

Source

arrufat

All 16 comments

There isn't anything special you need to do with the data loading. You can load data any way you want; the dnn_trainer has no idea how you do it.

As for the question about setting devices, that's how CUDA works and is the documented behavior of cuda_extra_devices. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

davisking on 4 Mar 2019

Thanks for the information about the cuda_extra_devices, it makes sense that I don't have to rebuild the program to change the GPUs.

So, do you mean it should not crash?

arrufat on 4 Mar 2019

It should not crash.

davisking on 4 Mar 2019

Warning: this issue has been inactive for 30 days and will be automatically closed on 2019-04-08 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

dlib-issue-bot on 3 Apr 2019

@davisking have you tried it yourself? Does it crash on your machine?

arrufat on 3 Apr 2019

Nothing in the examples crashes. Lots of people use multi gpu training in dlib.

davisking on 3 Apr 2019

Yes, I agree that all the examples work without crashes.
The problem comes when I try to use a threaded image loader with multiple GPUs by modifying the dnn_semantic_segmentation_train_ex.cpp.
I'm just adding {0, 1} to the trainer constructor in order to be able to use multiple GPUs, then it crashes.
That is the only change I made.
I've previously used multi-GPU training, but never with threaded image loaders, so I might be missing something.
I would really appreciate it if you could give me some guidance.
Thanks for your amazing library :)

arrufat on 4 Apr 2019

I'm not sure what you mean by threaded data loader. Some of the example programs use threads, but the threads don't interact with the dnn_trainer in any way in those examples. The dnn_trainer only gets touched from the main thread. If you are changing the code so that multiple threads make concurrent calls to the dnn_trainer then, yes, it's probably going to crash.

Or are you saying you literally take dnn_semantic_segmentation_train_ex.cpp and modify line 290 to be this?

    dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum),{0,1});

That should be fine if you actually have 2 GPUs.

I don't have a multi-gpu machine on hand to test this though. I think @reunanen runs it with multiple gpus though?

davisking on 4 Apr 2019

Thanks for the quick reply.
Sorry about the misleading term. By threaded data loaders I mean the setup in dnn_semantic_segmentation_train_ex.cpp, where threads are used to feed the data to the trainer.
And yes, I have 2 GPUs and I only modified that line. On examples that don't use threads to feed the trainer, the multi GPU training works perfectly.
If @reunanen has some suggestions, I am willing to try and help :)

arrufat on 4 Apr 2019

I can confirm that I have the same problem with this example, if I add the {0,1}.

I too get a cudaMemset error, but specifically from compute_loss_multiclass_log_per_pixel::do_work. I guess this function wasn't tested much with multiple GPUs.

I will try and see if I can come up with some simple fix.

reunanen on 5 Apr 2019

@arrufat: could you please try if #1717 helps?

reunanen on 5 Apr 2019


Quoting arrufat: "On examples that don't use threads to feed the trainer, the multi GPU training works perfectly."

My guess is that this non-threaded feeding changes the relative timings, so the race condition is simply not triggered (most of the time at least).

reunanen on 5 Apr 2019

@reunanen Wow, thank you very much! I've just tried the dnn_semantic_segmentation_train_ex.cpp by modifying line 290 to look like

dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum),{0,1});

The training seems to work. I will let it run to the end and let you know how it goes.
Thank you so much again for your quick reply :)

arrufat on 6 Apr 2019

Hi! The training completed successfully on 2 Nvidia Tesla V100. Thank you so much for providing a fix so quickly!

arrufat on 6 Apr 2019

Yep, @reunanen is the man 😁

davisking on 6 Apr 2019


Hi again, in case you're interested in the performance improvement with 2 GPUs, here are the last lines of the dnn_semantic_segmentation_train_ex output with a batch size of 32:

Using 2 GPUs

step#: 115995  learning rate: 0.0001  average loss: 0.0189884    steps without apparent progress: 4932
Saved state to pascal_voc2012_trainer_state_file.dat
saving network
Testing the network...
train accuracy  :  0.992542
val accuracy    :  0.854127

real    499m24.433s
user    1390m25.272s
sys     349m19.268s

Using 1 GPU

step#: 120322  learning rate: 0.0001  average loss: 0.0155686    steps without apparent progress: 4932
Saved state to pascal_voc2012_trainer_state_file.dat
saving network
Testing the network...
train accuracy  :  0.992885
val accuracy    :  0.846633

real    896m58.066s
user    1768m49.664s
sys     318m16.356s

So it's around 1.8 times faster.
Thank you so much to both of you, let me know if you ever come to Seoul and I'll treat you :)

arrufat on 7 Apr 2019

