I was trying the dnn_semantic_segmentation_train_ex. It works perfectly with 1 GPU, however I can't make it work on multiple GPUs.
According to dnn_introduction2_ex, it should be as straightforward as adding a std::vector<int> to the constructor of the dnn_trainer with the ids of the GPUs to run the training on.
So I changed:
https://github.com/davisking/dlib/blob/master/examples/dnn_semantic_segmentation_train_ex.cpp#L290
to look like:
dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum), {0,1});
I ran the training with the modified line above and got this error:
$ ./build/dnn_semantic_segmentation_train_ex VOCdevkit/VOC2012/
SCANNING PASCAL VOC2012 DATASET
images in dataset: 1464
mini-batch size: 23
dnn_trainer details:
net_type::num_layers: 239
net size: 0.0165014MB
net architecture hash: ca957805a9e35b40ff4481384dc5d66e
loss: loss_multiclass_log_per_pixel
synchronization file: pascal_voc2012_trainer_state_file.dat
trainer.get_solvers()[0]: sgd: weight_decay=0.0001, momentum=0.9
learning rate: 0.1
learning rate shrink factor: 0.1
min learning rate: 1e-05
iterations without progress threshold: 5000
test iterations without progress threshold: 500
step#: 0 learning rate: 0.1 average loss: 0 steps without apparent progress: 0
An unhandled exception was inside a dlib::thread_pool when it was destructed.
It's what string is:
Error while calling cudaMemset(loss_cuda_work_buffer, 0, sizeof(float)) in file /home/arrufat/dlib/dlib/cuda/cuda_dlib.cu:1676. code: 11, reason: invalid argument
Aborted (core dumped)
I tried multi-GPU training with dnn_introduction2_ex and it works, so I wonder whether I have to handle the data loader threads somehow.
I only modified that line and ran the training.
I also noticed the following behaviour:
If I pass {1,2}, GPUs 0, 1 and 2 are loaded.
If I pass {0,1} but launch with CUDA_VISIBLE_DEVICES=1,2 prepended, only GPUs 1 and 2 are loaded.
It's not a big issue, but is that the expected behavior? Maybe I am supposed to always index from 0 and use the CUDA_VISIBLE_DEVICES environment variable to select the cards, but I found that a bit odd nonetheless.
Platform: x86_64 GNU/Linux
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
Compiler: GCC-5.4
Thanks in advance for all your hard work.
There isn't anything special you need to do with the data loading. You can load data any way you want and the dnn_trainer has no idea how you do it.
As for the question about setting devices, that's how CUDA works, and it is the documented behavior of cuda_extra_devices. See http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction
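To make the device numbering discussed above concrete, here is a minimal sketch in plain C++ (no dlib or CUDA calls). It assumes, based on the two observations reported in this thread, that the trainer uses the currently selected device plus the ids passed in cuda_extra_devices, and it models CUDA_VISIBLE_DEVICES as renumbering the listed physical GPUs starting from 0. The helper names parse_visible_devices and physical_devices_in_use are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse a CUDA_VISIBLE_DEVICES-style value such as "1,2" into physical GPU ids.
// Assumption for this sketch: an empty string means "variable not set", and
// only integer ids are handled.
std::vector<int> parse_visible_devices(const std::string& value)
{
    std::vector<int> ids;
    std::stringstream ss(value);
    std::string token;
    while (std::getline(ss, token, ','))
        if (!token.empty())
            ids.push_back(std::stoi(token));
    return ids;
}

// Hypothetical helper: return the physical GPUs that would end up in use.
// Assumption: the current device is always used, and the extra devices are
// added on top of it (duplicates are ignored here).
std::vector<int> physical_devices_in_use(
    int current_device,
    const std::vector<int>& extra_devices,
    const std::string& visible_devices)
{
    // Logical ids as the program sees them: current device first, then extras.
    std::vector<int> logical{current_device};
    for (int id : extra_devices)
        if (std::find(logical.begin(), logical.end(), id) == logical.end())
            logical.push_back(id);

    const std::vector<int> visible = parse_visible_devices(visible_devices);
    if (visible.empty())
        return logical;  // no remapping: logical id == physical id

    // With a visibility list, logical id i refers to the i-th listed GPU.
    std::vector<int> physical;
    for (int id : logical)
    {
        assert(id >= 0 && static_cast<std::size_t>(id) < visible.size());
        physical.push_back(visible[static_cast<std::size_t>(id)]);
    }
    return physical;
}
```

Under these assumptions, physical_devices_in_use(0, {1, 2}, "") gives GPUs 0, 1 and 2, while physical_devices_in_use(0, {0, 1}, "1,2") gives GPUs 1 and 2, which matches both observations in the question.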
Thanks for the information about the cuda_extra_devices, it makes sense that I don't have to rebuild the program to change the GPUs.
So, do you mean it should not crash?
It should not crash.
Warning: this issue has been inactive for 30 days and will be automatically closed on 2019-04-08 if there is no further activity.
If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.
@davisking have you tried it yourself? Does it crash on your machine?
Nothing in the examples crashes. Lots of people use multi gpu training in dlib.
Yes, I agree that all the examples work without crashes.
The problem comes when I try to use a threaded image loader with multiple GPUs by modifying the dnn_semantic_segmentation_train_ex.cpp.
I'm just adding {0, 1} to the trainer constructor in order to be able to use multiple GPUs, then it crashes.
That is the only change I made.
I've previously used multi-GPU training, but never with threaded image loaders, so I might be missing something.
I would really appreciate it if you could give me some guidance.
Thanks for your amazing library :)
I'm not sure what you mean by threaded data loader. Some of the example programs use threads, but the threads don't interact with the dnn_trainer in any way in those examples. The dnn_trainer only gets touched from the main thread. If you are changing the code so that multiple threads make concurrent calls to the dnn_trainer then, yes, it's probably going to crash.
Or are you saying you literally take dnn_semantic_segmentation_train_ex.cpp and modify line 290 to be this?
dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum),{0,1});
That should be fine if you actually have 2 GPUs.
I don't have a multi-GPU machine on hand to test this, though. I think @reunanen runs it with multiple GPUs?
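The pattern described above, where loader threads only produce samples and a single thread talks to the trainer, might look roughly like the following sketch. It uses only the C++ standard library. The names bounded_queue, fake_trainer and run_training are hypothetical, made up for this illustration, and are not dlib APIs; the real example's queue type and trainer are replaced by stand-ins.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal bounded blocking queue (hypothetical stand-in for the example's queue).
template <typename T>
class bounded_queue
{
public:
    explicit bounded_queue(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Blocks while the queue is full.
    void push(T item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Blocks while the queue is empty.
    T pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t capacity_;
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Hypothetical stand-in for a trainer that is NOT safe to call concurrently.
struct fake_trainer
{
    long samples_seen = 0;
    void train_one_step(const std::vector<int>& batch)
    {
        samples_seen += static_cast<long>(batch.size());
    }
};

// Loader threads only push samples; the calling thread alone feeds the trainer.
long run_training(int num_loaders, int samples_per_loader, std::size_t batch_size)
{
    bounded_queue<int> data(200);
    std::vector<std::thread> loaders;
    for (int i = 0; i < num_loaders; ++i)
        loaders.emplace_back([&data, samples_per_loader, i] {
            for (int s = 0; s < samples_per_loader; ++s)
                data.push(i * samples_per_loader + s);
        });

    fake_trainer trainer;
    const long total = static_cast<long>(num_loaders) * samples_per_loader;
    long consumed = 0;
    while (consumed < total)
    {
        std::vector<int> batch;
        while (batch.size() < batch_size && consumed < total)
        {
            batch.push_back(data.pop());
            ++consumed;
        }
        trainer.train_one_step(batch);  // only ever called from this thread
    }

    for (auto& t : loaders)
        t.join();
    return trainer.samples_seen;
}
```

For example, run_training(4, 250, 23) starts four loader threads and returns 1000 once every sample has passed through the single consuming thread. The point of the design is that the queue is the only shared state, so the trainer itself needs no locking.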
Thanks for the quick reply.
Sorry about the misleading term. By threaded data loaders I'm referring to dnn_semantic_segmentation_train_ex.cpp, where you use threads to feed the data to the trainer.
And yes, I have 2 GPUs and I only modified that line. On examples that don't use threads to feed the trainer, the multi GPU training works perfectly.
If @reunanen has some suggestions, I am willing to try and help :)
I can confirm that I have the same problem with this example, if I add the {0,1}.
I too get a cudaMemset error, but specifically from compute_loss_multiclass_log_per_pixel::do_work. I guess this function wasn't tested much with multiple GPUs.
I will try and see if I can come up with some simple fix.
@arrufat: could you please try if #1717 helps?
"On examples that don't use threads to feed the trainer, the multi GPU training works perfectly."
My guess is that this non-threaded feeding changes the relative timings, so the race condition is simply not triggered (most of the time at least).
@reunanen Wow, thank you very much! I've just tried the dnn_semantic_segmentation_train_ex.cpp by modifying line 290 to look like
dnn_trainer<bnet_type> trainer(bnet,sgd(weight_decay, momentum),{0,1});
And the training seems to work, I will leave it until the end and let you know how it went.
Thank you so much again for your quick reply :)
Hi! The training completed successfully on 2 NVIDIA Tesla V100 GPUs. Thank you so much for providing a fix so quickly!
Yep, @reunanen is the man 😁
Hi again. In case you're interested in the performance improvement with 2 GPUs, here are the last lines of the dnn_semantic_segmentation_train_ex output with a batch size of 32, first with 2 GPUs and then with 1 GPU:
step#: 115995 learning rate: 0.0001 average loss: 0.0189884 steps without apparent progress: 4932
Saved state to pascal_voc2012_trainer_state_file.dat
saving network
Testing the network...
train accuracy : 0.992542
val accuracy : 0.854127
real 499m24.433s
user 1390m25.272s
sys 349m19.268s
step#: 120322 learning rate: 0.0001 average loss: 0.0155686 steps without apparent progress: 4932
Saved state to pascal_voc2012_trainer_state_file.dat
saving network
Testing the network...
train accuracy : 0.992885
val accuracy : 0.846633
real 896m58.066s
user 1768m49.664s
sys 318m16.356s
So it's around 1.8 times faster.
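The 1.8 figure follows from the two `real` wall-clock times above (499m24.433s and 896m58.066s), assuming the shorter one is the 2-GPU run. A small sketch of the arithmetic, with hypothetical helper names to_seconds and speedup:

```cpp
#include <cassert>

// Convert a `time`-style reading of the form XmY.YYYs into seconds.
double to_seconds(int minutes, double seconds)
{
    assert(minutes >= 0 && seconds >= 0.0);
    return minutes * 60.0 + seconds;
}

// Ratio of the slower (baseline) wall-clock time to the faster one.
double speedup(double baseline_seconds, double improved_seconds)
{
    assert(improved_seconds > 0.0);
    return baseline_seconds / improved_seconds;
}
```

With the numbers from the logs, speedup(to_seconds(896, 58.066), to_seconds(499, 24.433)) is 53818.066 / 29964.433, which is just under 1.8. This compares total wall-clock time only; the two runs stopped at slightly different step counts.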
Thank you so much to both of you, let me know if you ever come to Seoul and I'll treat you :)