Hello.
I have adapted the code from FasterRCNN_train.py to use distributed learning. This is what the learner creation looks like:
# Instantiate the learners and the trainer object
num_quantization_bits = 32
warm_up = 0
lr_schedule = learning_parameter_schedule_per_sample(lr_per_sample)
local_learner = momentum_sgd(others, lr_schedule, mm_schedule, l2_regularization_weight=l2_reg_weight,
                             unit_gain=False, use_mean_gradient=True)
learner = cntk.distributed.data_parallel_distributed_learner(local_learner,
                                                             num_quantization_bits=num_quantization_bits,
                                                             distributed_after=warm_up)
bias_lr_per_sample = [v * bias_lr_mult for v in lr_per_sample]
bias_lr_schedule = learning_parameter_schedule_per_sample(bias_lr_per_sample)
bias_local_learner = momentum_sgd(biases, bias_lr_schedule, mm_schedule, l2_regularization_weight=l2_reg_weight,
                                  unit_gain=False, use_mean_gradient=True)
bias_learner = cntk.distributed.data_parallel_distributed_learner(bias_local_learner,
                                                                  num_quantization_bits=num_quantization_bits,
                                                                  distributed_after=warm_up)
trainer = Trainer(None, (loss, pred_error), [learner, bias_learner])
After running, the error that I get is this:
CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=jfontes ; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)
Please provide a detector name as the single argument. Usage:
python DetectionDemo.py <detector_name>
Available detectors: ['FastRCNN', 'FasterRCNN']
Using default detector: FasterRCNN
training FasterRCNN
Using base model: AlexNet
lr_per_sample: [0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 1e-05]
Training model for 1 epochs.
Training 57513152 parameters in 27 parameter tensors.
Traceback (most recent call last):
File "DetectionDemo.py", line 63, in <module>
eval_model = od.train_object_detector(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\utils\od_utils.py", line 21, in train_object_detector
eval_model = train_faster_rcnn(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 291, in train_faster_rcnn
eval_model = train_faster_rcnn_e2e(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 329, in train_faster_rcnn_e2e
e2e_lr_per_sample_scaled, mm_schedule, cfg["CNTK"].L2_REG_WEIGHT, cfg["CNTK"].E2E_MAX_EPOCHS, cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 576, in train_model
trainer.train_minibatch(data) # update model with it
File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\train\trainer.py", line 181, in train_minibatch
arguments, device)
File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\cntk_py.py", line 2975, in train_minibatch_overload_for_minibatchdata
return _cntk_py.Trainer_train_minibatch_overload_for_minibatchdata(self, *args)
RuntimeError: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=CVIG-JF ; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)
[CALL STACK]
> Microsoft::MSR::CNTK::CudaTimer:: Stop
- Microsoft::MSR::CNTK::CudaTimer:: Stop
- Microsoft::MSR::CNTK::GPUMatrix<float>:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::Matrix<float>:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::TensorView<float>:: DoMatrixProductOf
- Microsoft::MSR::CNTK::TensorView<float>:: AssignMatrixProductOf
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: shared_from_this (x2)
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: shared_from_this
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- CNTK::Function:: Forward
- CNTK:: CreateTrainer
- CNTK::Trainer:: TotalNumberOfUnitsSeen
- CNTK::Trainer:: TrainMinibatch (x2)
Have I done anything wrong? Does the FasterRCNN model support distributed learning? I'm trying to run the training on a machine with 2x NVIDIA GeForce GTX 1080. Running on one GPU works, but with both I get the CUBLAS error. I have already run the ConvNet example on the CIFAR-10 dataset with both GPUs, and it worked.
I think this is the problem. For distributed training, each process should use a different GPU ID.
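A minimal sketch of one way to give each MPI worker its own GPU, assuming one process per GPU on a single machine. The helper names `gpu_id_for_rank` and `bind_process_to_gpu` are hypothetical and are not part of the Faster R-CNN example. The sketch also assumes that `Communicator.rank()` from `cntk.train.distributed` and `cntk.device.try_set_default_device` are available in the installed CNTK version.

```python
def gpu_id_for_rank(rank, gpus_per_node):
    """Map an MPI worker rank to a local GPU index (hypothetical helper)."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    # Round-robin assignment: with 2 GPUs, rank 0 -> GPU 0 and rank 1 -> GPU 1.
    return rank % gpus_per_node


def bind_process_to_gpu(gpus_per_node):
    """Pin the current worker to its own GPU before the model is created.

    CNTK is imported lazily so the mapping above can be checked without it.
    """
    import cntk
    from cntk.train.distributed import Communicator

    gpu_id = gpu_id_for_rank(Communicator.rank(), gpus_per_node)
    # Assumption: this runs before any network or learner is built, so that
    # all parameters are allocated on the selected device.
    cntk.device.try_set_default_device(cntk.device.gpu(gpu_id))
    return gpu_id
```

With two GTX 1080 cards, each worker started by mpiexec would call `bind_process_to_gpu(2)` once at startup, before building the model. If the example's configuration hard-codes a GPU ID, that setting would likely need to be removed or overridden for this to take effect.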
When running with mpiexec, the program outputs that GPU[0] and GPU[1] are selected, so I think the program is running on both 😕
But I'll try your solution. Do I just need to add that line to FasterRCNN_config.py? Or remove it?
Hi @KeDengMS , @spandantiwari .
The other day, I posted some modifications for distributed learning with Faster R-CNN here. Would they help resolve this issue?
@kyoro1 your implementation works. Thanks 😄
@KeDengMS @spandantiwari: Given the above, how about merging my modification into the master branch?
I just noticed that @kyoro1's implementation is very slow. It takes 20 seconds or more to process 100 samples. There is probably still work to do.
@jpedrofontes, in your trial, how long did training take with a single GPU, i.e. without the distributed setting? I just want to understand the situation.
I used a dataset of roughly 5000 800x800 images, and it took 37 seconds to train 100 samples. It is training at a rate of 2.5 samples/s. It takes more than 50 minutes to complete a single epoch on a single GPU.
Hello @kyoro1
Here are the full details of the issue reported in the previous conversation:
Traceback (most recent call last):
File "DetectionDemo.py", line 59, in <module>
eval_model = od.train_object_detector(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\utils\od_utils.py", line 21, in train_object_detector
eval_model = train_faster_rcnn(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 293, in train_faster_rcnn
eval_model = train_faster_rcnn_alternating(cfg)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 447, in train_faster_rcnn_alternating
rpn_rois_input=rpn_rois_input, buffered_rpn_proposals=buffered_proposals_s1)
File "C:\Users\jfontes\Documents\AGATHA\CNTK\Examples\Image\Detection\FasterRCNN\FasterRCNN_train.py", line 541, in train_model
distributed_after = cfg.WARM_UP) # no warm start as default
File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\internal\swig_helper.py", line 69, in wrapper
result = f(*args, **kwds)
File "C:\Users\jfontes\AppData\Local\Continuum\Anaconda3\envs\cntk-py35-gpu\lib\site-packages\cntk\train\distributed.py", line 143, in data_parallel_distributed_learner
cntk_py.mpicommunicator(),
RuntimeError: MPIWrapperMpi: this is a singleton class that can only be instantiated once per process
[CALL STACK]
> std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: operator=
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase> (x2)
- CNTK:: QuantizedMPICommunicator
- CNTK:: MPICommunicator
- PyInit__cntk_py
- PyCFunction_Call
- PyEval_GetFuncDesc
- PyEval_EvalFrameEx (x2)
- PyFunction_SetAnnotations
- PyObject_Call
- PyEval_GetFuncDesc
- PyEval_EvalFrameEx (x2)
- PyEval_GetFuncDesc
@jpedrofontes Oh, really... So we should wait for the bugs to be resolved, as @KeDengMS said above.
I'll be waiting. If you need anything from me, just post here 😄