Hi,
Have you tried to run training on multiple gpus?
I am getting the below error when I try to do that.thank you
Traceback (most recent call last):
File "train.py", line 194, in <module>
main(opt)
File "train.py", line 128, in main
loss = model(imgs, targets, requestPrecision=True)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
@xiao1228 The requirements clearly state Python 3.6. I'd advise you to follow them.
Multi-GPU training is still a work in progress. If you could help debug this after upgrading your Python that would be great!
hi I am getting the same error after upgrade to Python 3.6. I will work on that if I can fix it will update you
Getting the error below by trying to move all the variables to cuda
utils/utils.py", line 293, in build_targets
TP[b, i] = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, self_, src1, src2)' failed. at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:688
@xiao1228 Have you solved your 1st problem in this issue? I wanna turn the code into multi-gpu and met it, too. I'm still confused. Thank you.
@xiao1228 @zhaoyang10 the code does not support multi-GPU yet unfortunately. I only have a single-GPU machine so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Many thanks!!
I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.
https://github.com/ultralytics/yolov3/blob/af0033c9e96a4a8dcc04f8fd737d0ccad3364f10/train.py#L60-L63
I've added multi-GPU training support to the pipeline.
@alexpolichroniadis , I run your code ,may not work on 4 x 1080ti.....
Tested on 8x1080tis.
What's the trace?
Epoch Batch xy wh conf cls total nTargets time
C:\Users\NJ\Anaconda3\lib\site-packages\torch\nn\parallel_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
C:\Users\NJ\Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')
0/99 0/3643 1.13 5.46 555 13.3 575 1.6e+03 26.3
0/99 1/3643 1.65 8.2 833 20 863 3.2e+03 3.28
0/99 2/3643 2.19 11.2 1.11e+03 26.7 1.15e+03 4.8e+03 1.77
0/99 3/3643 2.72 14.1 1.39e+03 33.3 1.44e+03 6.4e+03 1.77
0/99 4/3643 3.25 17.1 1.67e+03 40 1.73e+03 8e+03 1.82
0/99 5/3643 3.78 20.1 1.94e+03 46.6 2.01e+03 9.6e+03 1.64
0/99 6/3643 4.31 23.1 2.22e+03 53.3 2.3e+03 1.12e+04 1.66
0/99 7/3643 4.83 26.1 2.5e+03 59.9 2.59e+03 1.28e+04 1.73
0/99 8/3643 5.35 29.1 2.78e+03 66.6 2.88e+03 1.44e+04 1.81
0/99 9/3643 5.87 32.1 3.05e+03 73.2 3.16e+03 1.6e+04 1.62
0/99 10/3643 6.4 35.1 3.33e+03 79.9 3.45e+03 1.76e+04 1.74
0/99 11/3643 6.92 38.1 3.61e+03 86.6 3.74e+03 1.92e+04 1.65
0/99 12/3643 7.45 41.1 3.89e+03 93.2 4.03e+03 2.08e+04 1.73
....
....
0/99 333/3643 176 830 5.25e+04 2.2e+03 5.57e+04 5.34e+05 1.88
0/99 334/3643 177 831 5.25e+04 2.21e+03 5.57e+04 5.36e+05 1.7
0/99 335/3643 177 833 5.25e+04 2.21e+03 5.58e+04 5.38e+05 1.79
0/99 336/3643 178 835 5.26e+04 2.22e+03 5.58e+04 5.39e+05 1.8
0/99 337/3643 178 836 5.26e+04 2.23e+03 5.59e+04 5.41e+05 1.68
0/99 338/3643 179 838 5.27e+04 2.23e+03 5.59e+04 5.42e+05 1.85
0/99 339/3643 179 840 5.27e+04 2.24e+03 5.6e+04 5.44e+05 1.91
0/99 340/3643 180 841 5.27e+04 2.24e+03 5.6e+04 5.46e+05 1.84
0/99 341/3643 180 843 5.28e+04 2.25e+03 5.6e+04 5.47e+05 1.83
0/99 342/3643 181 845 5.28e+04 2.26e+03 5.61e+04 5.49e+05 1.7
0/99 343/3643 181 846 5.29e+04 2.26e+03 5.61e+04 5.5e+05 1.95
0/99 344/3643 182 848 5.29e+04 2.27e+03 5.62e+04 5.52e+05 1.89
loss is growing...
Looks like it's working. You are training and batches are being pushed through your model. Lack of NCCL support is a problem with your installation of Pytorch, not the code of this repo. The UserWarning can be ignored.
Also what is your batch size, specified when running train.py (--batch-size) ?
parser.add_argument('--epochs', type=int, default=100, help='number of epochs')
parser.add_argument('--batch-size', type=int, default=32, help='size of each image batch')
parser.add_argument('--accumulated-batches', type=int, default=1, help='number of batches before optimizer step')
parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg', help='cfg file path')
parser.add_argument('--data-cfg', type=str, default='cfg/coco.data', help='coco.data file path')
parser.add_argument('--multi-scale', action='store_true', help='random image sizes per batch 320 - 608')
parser.add_argument('--img-size', type=int, default=32 * 13, help='pixels')
parser.add_argument('--resume', action='store_true', help='resume training flag')
parser.add_argument('--num-workers', type=int, default=0, help='number of workers for dataloader')
parser.add_argument('--var', type=float, default=0, help='test variable')
@alexpolichroniadis thank you for your reply :)
I use the default params but the modify batch size
@alexpolichroniadis thank you for your reply :)
I use the default params but the modify batch size
Looking at your output, it looks like you pulled an earlier commit from that PR, based on the exploding loss report I'm seeing. I fixed that today in the latest commit. Are you sure you are working off the latest commit of that PR?
Today,I just download from https://github.com/alexpolichroniadis/yolov3
Maybe, use :https://github.com/alexpolichroniadis/yolov3/tree/multigpu will be better and I will try
Maybe, use :https://github.com/alexpolichroniadis/yolov3/tree/multigpu will be better and I will try
Yes, that is the correct branch to work off. Master still has an older version.
This code runs like this:
219 module.104.batch_norm_104.bias True 256 [256] 0 0
220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients
Epoch Batch xy wh conf cls total nTargets time
and then it keep this state up to now.It looks like this program may not running
This code runs like this:
219 module.104.batch_norm_104.bias True 256 [256] 0 0
220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradientsEpoch Batch xy wh conf cls total nTargets time
and then it keep this state up to now.It looks like this program may not running
I noticed that you are running this code on Windows. Keep in mind that pytorch's DataParallel might not be operational on Windows machines due to lack of NCCL support, see here. My testing was on an Ubuntu machine.
@alexpolichroniadis ,Thanks for your help,this code work well now :)
@alexpolichroniadis I get an error when I run the following on a GCP PyTorch instance with 2 GPUs. I noticed you changed coco.data from the darknet default, so I updated this to point back to the default, and this fixed the error.
sudo rm -rf yolov3 && git clone -b multigpu --depth 1 https://github.com/alexpolichroniadis/yolov3
cd yolov3 && python3 train.py
Namespace(accumulated_batches=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', epochs=100, img_size=416, multi_scale=Fa
lse, num_workers=0, resume=False, var=0)
Using CUDA. Available devices:
0 - Tesla P100-PCIE-16GB - 16280MB
1 - Tesla P100-PCIE-16GB - 16280MB
Traceback (most recent call last):
File "train.py", line 234, in <module>
var=opt.var,
File "train.py", line 46, in train
train_loader = ImageLabelDataset(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
File "/home/ultralytics/yolov3/utils/datasets.py", line 105, in __init__
for x in self.img_files]
AttributeError: 'ImageLabelDataset' object has no attribute 'img_files'
Now I see a seperate problem though, there doesn't appear to be any speedup. Single P100 takes about 0.6s, the same as 2 P100s here:
Epoch Batch xy wh conf cls total nTargets time
/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but a
ll input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0/99 0/7327 0.51 2.76 277 6.66 287 121 7
0/99 1/7327 0.52 2.7 278 6.65 287 99 0.712
0/99 2/7327 0.539 2.83 278 6.65 288 143 0.631
0/99 3/7327 0.543 2.83 278 6.64 288 123 0.608
...
Whats your batch size? If your batch can fit perfectly on one GPU, then (in most cases) you are better off using a single GPU. The benefits of a multi GPU setup is cranking up the batch size and having more images be processed in the same amount of time. Try setting your --batch-size to 128 (or something outside what a single GPU can handle) for example and re-testing.
Since batching happens on the CPU, there are also cases where the CPU becomes the bottleneck then (the GPU waits for the batch to be created). This becomes more apparent with big batch sizes. In all there is balance that needs to be found and is not directly apparent.
One other thing: With nn.Dataparallel, there is preliminary loading of the GPUs with a copy of the model each. This happens on the first batch and is reflected in the higher time reported when processing the first batch.
An example:
On my setup, for a batch size of 128, processing time per batch is 1 sec.
For a batch size of 256, it is 1.6sec.
(all cases with dataparallel on).
On a single 1080Ti a batch size of 128 is not doable.
@door5719 @alexpolichroniadis thanks for the info. We started on our own multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach, selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so it appears the current method of selecting one anchor from each yolo layer is correct. In the process we did get multi_gpu operational, though not with the speedups expected. We did not attempt to use a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year.
We don't have multiple gpu machines on premise so we tested this with GCP Deep Learning VMs. We used batch_size=26 (max that 1 P100 can handle) times the number of GPUs. All other training setting were defaults. We selected the fastest batch out of the first 30 for timing purposes. Results are below for our branch and the https://github.com/ultralytics/yolov3/pull/121 PR. In both cases the speedups were very poor. It's possible the IO ops were constrained by GCP due to the limited SSD size, we will try again with a larger SSD but we wanted to get these results out here for feedback. If anyone has another repo or PR we can compare against please let us know!
https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 500 GB SSD
GPUs | batch_size | yolov3/tree/multi_gpu | yolov3/pull/121
--- |---| --- | ---
(P100) | (images) | (s/batch) | (s/batch)
1 | 26 | 0.91s | 1.05s
2 | 52 | 1.60s | 1.76s
4 | 104 | 2.26s | 2.81s
I think torch.nn.parallel.DistributedDataParallel is better than nn.DataParallel. The usage of DataParallel should be bottleneck.
Because the box2 is torch.FloatTensor, the anchor_vec is on cpu. while the box1 is on GPU.
so, just use .cuda() to transform the data into torch.cuda.FloatTensor()
` box2 = anchor_vec.cuda().unsqueeze(1)
inter_area = torch.min(box1, box2).prod(2)`
but, when you fix this, the below will also come out some bug.
` txy[b, a, gj, gi] = gxy - gxy.floor()
# Width and height
twh[b, a, gj, gi] = torch.log(gwh/ anchor_vec[a]) `
you need to transform the data type to GPU or Cuda according to the error info.
However, the main reason for multi-GPU training lies in
for i, (imgs, targets, _, _) in enumerate(dataloader):
where the imgs is a tensor, but the targets are lists. When parallel the imgs.to(device). The imgs are divided into batch_size/GPU_nums. But the targets cannot targets.to(device)(since it is a list), and the targets are the same num as the batch_size, cannot distribute into every GPUs.
if nM > 0:
lxy = k * MSELoss(xy[mask], txy[mask])
lwh = k * MSELoss(wh[mask], twh[mask])
the xy, txy, wh, twh is not the same dims as the batch_size.
the xy, wh is batch_size/GPU_nums.
but the txy, twh is the targets_nums( batch_size). There will occur some error.
@longxianlei we just PRd our under-development multi_gpu branch into the master branch, so multi-GPU functionality now works. Many of the items you raised above should be resolved. Can you try the latest commit and see if it works for you? See #135 for more info.
@glenn-jocher keep in mind that batch sizes should be integer multiples of the number of available GPUs. For a batch size of 26 on 4 GPUs, you are essentially pushing 26//4 = 6 images on all GPUs and the two remaining ones are pushed on the last GPU. This is unbalanced as each GPU processes batch sizes of 6/6/6/8.
The ideal batch size to test here would be 4*6=24. And multiples of 24 thereafter. Also it is true that the actual bottleneck might be IO at this point.
Updated times with batch_size=24, and comparison to existing study.
https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD
GPUs | batch_size | 613ce1be1aac84c729bb1e77b029df260644cc1c | COCO epoch
--- |---| --- | ---
(P100) | (images) | (s/batch) | (min/epoch)
1 | 24 | 0.84s | 70min
2 | 48 | 1.27s | 53min
4 | 96 | 2.11s | 44min
Comparison results from https://github.com/ilkarman/DeepLearningFrameworks

@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU significantly (#141). I reimplented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.
https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD
GPUs | batch_size | speed | COCO epoch
--- |---| --- | ---
(P100) | (images) | (s/batch) | (min/epoch)
1 | 16 | 0.39s | 48min
2 | 32 | 0.48s | 29min
4 | 64 | 0.65s | 20min
@glenn-jocher I never noticed that the default for the dataloder's num_workers set to 0 because I set it manually all the time, whoops. 😅
Good results indeed. In line with what I was getting.
Had the same issue. But I've used this repo on multi-GPU before and it's worked well. Somebody had posted saying the batch-size in the last iteration might be lesser than the batch-size given during training so I removed a few images to make the validation set images a multiple of 8, as I'd given 8 as my batch-size during training and it solved the issue.
Most helpful comment
@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU significantly (#141). I reimplented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.
https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD
GPUs |
batch_size| speed | COCO epoch--- |---| --- | ---
(P100) | (images) | (s/batch) | (min/epoch)
1 | 16 | 0.39s | 48min
2 | 32 | 0.48s | 29min
4 | 64 | 0.65s | 20min