Yolov3: Multi-GPU Training

Created on 2 Oct 2018 · 33 Comments · Source: ultralytics/yolov3

Hi,
Have you tried to run training on multiple GPUs?
I am getting the error below when I try to do that. Thank you.

Traceback (most recent call last):
  File "train.py", line 194, in <module>
    main(opt)
  File "train.py", line 128, in main
    loss = model(imgs, targets, requestPrecision=True)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'

bug help wanted

xiao1228

Most helpful comment

@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU training significantly (#141). I reimplemented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100) | batch_size (images) | speed (s/batch) | COCO epoch (min/epoch)
--- | --- | --- | ---
1 | 16 | 0.39s | 48min
2 | 32 | 0.48s | 29min
4 | 64 | 0.65s | 20min

glenn-jocher on 21 Mar 2019

👍3 🎉2

All 33 comments

@xiao1228 The requirements clearly state Python 3.6. I'd advise you to follow them.

Multi-GPU training is still a work in progress. If you could help debug this after upgrading your Python that would be great!

glenn-jocher on 2 Oct 2018

Hi, I am getting the same error after upgrading to Python 3.6. I will work on it, and if I can fix it I will update you.

xiao1228 on 3 Oct 2018

👍2

I get the error below when trying to move all the variables to CUDA:

utils/utils.py", line 293, in build_targets
TP[b, i] = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, self_, src1, src2)' failed. at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:688

xiao1228 on 10 Oct 2018

@xiao1228 Have you solved the first problem in this issue? I want to run the code on multiple GPUs and hit the same error. I'm still confused. Thank you.

zhaoyang10 on 31 Oct 2018

@xiao1228 @zhaoyang10 the code does not support multi-GPU yet unfortunately. I only have a single-GPU machine so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Many thanks!!

glenn-jocher on 2 Nov 2018

I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.
https://github.com/ultralytics/yolov3/blob/af0033c9e96a4a8dcc04f8fd737d0ccad3364f10/train.py#L60-L63

glenn-jocher on 29 Nov 2018

I've added multi-GPU training support to the pipeline.

See here: https://github.com/ultralytics/yolov3/pull/121

alexpolichroniadis on 7 Mar 2019

@alexpolichroniadis, I ran your code, and it may not work on 4 x 1080 Ti.

door5719 on 8 Mar 2019

Tested on 8x1080tis.

What's the trace?

alexpolichroniadis on 8 Mar 2019

Epoch Batch xy wh conf cls total nTargets time
C:\Users\NJ\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
C:\Users\NJ\Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')

    0/99      0/3643      1.13      5.46       555      13.3       575   1.6e+03      26.3
    0/99      1/3643      1.65       8.2       833        20       863   3.2e+03      3.28
    0/99      2/3643      2.19      11.2  1.11e+03      26.7  1.15e+03   4.8e+03      1.77
    0/99      3/3643      2.72      14.1  1.39e+03      33.3  1.44e+03   6.4e+03      1.77
    0/99      4/3643      3.25      17.1  1.67e+03        40  1.73e+03     8e+03      1.82
    0/99      5/3643      3.78      20.1  1.94e+03      46.6  2.01e+03   9.6e+03      1.64
    0/99      6/3643      4.31      23.1  2.22e+03      53.3   2.3e+03  1.12e+04      1.66
    0/99      7/3643      4.83      26.1   2.5e+03      59.9  2.59e+03  1.28e+04      1.73
    0/99      8/3643      5.35      29.1  2.78e+03      66.6  2.88e+03  1.44e+04      1.81
    0/99      9/3643      5.87      32.1  3.05e+03      73.2  3.16e+03   1.6e+04      1.62
    0/99     10/3643       6.4      35.1  3.33e+03      79.9  3.45e+03  1.76e+04      1.74
    0/99     11/3643      6.92      38.1  3.61e+03      86.6  3.74e+03  1.92e+04      1.65
    0/99     12/3643      7.45      41.1  3.89e+03      93.2  4.03e+03  2.08e+04      1.73
....
....
    0/99    333/3643       176       830  5.25e+04   2.2e+03  5.57e+04  5.34e+05      1.88
    0/99    334/3643       177       831  5.25e+04  2.21e+03  5.57e+04  5.36e+05       1.7
    0/99    335/3643       177       833  5.25e+04  2.21e+03  5.58e+04  5.38e+05      1.79
    0/99    336/3643       178       835  5.26e+04  2.22e+03  5.58e+04  5.39e+05       1.8
    0/99    337/3643       178       836  5.26e+04  2.23e+03  5.59e+04  5.41e+05      1.68
    0/99    338/3643       179       838  5.27e+04  2.23e+03  5.59e+04  5.42e+05      1.85
    0/99    339/3643       179       840  5.27e+04  2.24e+03   5.6e+04  5.44e+05      1.91
    0/99    340/3643       180       841  5.27e+04  2.24e+03   5.6e+04  5.46e+05      1.84
    0/99    341/3643       180       843  5.28e+04  2.25e+03   5.6e+04  5.47e+05      1.83
    0/99    342/3643       181       845  5.28e+04  2.26e+03  5.61e+04  5.49e+05       1.7
    0/99    343/3643       181       846  5.29e+04  2.26e+03  5.61e+04   5.5e+05      1.95
    0/99    344/3643       182       848  5.29e+04  2.27e+03  5.62e+04  5.52e+05      1.89

loss is growing...

door5719 on 8 Mar 2019

Looks like it's working. You are training and batches are being pushed through your model. Lack of NCCL support is a problem with your installation of PyTorch, not the code of this repo. The UserWarning can be ignored.

alexpolichroniadis on 8 Mar 2019

Also, what batch size did you specify when running train.py (--batch-size)?

alexpolichroniadis on 8 Mar 2019

parser.add_argument('--epochs', type=int, default=100, help='number of epochs')
parser.add_argument('--batch-size', type=int, default=32, help='size of each image batch')
parser.add_argument('--accumulated-batches', type=int, default=1, help='number of batches before optimizer step')
parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg', help='cfg file path')
parser.add_argument('--data-cfg', type=str, default='cfg/coco.data', help='coco.data file path')
parser.add_argument('--multi-scale', action='store_true', help='random image sizes per batch 320 - 608')
parser.add_argument('--img-size', type=int, default=32 * 13, help='pixels')
parser.add_argument('--resume', action='store_true', help='resume training flag')
parser.add_argument('--num-workers', type=int, default=0, help='number of workers for dataloader')
parser.add_argument('--var', type=float, default=0, help='test variable')

door5719 on 8 Mar 2019

@alexpolichroniadis thank you for your reply :)
I use the default params but modified the batch size.

door5719 on 8 Mar 2019

> @alexpolichroniadis thank you for your reply :)
> I use the default params but modified the batch size.

Looking at your output, it looks like you pulled an earlier commit from that PR, based on the exploding loss report I'm seeing. I fixed that today in the latest commit. Are you sure you are working off the latest commit of that PR?

alexpolichroniadis on 8 Mar 2019

I just downloaded it today from https://github.com/alexpolichroniadis/yolov3

door5719 on 8 Mar 2019

Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu will be better. I will try it.

door5719 on 8 Mar 2019

> Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu will be better. I will try it.

Yes, that is the correct branch to work off. Master still has an older version.

alexpolichroniadis on 8 Mar 2019

The code runs like this:
219 module.104.batch_norm_104.bias True 256 [256] 0 0
220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients

Epoch Batch xy wh conf cls total nTargets time

and then it stays in this state. It looks like the program may not be running.

door5719 on 8 Mar 2019

> The code runs like this:
> 219 module.104.batch_norm_104.bias True 256 [256] 0 0
> 220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
> 221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
> Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients
>
> Epoch Batch xy wh conf cls total nTargets time
>
> and then it stays in this state. It looks like the program may not be running.

I noticed that you are running this code on Windows. Keep in mind that PyTorch's DataParallel might not be operational on Windows machines due to lack of NCCL support, see here. My testing was on an Ubuntu machine.

alexpolichroniadis on 8 Mar 2019

@alexpolichroniadis, thanks for your help, the code works well now :)

door5719 on 8 Mar 2019

🎉1

@alexpolichroniadis I get an error when I run the following on a GCP PyTorch instance with 2 GPUs. I noticed you changed coco.data from the darknet default, so I updated this to point back to the default, and this fixed the error.

sudo rm -rf yolov3 && git clone -b multigpu --depth 1 https://github.com/alexpolichroniadis/yolov3
cd yolov3 && python3 train.py

Namespace(accumulated_batches=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', epochs=100, img_size=416, multi_scale=False, num_workers=0, resume=False, var=0)
Using CUDA. Available devices: 
0 - Tesla P100-PCIE-16GB - 16280MB
1 - Tesla P100-PCIE-16GB - 16280MB
Traceback (most recent call last):
  File "train.py", line 234, in <module>
    var=opt.var,
  File "train.py", line 46, in train
    train_loader = ImageLabelDataset(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
  File "/home/ultralytics/yolov3/utils/datasets.py", line 105, in __init__
    for x in self.img_files]
AttributeError: 'ImageLabelDataset' object has no attribute 'img_files'

Now I see a separate problem though: there doesn't appear to be any speedup. A single P100 takes about 0.6 s per batch, the same as 2 P100s here:

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
    0/99      0/7327      0.51      2.76       277      6.66       287       121         7
    0/99      1/7327      0.52       2.7       278      6.65       287        99     0.712
    0/99      2/7327     0.539      2.83       278      6.65       288       143     0.631
    0/99      3/7327     0.543      2.83       278      6.64       288       123     0.608
...

glenn-jocher on 8 Mar 2019

What's your batch size? If your batch fits on one GPU, then (in most cases) you are better off using a single GPU. The benefit of a multi-GPU setup is that you can crank up the batch size and process more images in the same amount of time. Try setting your --batch-size to 128 (or something beyond what a single GPU can handle), for example, and re-test.

Since batching happens on the CPU, there are also cases where the CPU becomes the bottleneck (the GPU waits for the batch to be created). This becomes more apparent with big batch sizes. Overall, there is a balance to be found, and it is not immediately obvious.

One other thing: with nn.DataParallel, each GPU is first loaded with its own copy of the model. This happens on the first batch and is reflected in the higher time reported for that batch.

alexpolichroniadis on 8 Mar 2019
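
To make the first-batch effect concrete, here is a minimal sketch (not code from the repo) that summarizes per-batch times while skipping the warm-up batch. The helper name `summarize_batch_times` and the `warmup` parameter are made up for illustration; the sample times are the `time` column from the 2x P100 log posted above.

```python
from statistics import mean


def summarize_batch_times(times_s, warmup=1):
    """Return (first_batch_s, steady_mean_s, fastest_s), ignoring `warmup` batches.

    `times_s` is a list of per-batch wall-clock times in seconds.
    """
    if len(times_s) <= warmup:
        raise ValueError("need more batches than warm-up steps")
    steady = times_s[warmup:]  # drop the batch(es) that include model replication
    return times_s[0], mean(steady), min(steady)


# Per-batch times (s) copied from the 2x P100 log earlier in this thread.
log_times = [7.0, 0.712, 0.631, 0.608]
first, steady_mean, fastest = summarize_batch_times(log_times)
print(f"first batch: {first:.2f}s, steady mean: {steady_mean:.3f}s, fastest: {fastest:.3f}s")
```

Comparing the steady-state mean (or the fastest batch) between runs avoids counting the one-off setup cost as a slowdown.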

An example:
On my setup, for a batch size of 128, processing time per batch is 1 sec.
For a batch size of 256, it is 1.6sec.
(all cases with dataparallel on).
On a single 1080Ti a batch size of 128 is not doable.

alexpolichroniadis on 8 Mar 2019

@door5719 @alexpolichroniadis thanks for the info. We started on our own multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach, selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so it appears the current method of selecting one anchor from each yolo layer is correct. In the process we did get multi_gpu operational, though not with the speedups expected. We did not attempt to use a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year.

We don't have multi-GPU machines on premises, so we tested this with GCP Deep Learning VMs. We used batch_size=26 (the max that 1 P100 can handle) times the number of GPUs. All other training settings were defaults. We selected the fastest batch out of the first 30 for timing purposes. Results are below for our branch and the https://github.com/ultralytics/yolov3/pull/121 PR. In both cases the speedups were very poor. It's possible the IO ops were constrained by GCP due to the limited SSD size; we will try again with a larger SSD, but we wanted to get these results out here for feedback. If anyone has another repo or PR we can compare against, please let us know!

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 500 GB SSD

GPUs (P100) | batch_size (images) | yolov3/tree/multi_gpu (s/batch) | yolov3/pull/121 (s/batch)
--- | --- | --- | ---
1 | 26 | 0.91s | 1.05s
2 | 52 | 1.60s | 1.76s
4 | 104 | 2.26s | 2.81s

glenn-jocher on 16 Mar 2019

I think torch.nn.parallel.DistributedDataParallel is better than nn.DataParallel. Using DataParallel is probably the bottleneck.

LightToYang on 17 Mar 2019

`box2` is a `torch.FloatTensor` because `anchor_vec` is on the CPU, while `box1` is on the GPU. So just use `.cuda()` to convert the data to a `torch.cuda.FloatTensor`:

`box2 = anchor_vec.cuda().unsqueeze(1)`
`inter_area = torch.min(box1, box2).prod(2)`

But once you fix this, the lines below will raise a similar error:

`txy[b, a, gj, gi] = gxy - gxy.floor()`
`# Width and height`
`twh[b, a, gj, gi] = torch.log(gwh / anchor_vec[a])`

You need to move each tensor to the GPU according to the error message.

However, the main obstacle to multi-GPU training lies in

`for i, (imgs, targets, _, _) in enumerate(dataloader):`

where `imgs` is a tensor but `targets` is a list. When the model is parallelized, `imgs.to(device)` is split into chunks of batch_size/GPU_nums images. But `targets` cannot be moved with `targets.to(device)` (since it is a list); it has as many entries as batch_size and cannot be distributed across the GPUs. In

`if nM > 0:`
`lxy = k * MSELoss(xy[mask], txy[mask])`
`lwh = k * MSELoss(wh[mask], twh[mask])`

`xy`, `txy`, `wh`, and `twh` do not all share the same batch dimension: `xy` and `wh` have batch_size/GPU_nums entries, but `txy` and `twh` have as many entries as the targets (batch_size), so an error will occur.

longxianlei on 18 Mar 2019
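
One way to picture the mismatch described above: if every target row carried the index of the image it belongs to, the targets could be split alongside the images. The sketch below is plain Python and only an illustration. The row layout `(image_index, class_id, x, y, w, h)` and the helper `split_batch` are assumptions for this example, not the repo's data format or its actual fix.

```python
import math


def split_batch(num_images, targets, num_gpus):
    """Split a batch into per-GPU chunks, carrying each chunk's targets along.

    `targets` is a list of (image_index, class_id, x, y, w, h) rows,
    a hypothetical layout chosen for this sketch.
    """
    if num_images <= 0 or num_gpus <= 0:
        raise ValueError("num_images and num_gpus must be positive")
    size = math.ceil(num_images / num_gpus)  # images per chunk
    chunks = []
    for start in range(0, num_images, size):
        stop = min(start + size, num_images)
        # Keep only rows for images in this chunk and re-base their image index.
        rows = [(i - start, *rest) for i, *rest in targets if start <= i < stop]
        chunks.append({"images": list(range(start, stop)), "targets": rows})
    return chunks


targets = [
    (0, 3, 0.5, 0.5, 0.2, 0.2),
    (2, 1, 0.1, 0.4, 0.3, 0.3),
    (3, 7, 0.6, 0.2, 0.1, 0.5),
]
for gpu, chunk in enumerate(split_batch(4, targets, 2)):
    print(gpu, chunk)
```

With this layout each chunk sees only the targets for its own images, so predictions and targets on a given GPU refer to the same subset of the batch.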

@longxianlei we just PRd our under-development multi_gpu branch into the master branch, so multi-GPU functionality now works. Many of the items you raised above should be resolved. Can you try the latest commit and see if it works for you? See #135 for more info.

glenn-jocher on 18 Mar 2019

@glenn-jocher keep in mind that batch sizes should be integer multiples of the number of available GPUs. For a batch size of 26 on 4 GPUs, you are essentially pushing 26//4 = 6 images on all GPUs and the two remaining ones are pushed on the last GPU. This is unbalanced as each GPU processes batch sizes of 6/6/6/8.

The ideal batch size to test here would be 4*6=24. And multiples of 24 thereafter. Also it is true that the actual bottleneck might be IO at this point.

alexpolichroniadis on 18 Mar 2019

👍1
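
The imbalance is easy to check with a few lines of arithmetic. The sketch below assumes the batch is split the way `torch.chunk` splits a tensor (equal chunks of ceil(batch/GPUs), with a smaller last chunk); that this matches what `nn.DataParallel` does internally is my understanding, not something verified in this thread. Under that assumption a batch of 26 on 4 GPUs comes out as 7/7/7/5 rather than 6/6/6/8, which is unbalanced either way. The helper name `chunk_sizes` is made up for this example.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Per-GPU batch sizes under a torch.chunk-style split (assumed behaviour)."""
    if batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch_size and num_gpus must be positive")
    size = math.ceil(batch_size / num_gpus)  # size of every chunk except maybe the last
    full, rest = divmod(batch_size, size)
    return [size] * full + ([rest] if rest else [])


for bs in (26, 24, 96):
    sizes = chunk_sizes(bs, 4)
    balanced = len(sizes) == 4 and len(set(sizes)) == 1
    print(f"batch_size={bs}: {sizes} balanced={balanced}")
```

Batch sizes of 24 and 96 split evenly across 4 GPUs, which is why multiples of the GPU count are the safer choice for benchmarking.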

Updated times with batch_size=24, and comparison to existing study.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100) | batch_size (images) | 613ce1be1aac84c729bb1e77b029df260644cc1c (s/batch) | COCO epoch (min/epoch)
--- | --- | --- | ---
1 | 24 | 0.84s | 70min
2 | 48 | 1.27s | 53min
4 | 96 | 2.11s | 44min

Comparison results from https://github.com/ilkarman/DeepLearningFrameworks
(Screenshot of the comparison results, 2019-03-19; image not included.)

glenn-jocher on 19 Mar 2019

👍1

https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100) | batch_size (images) | speed (s/batch) | COCO epoch (min/epoch)
--- | --- | --- | ---
1 | 16 | 0.39s | 48min
2 | 32 | 0.48s | 29min
4 | 64 | 0.65s | 20min

glenn-jocher on 21 Mar 2019

👍3 🎉2
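
For readers who want to relate the s/batch and min/epoch columns, here is a small sketch of the arithmetic. It assumes about 117,232 training images, inferred from the 7327 batches of 16 in the training log earlier in this thread; that figure and the helper names are assumptions for illustration, not values stated by the author.

```python
import math

# Assumption: dataset size inferred from "0/7327" batches at batch_size=16 in the log above.
NUM_IMAGES = 7327 * 16


def epoch_minutes(batch_size, s_per_batch, num_images=NUM_IMAGES):
    """Estimated minutes per epoch from batch size and seconds per batch."""
    return math.ceil(num_images / batch_size) * s_per_batch / 60


def throughput(batch_size, s_per_batch):
    """Images processed per second."""
    return batch_size / s_per_batch


# GPUs -> (batch_size, s/batch), copied from the table above.
results = {1: (16, 0.39), 2: (32, 0.48), 4: (64, 0.65)}
base = throughput(*results[1])
for gpus, (bs, spb) in results.items():
    tp = throughput(bs, spb)
    print(f"{gpus} GPU(s): {tp:.1f} img/s, {tp / base:.2f}x vs 1 GPU, "
          f"{epoch_minutes(bs, spb):.0f} min/epoch")
```

Under these assumptions the estimates land on the reported 48, 29 and 20 minutes, and the throughput gain works out to roughly 1.6x on 2 GPUs and 2.4x on 4 GPUs, so scaling is real but well short of linear.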

@glenn-jocher I never noticed that the default for the dataloader's num_workers was set to 0 because I set it manually all the time, whoops. 😅

Good results indeed. In line with what I was getting.

alexpolichroniadis on 21 Mar 2019

Had the same issue. But I've used this repo on multi-GPU before and it worked well. Somebody had posted that the batch size in the last iteration might be smaller than the batch size given during training, so I removed a few images to make the number of validation images a multiple of 8 (the batch size I'd given during training), and that solved the issue.
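
The workaround above can be checked with simple arithmetic before touching the dataset. This is a minimal sketch with hypothetical numbers (5003 images, batch size 8, 4 GPUs) and made-up helper names. If a standard `torch.utils.data.DataLoader` is in use, passing `drop_last=True` is an alternative that discards the incomplete final batch instead of editing the image list.

```python
def last_batch_size(num_images, batch_size):
    """Size of the final batch when nothing is dropped."""
    remainder = num_images % batch_size
    return remainder if remainder else batch_size


def trimmed_count(num_images, batch_size):
    """Largest image count <= num_images that is a multiple of batch_size."""
    return num_images - num_images % batch_size


# Hypothetical values, chosen only to illustrate the problem.
num_images, batch_size, num_gpus = 5003, 8, 4
last = last_batch_size(num_images, batch_size)
print(f"last batch holds {last} images for {num_gpus} GPUs")
print(f"keep {trimmed_count(num_images, batch_size)} images to get only full batches")
```

Here the final batch has 3 images, fewer than the 4 GPUs, so not every GPU can receive an image; trimming to 5000 images leaves only full batches.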