Yolov3: Multi-scale incompatibility with Rectangular Training

Created on 3 Jul 2019 · 6 Comments · Source: ultralytics/yolov3

Describe the bug
I trained on 1024x288 images using four 1080 Ti GPUs. With both rectangular training and multi-scale training enabled, a tensor size mismatch occurs in the route modules. The error seems to occur because the input image dimensions are not multiples of 32.
The images do appear to be resized and padded to multiples of 32, but the resize step of the multi-scale module then changes those dimensions. Swapping the order of the two modules may fix the problem. Computing the target width and height explicitly, instead of applying a single resize factor, would help too.

cmiller2air

All 6 comments

@cmiller2air let's see. Rectangular training sorts all of your training images by aspect ratio and groups images with similar aspect ratios into a single batch. It then pads the images in each batch by the minimum amount needed for the batch dimensions to be multiples of 32.
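
The grouping described above can be sketched roughly as follows. This is a simplified illustration, not the repository's implementation: the function names, the `(width, height)` input format, and the choice to scale each image so its long side equals `img_size` are all assumptions made for the example.

```python
def ceil_div(a, b):
    # Integer ceiling division; avoids floating-point rounding surprises.
    return -(-a // b)


def rect_batches(sizes, batch_size, img_size=416, stride=32):
    """Group images by aspect ratio and compute one padded shape per batch.

    sizes: list of (width, height) tuples (hypothetical input format).
    Returns a list of (image_indices, (padded_height, padded_width)).
    """
    # Sort image indices by aspect ratio (height / width).
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][1] / sizes[i][0])
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        heights, widths = [], []
        for i in idx:
            w, h = sizes[i]
            long_side = max(w, h)
            # Scale so the long side equals img_size (an assumption of this sketch).
            heights.append(ceil_div(h * img_size, long_side))
            widths.append(ceil_div(w * img_size, long_side))
        # Pad the largest height and width in the batch up to a multiple of stride.
        shape = (ceil_div(max(heights), stride) * stride,
                 ceil_div(max(widths), stride) * stride)
        batches.append((idx, shape))
    return batches


sizes = [(1024, 288), (640, 480), (1000, 300), (480, 640)]
for idx, shape in rect_batches(sizes, batch_size=2):
    print(idx, shape)
```

Because each batch is built from neighbours in the sorted order, this only works if the data loader keeps that order, which is why shuffling is a concern.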

I'm not sure whether rectangular training and multi-scale training are mutually exclusive, though one important caveat with rectangular training is that you need to be sure the images are not shuffled by the data loader.

test.py uses rectangular images by default in this repo, so the rectangular functionality is operating properly by itself. Can you post your command and the screen outputs of the command?

glenn-jocher on 3 Jul 2019

I took a look at this. It seems the interpolation operation in train.py resizes the long side to a new multiple of 32, but the resized image is no longer guaranteed to be a multiple of 32 on the shorter side. This is what is causing the errors.

https://github.com/ultralytics/yolov3/blob/ab141fcc1ff976fa9d1bd983e11fe43ba4628e2e/train.py#L199-L206

This will need some significant additional logic to correct. We will leave this issue open. I can't give you a timeline for a fix, but an immediate workaround is to use rectangular training with --single-scale.
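
To make the failure mode concrete, here is a minimal sketch comparing a single resize factor with per-dimension rounding. It is not the code from train.py: the helper names are hypothetical, the truncating behaviour of the factor-based resize is an assumption, and the 960 target is just an example value picked for a 288x1024 batch.

```python
def ceil_div(a, b):
    # Integer ceiling division.
    return -(-a // b)


def scale_by_factor(height, width, new_long_side):
    """Resize with one factor derived from the long side (truncating)."""
    long_side = max(height, width)
    return (height * new_long_side // long_side,
            width * new_long_side // long_side)


def scale_to_grid(height, width, new_long_side, stride=32):
    """Resize, then round each dimension up to a multiple of stride."""
    long_side = max(height, width)
    new_h = ceil_div(height * new_long_side, long_side)
    new_w = ceil_div(width * new_long_side, long_side)
    return ceil_div(new_h, stride) * stride, ceil_div(new_w, stride) * stride


# A 288x1024 (height x width) batch rescaled so the long side becomes 960.
print(scale_by_factor(288, 1024, 960))  # short side lands off the 32-pixel grid
print(scale_to_grid(288, 1024, 960))    # both sides are multiples of 32
```

Rounding each dimension separately changes the aspect ratio slightly, which is a trade-off of this approach rather than something the thread confirms.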

glenn-jocher on 3 Jul 2019

I have experienced the same problem. I cannot use the --rect flag together with --multi-scale; when I try, I get the following error:


Traceback (most recent call last):
  File "train.py", line 334, in <module>
    accumulate=opt.accumulate)
  File "train.py", line 207, in train
    pred = model(imgs)
  File "/home/tomekb/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tomekb/pytorch_yolov3/models.py", line 189, in forward
    x = torch.cat([layer_outputs[i] for i in layer_i], 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 27 and 28 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71
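
A mismatch such as "27 and 28" is what one would expect when a 2x-upsampled stride-32 feature map is concatenated with a stride-16 feature map and the input height is not a multiple of 32. The sketch below assumes five 3x3, stride-2, padding-1 downsampling convolutions, and the height of 424 is an assumed value chosen only to reproduce the numbers; the actual input height in this run is not shown in the traceback.

```python
def conv_out(size, kernel=3, stride=2, pad=1):
    # Standard output-size formula for a strided convolution.
    return (size + 2 * pad - kernel) // stride + 1


def route_sizes(height, downsamples=5):
    """Return (stride-16 size, 2x-upsampled stride-32 size) for one dimension."""
    sizes = [height]
    for _ in range(downsamples):
        sizes.append(conv_out(sizes[-1]))
    stride16 = sizes[-2]       # feature map after four downsamples
    upsampled = sizes[-1] * 2  # stride-32 map after 2x upsampling
    return stride16, upsampled


print(route_sizes(416))  # multiple of 32: both sizes agree
print(route_sizes(424))  # not a multiple of 32: sizes differ by one
```

The two sizes agree only when the input dimension divides evenly through every downsampling step, which is why both sides of the image need to be multiples of 32.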

Bienqq on 19 Jul 2019

@Bienqq @cmiller2air since we have now had multiple requests, we elevated the issue status and implemented a fix in https://github.com/ultralytics/yolov3/commit/44b340321fef16ee21e9fba43caa7ddebef03c2f. Multi-scale training should now be compatible with rectangular training.

Please git pull and try again. Thank you!

glenn-jocher on 20 Jul 2019

Everything appears to be operating correctly in our COCO tests:

python3 train.py --data data/coco.data --img-size 320 --rect --multi-scale

Namespace(accumulate=4, batch_size=16, bucket='', cfg='cfg/yolov3-spp.cfg', data='data/coco.data', epochs=100, evolve=False, img_size=320, multi_scale=True, nosave=False, notest=False, num_workers=4, rect=True, resume=False, transfer=False, var=0, xywh=False)
Using CUDA with Apex device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Reading image shapes: 100% 117263/117263 [03:24<00:00, 572.30it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

     Epoch   gpu_mem   GIoU/xy        wh       obj       cls     total   targets  img_size
      0/99      3.7G     0.565         0      10.6      13.2      24.4        93       352:   4% 297/7329 [01:55<33:54,  3.46it/s]

glenn-jocher on 20 Jul 2019

Closing, as the issue should be resolved.

glenn-jocher on 23 Jul 2019
