Yolov5: torch.nn.modules.module.ModuleAttributeError in DP and DDP mode

Created on 9 Aug 2020  ·  2 comments  ·  Source: ultralytics/yolov5

🐛 Bug

Due to the latest update in commit 3c6e2f7668ea178287040c14c4cf81f45357d50b, DP and DDP mode error out because the wrappers enclose the model, so the attribute `stride` can no longer be accessed directly on it.
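A minimal sketch of the failure mode, using plain-Python stand-ins (the `DDPWrapper` class below is a hypothetical simplification, not the real `DistributedDataParallel`): the wrapper stores the real model under `.module`, so attributes set on the inner model are not visible on the wrapper itself.

```python
class Model:
    """Stand-in for the YOLOv5 model, which defines a 'stride' attribute."""
    def __init__(self):
        self.stride = [8, 16, 32]

class DDPWrapper:
    """Hypothetical stand-in for nn.DataParallel / DistributedDataParallel:
    it keeps the wrapped model under .module and does not forward attributes."""
    def __init__(self, module):
        self.module = module

model = DDPWrapper(Model())

try:
    model.stride  # raises AttributeError: the wrapper has no 'stride'
except AttributeError as e:
    print(e)

print(max(model.module.stride))  # 32: the attribute is reachable via .module
```

Reading the attribute before wrapping (or going through `.module`) avoids the error.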

To Reproduce (REQUIRED)

Input:

python train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DDP

Output in DDP mode (DP mode output differs only slightly; the tracebacks of the two processes are interleaved):

Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Traceback (most recent call last):
  File "train.py", line 439, in <module>
Traceback (most recent call last):
  File "train.py", line 439, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    train(hyp, opt, device, tb_writer)
  File "train.py", line 144, in train
    gs = int(max(model.stride))  # grid size (max stride)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
Traceback (most recent call last):
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File ".conda/envs/py37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['.conda/envs/py37/bin/python', '-u', 'train.py', '--local_rank=1', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']' returned non-zero exit status 1.

Expected behavior

Runs like single-GPU mode.

Environment

  • OS: Ubuntu
  • GPU: V100s

Additional context

The solution is to move the lines below (specifically line 144) above the DP/DDP wrappers:
https://github.com/ultralytics/yolov5/blob/a0ac5adb7b71fb7a4b4747b3b37463f87247e1fa/train.py#L143-L145

to line 126
https://github.com/ultralytics/yolov5/blob/a0ac5adb7b71fb7a4b4747b3b37463f87247e1fa/train.py#L126-L129

I added a PR for your convenience. This is the minimal change needed, and it passes my unit test below. If you want to keep the image-size checks near the dataloaders, it may be possible to move those above the DP/DDP wrappers instead, but I'm not sure.
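The fix in outline (a hedged sketch with stand-in classes mirroring train.py's structure, not the exact diff): read `stride` from the raw model and derive the grid size before handing the model to the DP/DDP wrapper.

```python
class Model:
    """Stand-in for the unwrapped YOLOv5 model."""
    stride = [8, 16, 32]

class DP:
    """Stand-in wrapper in the style of nn.DataParallel."""
    def __init__(self, module):
        self.module = module

model = Model()

# 1) Grid-size check BEFORE wrapping, while model.stride is still visible.
gs = int(max(model.stride))   # grid size (max stride)
imgsz = (320 // gs) * gs      # round image size down to a multiple of gs

# 2) Only now wrap the model for multi-GPU training.
model = DP(model)

print(gs, imgsz)  # 32 320
```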

I think having CI or a unit test for DDP/DP mode is important, as it's easy to miss bugs like these. Of course, I understand that resources are expensive.

On a side note, is this now the right way to train from pretrained weights? Just pass --weights yolov5s.pt without the yolov5s.yaml config?


My Unit test (includes DDP/DP mode)


set -e 
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5.git && cd yolov5

#pip install -r requirements.txt onnx
#python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights $x.pt --epochs 3 --img 320 --device 0,1 # DDP train
  for di in 0,1 0 cpu # inference devices
  do
    python train.py --weights $x.pt --epochs 3 --img 320 --device $di  # train
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done

Label: bug

All 2 comments

@NanoCode012 thanks, just saw your PR and merged!

Yes, single and multi-GPU CI would be awesome. It's a very rare use case though, so I think there is only one company offering support for it, which charges hourly. Alternatively, I think GitHub Actions can use self-hosted runners that you can point to a cloud instance. This article just appeared a few days ago:
https://github.blog/2020-08-04-github-actions-self-hosted-runners-on-google-cloud/

If this could spin up a 2x K80 GPU VM (the cheapest and slowest GPUs on GCP), run additional CI tests on Linux on single and double GPU, and then immediately shut the VM down afterwards, the costs should be manageable.

But the blog post also notes:

⚠️ Note that these use cases are considered experimental and not officially supported by GitHub at this time. Additionally, it's recommended not to use self-hosted runners on public repositories for a number of security reasons.

@NanoCode012 oh, about your other question: yes, we can now 'finetune', i.e. start training from pretrained weights, just by supplying --weights; the --cfg is no longer required.

If you pass both a --cfg and --weights, the --cfg is used to create a model, and then any matching layers are transferred from the --weights. The anchors are on an exclude list of layers not to transfer, but I need to review this for the --resume use case.

Also, the hyps are now in their own file, data/hyp.yaml. If pretrained weights are supplied, the finetuning hyps are used. If no pretrained weights are supplied, the from-scratch hyps are used. If you supply your own --hyp, those are used instead. The two hyp files are identical for now, but may change in the future.
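The selection order described above can be sketched as follows (the file names and helper below are assumptions for illustration, not the exact train.py logic or paths):

```python
def pick_hyp(weights, user_hyp=None):
    """Hypothetical sketch of hyp-file selection:
    an explicit --hyp wins; otherwise pretrained weights imply
    finetuning hyps, and no weights implies from-scratch hyps."""
    if user_hyp:
        return user_hyp
    return "data/hyp.finetune.yaml" if weights else "data/hyp.scratch.yaml"

print(pick_hyp("yolov5s.pt"))            # finetuning hyps
print(pick_hyp(""))                      # from-scratch hyps
print(pick_hyp("yolov5s.pt", "my.yaml")) # user-supplied hyps win
```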

