As of the latest update in 3c6e2f7668ea178287040c14c4cf81f45357d50b, DP and DDP modes error out because they wrap the model, so the stride attribute can no longer be accessed directly.
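To illustrate why the attribute disappears, here is a minimal, self-contained sketch (not the actual train.py code): once the model is wrapped by DataParallel/DistributedDataParallel, attributes set on the original model are only reachable through .module.

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for the YOLOv5 model; `stride` mimics the attribute that train.py reads."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3)
        self.stride = torch.tensor([8., 16., 32.])

    def forward(self, x):
        return self.conv(x)

model = TinyModel()
print(int(max(model.stride)))            # 32 -- works before wrapping

model = nn.DataParallel(model)           # DDP behaves the same way
# int(max(model.stride))                 # AttributeError: 'DataParallel' object has no attribute 'stride'
print(int(max(model.module.stride)))     # workaround: go through .module, or read it before wrapping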
Input:
python train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --weights yolov5s.pt --epochs 3 --img 320 --device 0,1 # DDP
Output in DDP mode (DP mode output is just a bit different):
Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Transferred 370/370 items from yolov5s.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Traceback (most recent call last):
File "train.py", line 439, in <module>
Traceback (most recent call last):
File "train.py", line 439, in <module>
train(hyp, opt, device, tb_writer)
File "train.py", line 144, in train
gs = int(max(model.stride)) # grid size (max stride)
File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
train(hyp, opt, device, tb_writer)
File "train.py", line 144, in train
gs = int(max(model.stride)) # grid size (max stride)
File ".conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'stride'
Traceback (most recent call last):
File ".conda/envs/py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File ".conda/envs/py37/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File ".conda/envs/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['.conda/envs/py37/bin/python', '-u', 'train.py', '--local_rank=1', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']' returned non-zero exit status 1.
Expected behavior: it should run just like single-GPU mode.
The solution is to move the lines below (particularly line 144) above the DP/DDP wrappers:
https://github.com/ultralytics/yolov5/blob/a0ac5adb7b71fb7a4b4747b3b37463f87247e1fa/train.py#L143-L145
to line 126
https://github.com/ultralytics/yolov5/blob/a0ac5adb7b71fb7a4b4747b3b37463f87247e1fa/train.py#L126-L129
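Roughly, the reordering looks like this (a paraphrase of the linked lines, not a verbatim diff; names such as check_img_size, opt, cuda, rank and DDP come from train.py's surrounding context, and the wrapper lines are simplified):

gs = int(max(model.stride))  # grid size (max stride) -- read from the un-wrapped model (was line 144)
imgsz, imgsz_test = [check_img_size(x, gs) for x in opt.img_size]  # verify img sizes are gs-multiples

# DP / DDP wrapping now happens only after the lines above
if cuda and rank == -1 and torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
if cuda and rank != -1:
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)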
I added a PR for your convenience; it is the minimal change needed, and it passes my unit test. If you want to keep the image-size lines near the dataloaders, it may be possible to move those above the DP/DDP wrappers instead, but I'm not sure.
I think having CI or a unit test for DDP/DP mode would be important, as it's easy to miss bugs like these. Of course, I understand that resources are expensive.
On a side note, is this the right way to train from pretrained weights: just pass --weights yolov5s.pt without the yolov5s.yaml file?
My unit test (includes DDP/DP mode):
set -e
rm -rf yolov5 && git clone https://github.com/ultralytics/yolov5.git && cd yolov5
#pip install -r requirements.txt onnx
#python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights $x.pt --epochs 3 --img 320 --device 0,1 # DDP train
  for di in 0,1 0 cpu # inference devices
  do
    python train.py --weights $x.pt --epochs 3 --img 320 --device $di # train
    python detect.py --weights $x.pt --device $di # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
@NanoCode012 thanks, just saw your PR and merged!
Yes, single and multi-GPU CI would be awesome. It's a very rare use case though, so I think there is only one company offering support for it, which charges hourly. Alternatively, I think GitHub Actions can use self-hosted runners that you can point to a cloud instance. This article just appeared a few days ago:
https://github.blog/2020-08-04-github-actions-self-hosted-runners-on-google-cloud/
If this could spin up a 2x K80 GPU VM (the cheapest and slowest GPUs on GCP), we could run additional CI tests on Linux, at least on single and double GPU, and then immediately shut the VM down afterwards; the costs should be manageable.
But the blog post also notes:
⚠️ Note that these use cases are considered experimental and not officially supported by GitHub at this time. Additionally, it's recommended not to use self-hosted runners on public repositories for a number of security reasons.
@NanoCode012 oh, about your other question: yes, we can now 'finetune', i.e. start training from pretrained weights, just by supplying --weights; the --cfg is no longer required.
If you pass both --cfg and --weights, the --cfg is used to create a model, and then any matching layers are transferred from the --weights. The anchors are on an exclude list of layers not to transfer, but I need to review this for the --resume use case.
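For reference, a hedged sketch of that "transfer matching layers, skip the exclude list" step; the helper name and the exclude tuple are illustrative, not the exact yolov5 API:

import torch

def intersect_state_dicts(da, db, exclude=()):
    """Keep entries present in both dicts with matching shapes and not on the exclude list."""
    return {k: v for k, v in da.items()
            if k in db and v.shape == db[k].shape and not any(x in k for x in exclude)}

# model = Model(opt.cfg)                                   # built from --cfg (yolov5s.yaml)
# ckpt = torch.load(weights, map_location=device)          # --weights (yolov5s.pt)
# state = intersect_state_dicts(ckpt['model'].float().state_dict(),
#                               model.state_dict(),
#                               exclude=('anchor',))       # e.g. anchors are not transferred
# model.load_state_dict(state, strict=False)
# print(f'Transferred {len(state)}/{len(model.state_dict())} items')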
Also, the hyps are now in their own file in data/hyp.yaml. If pretrained weights are supplied, the finetuning hyps are used; if no pretrained weights are supplied, the from-scratch hyps are used. If you supply your own --hyp, those are used instead. The two hyp files are identical for now, but may change in the future.
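As a hedged sketch of that selection logic (the two file paths below are illustrative placeholders for the finetune and from-scratch hyp files, not necessarily the real file names):

import yaml

def load_hyp(weights='', user_hyp=''):
    if user_hyp:                                   # a user-supplied --hyp overrides everything
        path = user_hyp
    elif weights:                                  # pretrained weights given -> finetuning hyps
        path = 'data/hyp.finetune.yaml'
    else:                                          # training from scratch -> from-scratch hyps
        path = 'data/hyp.scratch.yaml'
    with open(path) as f:
        return yaml.safe_load(f)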