Hi @glenn-jocher, I ran the command "python train.py --data data/coco128.yaml --cfg models/yolov5s.yaml --weights '' --batch-size 16 --device '0,1'" and it shows the error "Single-Process Multi-GPU is not the recommended mode for".
How do I use multi-GPU training?
@xjohnxjohn I met the same problem; it seems to be caused by PyTorch 1.5.
I used
model = torch.nn.DataParallel(model)
to replace
model = torch.nn.parallel.DistributedDataParallel(model)
@glenn-jocher, I remember you mentioned this problem in YOLOv3,
but I'm not sure whether YOLOv5 will work on PyTorch 1.4.
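For anyone reading along, here is a minimal sketch of that single-process DataParallel fallback (the toy model is illustrative, not the actual YOLOv5 model):

```python
import torch
import torch.nn as nn

# Toy network standing in for the YOLOv5 model (illustrative only)
model = nn.Linear(10, 2)

if torch.cuda.device_count() > 1:
    # Single-process multi-GPU: each batch is split across the visible devices.
    # Unlike DistributedDataParallel, no init_process_group() call is needed.
    model = nn.DataParallel(model)

out = model(torch.randn(4, 10))  # forward pass works the same either way
```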
@yxNONG @glenn-jocher Great, thank you.
@xjohnxjohn I tried it with PyTorch 1.4; it runs well without any changes.
But when I use
model = torch.nn.DataParallel(model)
to replace
model = torch.nn.parallel.DistributedDataParallel(model)
another error occurs:
Traceback (most recent call last):
  File "pytorch-yolov5/train.py", line 407, in <module>
    train(hyp)
  File "pytorch-yolov5/train.py", line 210, in train
    check_best_possible_recall(dataset, anchors=model.model[-1].anchor_grid, thr=hyp['anchor_t'], imgsz=imgsz)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'model'
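That error is just DataParallel wrapping the network: custom attributes such as model now live on the inner module and are reachable through .module. A tiny illustration (the Net class is hypothetical; only the attribute name comes from the traceback):

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # stands in for YOLOv5's internal 'model' layer list
        self.model = nn.Sequential(nn.Linear(4, 4))

net = nn.DataParallel(Net())
# net.model would raise: AttributeError: 'DataParallel' object has no attribute 'model'
layers = net.module.model  # go through .module to reach the wrapped network
```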
@yxNONG @glenn-jocher
@zengjianyou this issue was resolved a couple of days ago. Git pull to get the latest code, including this fix.
@zengjianyou
my code is the version pulled at 6.12.
What I did is replace
dist.init_process_group(....)
model = torch.nn.parallel.DistributedDataParallel(model)
with
model = torch.nn.DataParallel(model)
and remove the line
dist.destroy_process_group() if torch.cuda.device_count() > 1 else None
before
return results
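For context, the two removed pieces are the standard DistributedDataParallel lifecycle: initialize a process group before wrapping the model, and destroy it once training returns. A rough single-node sketch (the backend choice and env settings are my assumptions, not the repo's exact code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def setup_ddp(model, rank=0, world_size=1):
    # Illustrative single-node defaults; train.py's real setup may differ.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return torch.nn.parallel.DistributedDataParallel(model)

def teardown_ddp():
    # Mirrors the removed line: tear the group down before train() returns.
    if dist.is_initialized():
        dist.destroy_process_group()
```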
Besides, I tried running the code in PyTorch 1.4 with 4 GPUs using
model = torch.nn.parallel.DistributedDataParallel(model)
It works, but the AP is about 1/4 of the AP from training with 1 GPU;
it seems like it only reports the result from 1 GPU (so the result is 1/4).
@glenn-jocher does the latest version fix this?
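A plausible explanation for the 1/4 AP: with 4 DDP processes each evaluating only its own shard, a metric printed from a single process covers about a quarter of the data. A common pattern (a sketch of the general fix, not the repo's actual code) is to average the metric across processes before reporting it:

```python
import torch
import torch.distributed as dist

def average_metric(local_metric: float) -> float:
    """Average a per-process scalar (e.g. AP) across all DDP processes."""
    t = torch.tensor([local_metric], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum the per-process values
        t /= dist.get_world_size()                # then take the mean
    return t.item()
```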
@zengjianyou
I use PyTorch 1.5 and have set mixed_precision = False the whole time.
@zengjianyou this issue was resolved a couple of days ago. Git pull to get the latest code, including this fix.
ok
@zengjianyou what is your batch size and GPU count?
I solved it when I used the latest test code, thank you! @yxNONG
@zengjianyou you mean that in the latest code
model = torch.nn.parallel.DistributedDataParallel(model)
works, right?
@xjohnxjohn your environment may not be set up correctly. You could try our docker image below to see if you can reproduce the same error there:
Reproduce Our Environment
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:
- GCP Deep Learning VM with $300 free credit offer: See our GCP Quickstart Guide
- Google Colab Notebook with 12 hours of free GPU time.
- Docker Image https://hub.docker.com/r/ultralytics/yolov5. See Docker Quickstart Guide
I tried using the Docker image; multi-GPU is still not working, and the error message is different from before. (I have 2 GPUs, an RTX 2080 (8GB) and a GTX 1080 Ti (11GB), so the memory is not the same.)
@xjohnxjohn you should only train with identical GPUs. To use CUDA devices 0 and 1, for example: python train.py --device 0,1