Hi @glenn-jocher, I ran the command "python train.py --data data/coco128.yaml --cfg models/yolov5s.yaml --weights '' --batch-size 16 --device '0,1'" and it shows the error "Single-Process Multi-GPU is not the recommended mode for".
How do I use multi-GPU training?
@xjohnxjohn I met the same problem; it seems to be caused by PyTorch 1.5.
I used
model = torch.nn.DataParallel(model)
to replace
model = torch.nn.parallel.DistributedDataParallel(model)
@glenn-jocher, I remember you mentioned this problem in YOLOv3,
but I'm not sure whether YOLOv5 will work on PyTorch 1.4.
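For anyone reading along, here is a minimal sketch of that single-process DataParallel fallback (the toy model is illustrative, not the actual YOLOv5 model):

```python
import torch
import torch.nn as nn

# Toy network standing in for the YOLOv5 model (illustrative only)
model = nn.Linear(10, 2)

if torch.cuda.device_count() > 1:
    # Single-process multi-GPU: each batch is split across the visible devices.
    # Unlike DistributedDataParallel, no init_process_group() call is needed.
    model = nn.DataParallel(model)

out = model(torch.randn(4, 10))  # forward pass works the same either way
```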
@yxNONG @glenn-jocher Great, thank you.
@xjohnxjohn I tried it with PyTorch 1.4; it runs well without any changes.
But when I use
model = torch.nn.DataParallel(model)
to replace
model = torch.nn.parallel.DistributedDataParallel(model)
another error occurs:
Traceback (most recent call last):
  File "pytorch-yolov5/train.py", line 407, in <module>
    train(hyp)
  File "pytorch-yolov5/train.py", line 210, in train
    check_best_possible_recall(dataset, anchors=model.model[-1].anchor_grid, thr=hyp['anchor_t'], imgsz=imgsz)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'model'
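That error is just DataParallel wrapping the network: custom attributes such as model now live on the inner module and are reachable through .module. A tiny illustration (the Net class is hypothetical; only the attribute name comes from the traceback):

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # stands in for YOLOv5's internal 'model' layer list
        self.model = nn.Sequential(nn.Linear(4, 4))

net = nn.DataParallel(Net())
# net.model would raise: AttributeError: 'DataParallel' object has no attribute 'model'
layers = net.module.model  # go through .module to reach the wrapped network
```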
@yxNONG @glenn-jocher
@zengjianyou this issue was resolved a couple of days ago. Git pull to get the latest code, including this fix.
@zengjianyou
my code is the version pulled at 6.12.
What I did is replace
dist.init_process_group(....)
model = torch.nn.parallel.DistributedDataParallel(model)
with
model = torch.nn.DataParallel(model)
and remove the line
dist.destroy_process_group() if torch.cuda.device_count() > 1 else None
before
return results
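For context, the two removed pieces are the standard DistributedDataParallel lifecycle: initialize a process group before wrapping the model, and destroy it once training returns. A rough single-node sketch (the backend choice and env settings are my assumptions, not the repo's exact code):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def setup_ddp(model, rank=0, world_size=1):
    # Illustrative single-node defaults; train.py's real setup may differ.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return torch.nn.parallel.DistributedDataParallel(model)

def teardown_ddp():
    # Mirrors the removed line: tear the group down before train() returns.
    if dist.is_initialized():
        dist.destroy_process_group()
```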
Besides, I tried running the code in PyTorch 1.4 with 4 GPUs using
model = torch.nn.parallel.DistributedDataParallel(model)
It works, but the AP is about 1/4 of the AP from training with 1 GPU;
it seems like it only reports the result from 1 GPU (so the result is 1/4).
@glenn-jocher does the latest version fix this?
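A plausible explanation for the 1/4 AP: with 4 DDP processes each evaluating only its own shard, a metric printed from a single process covers about a quarter of the data. A common pattern (a sketch of the general fix, not the repo's actual code) is to average the metric across processes before reporting it:

```python
import torch
import torch.distributed as dist

def average_metric(local_metric: float) -> float:
    """Average a per-process scalar (e.g. AP) across all DDP processes."""
    t = torch.tensor([local_metric], dtype=torch.float64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum the per-process values
        t /= dist.get_world_size()                # then take the mean
    return t.item()
```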
@zengjianyou
I use PyTorch 1.5 and have set mixed_precision = False the whole time.
@zengjianyou this issue was resolved a couple of days ago. Git pull to get the latest code, including this fix.
ok
@zengjianyou what is your batch size and GPU count?
I solved it when I used the latest test code, thank you! @yxNONG
@zengjianyou you mean that in the latest code
model = torch.nn.parallel.DistributedDataParallel(model)
works, right?
@xjohnxjohn your environment may not be set up correctly. You could try our docker image below to see if you can reproduce the same error there:
Reproduce Our Environment
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:
- GCP Deep Learning VM with $300 free credit offer: See our GCP Quickstart Guide
- Google Colab Notebook with 12 hours of free GPU time.
- Docker Image https://hub.docker.com/r/ultralytics/yolov5. See Docker Quickstart Guide
I tried using the Docker image; multi-GPU is still not working, and the error message is different from before. (I have 2 GPUs, an RTX 2080 (8GB) and a GTX 1080 Ti (11GB), so the memory is not the same.)
@xjohnxjohn you should only train with identical GPUs. To use CUDA devices 0 and 1, for example: python train.py --device 0,1