Yolov5: How to train with multi-GPU?

Created on 17 Jun 2020  ·  14Comments  ·  Source: ultralytics/yolov5

Hi Glenn-jocher, I ran the command "python train.py --data data/coco128.yaml --cfg models/yolov5s.yaml --weights '' --batch-size 16 --device '0,1'". It shows the error "Single-Process Multi-GPU is not the recommended mode for".

How do I use multi-GPU to train?

All 14 comments

@xjohnxjohn I hit the same problem; it seems to be caused by PyTorch 1.5.
I use
model = torch.nn.DataParallel(model)
in place of
model = torch.nn.parallel.DistributedDataParallel(model)

@glenn-jocher, I remember you mentioned this problem in yolov3,
but I'm not sure whether yolov5 will work on PyTorch 1.4.

@yxNONG @glenn-jocher Great, thank you.

@xjohnxjohn I have tried it with PyTorch 1.4; it runs fine without any changes.

But when I use model = torch.nn.DataParallel(model) in place of
model = torch.nn.parallel.DistributedDataParallel(model)
another error occurs:
Traceback (most recent call last):
  File "pytorch-yolov5/train.py", line 407, in <module>
    train(hyp)
  File "pytorch-yolov5/train.py", line 210, in train
    check_best_possible_recall(dataset, anchors=model.model[-1].anchor_grid, thr=hyp['anchor_t'], imgsz=imgsz)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'model'
@yxNONG @glenn-jocher
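
The AttributeError above arises because torch.nn.DataParallel keeps the original network under its .module attribute, so model.model is no longer reachable on the wrapper itself. Below is a minimal sketch of one way around it, assuming that duck typing on .module is acceptable. The helper name unwrap_model and the two stand-in classes are hypothetical, used only so the snippet runs without PyTorch or a GPU:

```python
class TinyNet:
    """Stand-in for the detection model; it only mimics the .model attribute."""
    def __init__(self):
        self.model = ["backbone", "detect_head"]


class FakeParallel:
    """Stand-in for torch.nn.DataParallel, which stores the wrapped network in .module."""
    def __init__(self, module):
        self.module = module


def unwrap_model(model):
    """Return the underlying network whether or not it has been wrapped."""
    return model.module if hasattr(model, "module") else model


wrapped = FakeParallel(TinyNet())
head = unwrap_model(wrapped).model[-1]          # works on the wrapper
plain_head = unwrap_model(TinyNet()).model[-1]  # and on a bare model
```

In train.py the same idea would mean going through unwrap_model(model) before reading .model[-1].anchor_grid. The maintainers' own fix in the latest code may look different.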

@zengjianyou this issue was already resolved a couple of days ago. Run git pull to get the latest code, which includes the fix.

@zengjianyou
My code is the version pulled on 6.12.
What I did was replace
dist.init_process_group(....)
model = torch.nn.parallel.DistributedDataParallel(model)
with
model = torch.nn.DataParallel(model)

and remove the line
dist.destroy_process_group() if torch.cuda.device_count() > 1 else None
before
return results
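
The swap described above can be sketched roughly as follows. This is an outline under stated assumptions, not the actual train.py code: wrap_single_process and its n_gpus parameter are hypothetical names, and in practice n_gpus would come from torch.cuda.device_count(). DataParallel runs in a single process, so neither dist.init_process_group(....) nor dist.destroy_process_group() is involved.

```python
def wrap_single_process(model, n_gpus):
    """Wrap model in DataParallel only when more than one GPU is available.

    The model is expected to already sit on the first CUDA device.
    """
    if n_gpus > 1:
        import torch  # imported here so the snippet can be tried without PyTorch installed
        return torch.nn.DataParallel(model)
    return model  # one GPU or CPU: use the model as-is


# With a single device the model comes back untouched.
dummy = object()
same = wrap_single_process(dummy, n_gpus=1)
```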


Besides, I tried running the code on PyTorch 1.4 with 4 GPUs and
model = torch.nn.parallel.DistributedDataParallel(model)
It works, but the AP is about 1/4 of the AP from training with 1 GPU.
It seems to report the result from only 1 GPU (so the result is 1/4).
@glenn-jocher does the latest version fix this?

@zengjianyou
I use PyTorch 1.5 and set mixed_precision = False all the time.

@zengjianyou this issue was already resolved a couple of days ago. Run git pull to get the latest code, which includes the fix.

ok

@zengjianyou what are your batch_size and number of GPUs?

I solved it by using the latest test code, thank you! @yxNONG

@zengjianyou you mean that in the latest code
model = torch.nn.parallel.DistributedDataParallel(model)
works, right?

@xjohnxjohn your environment may not be set up correctly. You could try our docker image below to see if you can reproduce the same error there:

Reproduce Our Environment

To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:

  • GCP Deep Learning VM with $300 free credit offer: See our GCP Quickstart Guide
  • Google Colab Notebook with 12 hours of free GPU time. Open In Colab
  • Docker Image https://hub.docker.com/r/ultralytics/yolov5. See Docker Quickstart Guide

I tried the Docker image; multi-GPU does not work there either. The message shown is different from before. (I have 2 GPU cards, an RTX2080 (8GB) and a GTX1080Ti (11GB), so their memory is not the same.)

@xjohnxjohn you should only train with identical GPUs. To use CUDA devices 0 and 1, for example: python train.py --device 0,1
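
A quick pre-flight check along these lines can catch a mixed-GPU setup before training starts. It is only a sketch: parse_devices and all_identical are hypothetical helpers, not part of train.py, and the device names are passed in as plain strings (with PyTorch they could be read via torch.cuda.get_device_name(i)).

```python
def parse_devices(device_arg):
    """Turn a --device string such as '0,1' into a list of integer indices."""
    return [int(d) for d in device_arg.split(",") if d.strip()]


def all_identical(names):
    """True when every listed GPU reports the same model name."""
    return len(set(names)) <= 1


ids = parse_devices("0,1")
mixed = all_identical(["RTX2080", "GTX1080Ti"])   # the two cards reported above
matched = all_identical(["RTX2080", "RTX2080"])
```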
