Yolov3: [Multi-GPU training error] RuntimeError: Model replicas must have an equal number of parameters.

Created on 27 Apr 2020 · 15 Comments · Source: ultralytics/yolov3

Thanks for your great work! I ran into a problem during distributed training on a single machine (DDP).

🐛 Bug

When training on multiple GPUs (DDP), the following error appears:

Traceback (most recent call last):
  File "train.py", line 422, in <module>
    train()  # train normally
  File "train.py", line 172, in train
    model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
  File "/data/bohang/conda_env/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 287, in __init__
    self._ddp_init_helper()
  File "/data/bohang/conda_env/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 380, in _ddp_init_helper
    expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.

To Reproduce

REQUIRED: Code to reproduce your issue below

python3 train.py --data data/coco64.data --img-size 320 --epochs 3 --nosave

Expected behavior

Training should run normally, as it does on a single GPU.

Environment

  • OS: [Ubuntu 16.04]
  • GPU [V100 * 4]

Additional context

It seems to be a bug in PyTorch 1.5. As README.md recommends, I am using the latest PyTorch (1.5). Because of constraints in my environment, I am still on Python 3.6.

Stale bug

All 15 comments

Hello @vandesa003, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

@vandesa003 yes, seems like a pytorch 1.5 bug. You can still use 1.4 without issue, we simply put this in the requirements because we have problems with people submitting bug reports on older versions of dependencies.

I suppose we will leave this open for now and follow the pytorch thread.

Got it. Thanks for your reply. I will try on pytorch1.4 then.

Ok! BTW, the docker images also use 1.5, but a pre-release development version, which works bug free for multigpu (we are using it now with 2 T4's without issue).

Wow, good to hear that. I've already switched to PyTorch 1.4 and the problem is fixed! Now I can start training, but sometimes I get a segmentation fault (core dumped) error. When I train on a very small sampled dataset (~5k images), everything works perfectly. But when I sample more data from the same source (around 500k images), I get the error.

I guess it's some kind of memory access error, but I don't know where or why. I am still working on the problem. Could you please give me some advice? Thanks a lot anyway!

Hi @vandesa003,

Are you using 4 GPUs when you get the segmentation fault? We get the same error with 4 GPUs, but not with one or two. We investigated a little and found that it is caused by PyTorch. We obtained the following traceback with Python's faulthandler module:

File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward

File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward

Several PyTorch issues state that this is fixed in version 1.5, but when we install version 1.5, we get the "model replicas" error.

I am writing just to share what we found. If you find a solution to this issue, could you please let me know? Thanks.
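
For readers who want to capture a similar traceback, here is a minimal sketch using Python's standard-library faulthandler module. The helper name enable_crash_tracebacks is hypothetical and not part of train.py; the sketch only assumes that it is called once, before training starts.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_file=None):
    """Enable faulthandler for all threads.

    Tracebacks go to log_file if one is given (it must stay open and
    expose fileno()), otherwise to sys.stderr. Returns True when the
    handler is active.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage near the top of a training script:
#   enable_crash_tracebacks()
#   train()
```

The same effect is available without code changes by starting the interpreter with `python3 -X faulthandler train.py ...`. Either way, the output shows only the Python frames that were active at the time of the crash, so it narrows down where the fault occurred rather than why.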

@glenn-jocher I also want to ask you about this problem. Do you have any advice on it? Thanks.

@kaanakan @vandesa003 sorry, no insights on this, though you should check your system RAM (not GPU RAM) consumption and avoid --cache on large datasets. Other than that, we only train with 1-2 GPUs using the docker images and never see segmentation faults, so perhaps try the docker image:

Reproduce Our Environment

To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:

@glenn-jocher Thanks a lot for the information! I think the segmentation faults may be caused by the --cache option or by the data itself. I will try your suggestion and report back here.
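
A rough back-of-the-envelope estimate suggests why caching could matter at this scale. The sketch below rests on assumptions not stated in the thread: cached images are held as decoded uint8 arrays of at most img_size x img_size x 3 bytes, and 320 is used only because it is the image size in the reproduction command. The function name estimate_cache_bytes is hypothetical.

```python
def estimate_cache_bytes(num_images: int, img_size: int, channels: int = 3) -> int:
    """Upper-bound RAM for cached images.

    Assumes each image is stored as a decoded uint8 array of at most
    img_size x img_size x channels bytes (an assumption, not a
    measurement of what the training script actually allocates).
    """
    return num_images * img_size * img_size * channels


# Compare the two dataset sizes mentioned in the thread.
for n in (5_000, 500_000):
    gb = estimate_cache_bytes(n, 320) / 1e9
    print(f"{n:>7} images at 320 px: ~{gb:.1f} GB")
```

Under these assumptions the 5k-image sample needs about 1.5 GB, while 500k images would need roughly 153.6 GB of system RAM, which would be consistent with the small dataset working and the large one failing. This is only an estimate and does not show that caching caused the segmentation faults.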

Try to use this command:

pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html

I had the same issue and solved it by using this PyTorch version.

@adrianosantospb @glenn-jocher Thanks for your help! I fixed the issue by downgrading PyTorch to 1.4. As for the NaN loss, after I started using Apex, the issues were all gone. Amazing!
BTW, Apex really saved me a lot of time; it can speed up training by 2x.

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

This bug still exists in PyTorch 1.5.

@jinfagang are you using Apex for training? Unfortunately we do almost all of our training on single GPU (T4 usually, V100 sometimes), so I've never run into this error message.

@jinfagang and everybody, I also had this bug and had to downgrade pytorch to 1.4. However, after updating to yesterday's pytorch 1.5.1 release, things are working again!

@tjiagoM Seems to work, but...
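
The version reports in this thread (1.4 works, 1.5 fails, 1.5.1 works again for at least one user) can be turned into a small guard that runs before multi-GPU training. This is a sketch based only on those reports, not on an official list of affected releases; the helpers parse_release and reported_affected are hypothetical names.

```python
import re


def parse_release(version: str):
    """Return (major, minor, patch) from strings such as '1.4.0+cu100'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def reported_affected(version: str) -> bool:
    """True only for release 1.5.0, per the reports in this thread.

    Pre-release builds such as '1.5.0a0+...' also parse to (1, 5, 0),
    even though one pre-release build was reported to work, so treat
    the result as a hint rather than a verdict.
    """
    return parse_release(version) == (1, 5, 0)


# Hypothetical usage before wrapping the model in DistributedDataParallel:
#   import torch
#   if reported_affected(torch.__version__):
#       print("PyTorch 1.5.0 detected; consider 1.4.0 or 1.5.1")
```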
