Thanks for your great work! I ran into a problem during distributed training (DDP) on a single machine.
When training on multiple GPUs with DDP, the following error appears:
Traceback (most recent call last):
File "train.py", line 422, in <module>
train() # train normally
File "train.py", line 172, in train
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
File "/data/bohang/conda_env/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 287, in __init__
self._ddp_init_helper()
File "/data/bohang/conda_env/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 380, in _ddp_init_helper
expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.
REQUIRED: Code to reproduce your issue below
python3 train.py --data data/coco64.data --img-size 320 --epochs 3 --nosave
Training should proceed normally, as it does on a single GPU.
This looks like a bug in PyTorch 1.5. As the README.md recommends, I am using the latest PyTorch (1.5), but because of constraints in my environment I am still on Python 3.6.
Hello @vandesa003, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
@vandesa003 yes, seems like a pytorch 1.5 bug. You can still use 1.4 without issue, we simply put this in the requirements because we have problems with people submitting bug reports on older versions of dependencies.
I suppose we will leave this open for now and follow the pytorch thread.
Got it. Thanks for your reply. I will try on pytorch1.4 then.
Ok! BTW, the docker images also use 1.5, but a pre-release development version, which works bug free for multigpu (we are using it now with 2 T4's without issue).
Wow, good to hear that. I've already switched to PyTorch 1.4 and the problem is fixed! Now I can start training, but sometimes I get a segmentation fault (core dumped) error. When I train on a very small sampled dataset (~5k images), everything works perfectly. But when I sample more data from the same source (around 500k images), I get the error.
I guess it's some kind of memory access error, but I don't know where or why. I am still working on it. Could you please give me some advice? Thanks a lot anyway!
Hi @vandesa003,
Are you using 4 GPUs when you get the segmentation fault? We get the same error when using 4 GPUs, but not with one or two GPUs. We investigated a little and found that it originates in PyTorch. We obtained the following traceback with Python's "faulthandler" module:
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99 in backward
File "/root/.trains/venvs-builds/3.6/lib/python3.6/site-packages/torch/tensor.py", line 195 in backward
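The faulthandler setup mentioned above can be sketched roughly as follows. This is a minimal illustration and not the commenter's actual code: the function name `enable_crash_tracebacks` and the log path are assumptions made for the example, and only the standard-library `faulthandler` module is used.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks to log_path if the process crashes hard.

    Hypothetical helper for illustration; call it once near the top of
    the training script, before the training loop starts.
    """
    # The file must stay open for as long as faulthandler is enabled,
    # so the handle is returned to the caller instead of being closed.
    log_file = open(log_path, "w")
    # all_threads=True dumps the stack of every Python thread on a
    # fatal signal such as SIGSEGV.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same effect should be available without code changes by launching the interpreter with `python3 -X faulthandler train.py ...`, which prints the traceback to stderr instead of a file.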
There are several PyTorch issues stating that this is fixed in version 1.5, but when we install 1.5 we get the "model replica" error.
I am writing this to give you some insight. If you find a solution to this issue, could you please let me know? Thanks.
@glenn-jocher I also want to ask you about this problem. Do you have any advice on it? Thanks..
@kaanakan @vandesa003 sorry, no insights on this, though you should check your system RAM (not GPU RAM) consumption and avoid --cache on large datasets. Other than that, we only train with 1-2 GPUs using the docker images and never see segmentation faults, so perhaps try the docker image:
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a:
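On the system RAM suggestion above, one rough way to watch host memory from inside the training process is sketched below. It assumes a Unix-like system, since it relies on the standard-library `resource` module; the helper name `peak_rss_mb` is made up for this example. It measures only the calling process, so memory held by separate worker processes is not included.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB.

    Hypothetical helper for illustration; Unix-only.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Printing `peak_rss_mb()` every few hundred batches, with and without --cache, should show whether host memory keeps growing as the dataset gets larger.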
@glenn-jocher Thanks a lot for the information! I think the segmentation faults may be caused by the --cache option or by the data itself. I will try your suggestion and report back here.
Try to use this command:
pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html
I had the same issue and solved it with this PyTorch version.
@adrianosantospb @glenn-jocher Thanks for your help! I fixed the issue by downgrading PyTorch to 1.4. As for the NaN loss, after I switched to Apex the issues were all gone. Amazing!
BTW, Apex really saved me a lot of time; it can speed up training by 2x.
This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
This bug still exists in PyTorch 1.5.
@jinfagang are you using Apex for training? Unfortunately we do almost all of our training on single GPU (T4 usually, V100 sometimes), so I've never run into this error message.
@jinfagang and everybody, I also had this bug and had to downgrade pytorch to 1.4. However, after updating to yesterday's pytorch 1.5.1 release, things are working again!
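Putting the reports in this thread together (1.4 works, the 1.5.0 release fails with the "model replicas" error, a pre-release 1.5 build and 1.5.1 work), a small guard could warn before DDP setup. This is only a sketch based on what people reported here, not an official compatibility list, and `is_affected_version` is a hypothetical name.

```python
def is_affected_version(version):
    """Return True for the PyTorch release reported as failing in this thread.

    Hypothetical helper for illustration; it only encodes the reports
    above, i.e. the 1.5.0 release, with or without a build suffix.
    """
    # Strip a local build suffix such as "+cu101" before comparing.
    release = version.split("+")[0]
    return release == "1.5.0"
```

In a training script this could be called as `is_affected_version(torch.__version__)` just before wrapping the model in DistributedDataParallel, printing a hint to install 1.4.0 or 1.5.1 when it returns True.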
@tjiagoM Seems work, but...