Hello, @glenn-jocher
Based on your advice, multi-process DistributedDataParallel should be better than the single-process setup we have now.
I have been trying to implement it on the ddp branch of my fork (apologies that it's messy). However, I've run into many issues.
Since this is still in testing, I haven't accounted for the device being cpu yet.
What I did so far
Added a setup method to call init_process_group and set the torch.cuda device (rough sketch after this list)
Called torch.multiprocessing.spawn on the modified train function
Added a new world_size argument to pass when running the script (we can change this to counting the number of devices later)
Added condition checks so that only one process downloads the weights file, removes batch.jpg, and saves checkpoints
~Added dist.barrier() so the other processes wait for the first process to finish its job~
~Replaced all .to(device) calls with .to(rank) for each process.~
~Changed map_location for loading weights.~
Added more parameters to the train function because the spawned processes cannot see the global variables
Added a DistributedSampler so each GPU's process gets a different subset of the dataset
~Turned off tensorboard, as I needed to pass tb_writer to train as an argument to be able to use it.~
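For reference, here is a rough, minimal sketch of that setup (the function names and env-var defaults are illustrative only, not the exact code in my branch):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Placeholder address/port; any free port works for single-node training.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)
    # ... build the model on this rank, wrap it in DDP, run the epoch loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # mp.spawn passes the process index (rank) as the first argument to train()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```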
Things to fix
Do not split the validation set across processes in create_dataloader
Remove the need to pass world_size as an argument just to request multi-process training
~Cleaning up~
Fix the inconsistent console output (all processes printing at once makes it hard to track)
~Enable tensorboard again~
Split batch_size / learning rate / epochs across multiple GPUs
Figure out why the module-level globals are re-executed (I disabled print(hyp) because of this)
Problems
Since I am still learning, it is very likely I broke the training. The information learned each epoch may not be shared between the processes, because when I tested training the mAP stayed at 0. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it.
Saving checkpoints is done only by the first process, as concurrent saving from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct approach.
I am testing it, and it is much slower than using just one GPU, but I figure that if the training is fixed, it can be a good boost for multi-GPU.
I also understand that this isn't high on your priority list, but some guidance would be appreciated. Thank you.
Hello @NanoCode012, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
Hello, I think I set it up properly according to many examples from PyTorch's official docs as well as others' implementations of DDP. However, it is a lot slower than running on a single GPU (I tried 2 GPUs for now). Also, the mAP stays at 0 throughout. I am not sure why specifically.
Furthermore, I notice the global variables being re-executed when we enumerate(dataloader). This could be the cause of the slowdown.
I'm just passing by, but DDP should be faster (and IS faster in my runs) than DP. You probably missed something; it also depends on how you launch it.
Check my working DDP train.py for classification; maybe you would notice the difference from yours. My implementation can be run on 1 GPU simply by calling python3 train.py. If you want DDP, the correct way to launch it is python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS. There is absolutely no need to pass anything else explicitly. launch sets the environment variables for you, and you can read them anywhere in the script using something like:
```python
import os

def env_world_size():
    # WORLD_SIZE is set by torch.distributed.launch; default to 1 for single-GPU runs
    return int(os.environ.get("WORLD_SIZE", 1))

def env_rank():
    # RANK is set by torch.distributed.launch; default to 0 for single-GPU runs
    return int(os.environ.get("RANK", 0))
```
Thanks for checking it out. In my main function, I use multiprocessing.spawn to create N processes (1 per GPU). I believe they are equivalent. I will look over your code.
Something weird I noticed is what I mentioned:
Furthermore, I notice the global variables being re-executed when we enumerate(dataloader).
I added a print("Test") outside of every function and noticed it being called 8 times per GPU/process. Do you know of any reason why that may happen? (I believe 8 is the number of workers passed to the DataLoader.)
Below is the line that causes the problem.
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/train.py#L240
where dataloader comes from,
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/utils/datasets.py#L44-L73
EDIT: I just ran a quick toy model and DDP easily beat DP as expected. I guess something in the code is messing it up.
@bonlime , something I've noticed is that you mentioned we should create the EMA before wrapping our model in DDP. However, yolov5 does the opposite: the EMA is created after the DDP wrap. Do you think it is related?
https://github.com/bonlime/sota_imagenet/blob/2fb3e46a82fbf9d767df75f3ba4fd6d8517cd567/train.py#L246-L249
https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L161
https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L196
Actually, one more thing: the single-process DDP that's already in the original code does not seem to be any faster in my tests than a single GPU.
Regarding the order of EMA and DDP - it's an implementation-specific issue. My version would probably fail if used after DDP. I don't think the order would cause any slowdown, but you could test by commenting it out.
Regarding single-process DDP - I don't really understand what you mean. Single-process DistributedDataParallel is just DataParallel, isn't it? Why would you expect it to be faster?
Uhm, I am not sure if they are the same, but they are listed under two different docs.
https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html
I think you're right that they work similarly, but maybe there's some difference in the implementation?
Btw, do the processes regulate GPU memory usage (i.e., should it be the same across GPUs)? My first GPU takes 25 GB of memory, whereas the second takes 13 GB. Then they randomly swap.
Also, it appears the training works, but performance is very poor.
Single GPU (original code): mAP starts rising from the 12th epoch.
Dual GPU (my code): mAP only starts to crawl up from the 30th epoch.
DP is very different from DDP as the docs clearly show.
I've trained a lot of models using DDP and never faced performance issues. It all depends on the implementation though. Try checking some other codebases to understand how to make DDP work and to avoid bugs. I'm pretty sure the issue is some silly bug somewhere 🙃
About GPU memory: for me, the rank 0 process usually has slightly larger memory consumption (by 1-2 GB). After 1 epoch, memory consumption doesn't really change.
@bonlime , so it would be weird that my memory usage is so different and swaps every epoch, right?
Also, I checked out multiple PyTorch examples from the official docs and others' GitHub repos.
The main things are:
Set up the init process group
Set the CUDA device: .to() for tensors and set_device()
Set map_location
Set up mp.spawn
Use DistributedSampler
Call set_epoch on the train sampler each epoch (rough sketch below)
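A rough sketch of the sampler-related steps, assuming the process group is already initialized (dataset, batch_size, and epochs are placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# DistributedSampler reads rank/world_size from the initialized process group,
# so each process sees a different shard of the dataset.
train_sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        sampler=train_sampler,  # mutually exclusive with shuffle=True
                        num_workers=8,
                        pin_memory=True)

for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
    for imgs, targets in dataloader:
        ...  # forward/backward as usual
```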
Please tell me if I missed anything
I tested my current branch against yours (taken after my ema patch) with COCO 2017 for 10 epochs, to test speed using two GPUs on yolov5s.
python train.py --weights "" --data coco.yaml --cfg "yolov5s.yaml" --epochs 10 --img 640 --device 0,1 --batch-size 128 --nosave
Here are the results.
My branch
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 21.4G 0.06038 0.09437 0.04652 0.2013 64 640
8/9 22.3G 0.06021 0.09389 0.04625 0.2004 64 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.238 0.321 0.234 0.116
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 22.3G 0.05955 0.09358 0.04501 0.1981 123 640
9/9 21.4G 0.05959 0.09371 0.0452 0.1985 55 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.245 0.334 0.248 0.126
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.134
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.254
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.128
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.152
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.168
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.171
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.175
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.402
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.487
Optimizer stripped from weights/last_.pt
10 epochs completed in 1.638 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5 mAP@.5:.95
all 128 929 0.258 0.359 0.315 0.176
Speed: 3.2/3.3/6.4 ms inference/NMS/total per 640x640 image at batch-size 32
My Patch-1 branch (Ema-patch)
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 14.9G 0.05715 0.09235 0.03983 0.1893 202 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.292 0.415 0.331 0.182
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 14.9G 0.05591 0.09165 0.03808 0.1856 204 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.291 0.439 0.348 0.194
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.355
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.209
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.230
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.214
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.379
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.429
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.228
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.475
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.568
Optimizer stripped from weights/last.pt
10 epochs completed in 2.302 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5 mAP@.5:.95
all 128 929 0.267 0.441 0.401 0.246
Speed: 2.1/2.3/4.3 ms inference/NMS/total per 640x640 image at batch-size 32
From these results, we can see that it's a lot faster than single-process DDP now. The main drawback, however, is accuracy. I am not sure what the problem is. From what I read, multi-process DDP automatically syncs gradients, so all the values should be the same at the end, and I should not need to modify the code significantly.
I also do not know why the load on the GPUs is so different. I am thinking it could be related to each GPU creating its own dataloader in multi-process mode, compared with all GPUs sharing one dataloader in single-process mode.
Do you have any opinions on this @glenn-jocher ? I am now running single GPU on both branches to benchmark it.
Should I run a full 300 epoch? Should I change model?
I have been working on DDP improvements since a week ago!
See issue #177!
There is a lot of original code that has to be revised to make DDP work, because the original codebase is complicated!
You have to make sure that everything is synchronized across the processes!
I will make a pull request very soon if my next experiment comes out well. See the code then!
By the way, I found that training for 10 epochs is not quite enough to analyze performance, unless you use SyncBN! BatchNorm will be a problem early in training.
In case you are in a hurry to use DDP, see my fork.
Tests are still running, but I think this will be the final version if SyncBN is not added.
If you have GPU resources, you can also help run the test!
```bash
python -m torch.distributed.launch --nproc_per_node 4 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1,2,3
```
@MagicFrogSJTU
Ohh, I saw your thread, but I mistook it for an issue in another repo. I will check it out. Can you tell me what you've changed so we can compare notes?
By the way, I found that training 10 epochs are not quite enough to analyze the performance. Unless you use SyncBN!
Yes, I want to do this, but I'm not sure where in the code it is.
So far, my small single-GPU experiment is done.
My branch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05582 0.0913 0.03765 0.1848 194 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.294 0.437 0.35 0.194
10 epochs completed in 2.456 hours.
For some reason, there were semaphore errors despite running in a single process.
Traceback (most recent call last):
File "python3.7/multiprocessing/util.py", line 277, in _run_finalizers
finalizer()
File "python3.7/multiprocessing/util.py", line 201, in __call__
res = self._callback(*self._args, **self._kwargs)
File "python3.7/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 33 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:156: UserWarning: semaphore_tracker: '/mp-b3j04ac7': [Errno 2] No such file or directory
warnings.warn('semaphore_tracker: %r: %s' % (name, e))
Branch from ema-patch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05592 0.09151 0.03777 0.1852 204 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.289 0.435 0.348 0.194
10 epochs completed in 2.200 hours.
This explains why my earlier multi-GPU single-process test took half the load: the load was shared between the two GPUs.
@MagicFrogSJTU , hello, I looked over your code. It is really nice. There are a few things I would like to add.
1) There was an update to torch.utils, so ema is cleaner now.
2) Do you have an article on where to initialize ema? ~I also see that you only allow 1 process to go through ema. That's something I was thinking of, because it was redundant to do multiple deep copies. You beat me right to it!~ Maybe I misread. Not sure if you do this now.
3) I think you can set local_rank to 0 for single GPU; it will clean your code up a bit.
4) I think it's more reasonable if batch_size is the batch size per GPU, as it's much easier for the user. I first planned to split the batch size, but chose not to later.
5) I think using spawn to create the processes is easier, as it's more user-friendly (no need to change the current commands). But you'll have to move the module-level variables into functions, because the dataloader workers re-execute them. This took a while to figure out.
I plan to update my code to be more legible, and use arguments to check whether it's distributed or not instead.
Edit: https://github.com/MagicFrogSJTU/yolov5/blob/96fa40a3a925e4ffd815fe329e1b5181ec92adc8/train.py#L432
I don't think this is very friendly.
Cleaned my code up, but after reading @MagicFrogSJTU 's fork, I see that you have done most of the heavy lifting already, so maybe I should close my issue and send a PR instead.
Could you enable Issues for your fork?
@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.
@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.
Cool!
I was just trying to add SyncBatchNorm. Now that you have begun the work, you can take this job!
Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations).
If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.
Let me know when you have done the job!
I thought it would work, so I made the PR. However, SyncBatchNorm conversion only works with torch.nn.parallel.DistributedDataParallel, not with apex.parallel. I think I'll have to change this, so I'm looking at how to deal with mixed_precision.
Edit: I just found out there's a SyncBatchNorm in apex as well. However, when I read around, someone suggested using torch.nn.parallel.DistributedDataParallel as it's more forward-compatible.
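For reference, this is roughly the conversion I mean with the native API (a sketch; local_rank is assumed to be set already):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Convert all BatchNorm layers to SyncBatchNorm *before* wrapping the model in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(local_rank)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```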
- I will take a look at your code!
- https://github.com/rwightman/pytorch-image-models/blob/master/train.py
- I agree with you, but let's keep the old way until DDP is correctly set up!
3 and 5. I am following others' best practices. spawn introduces heavy extra burden.
Keep in mind that apex will be removed from the code soon, as PyTorch is introducing native mixed-precision support in torch 1.6. It's already available in the nightly builds for testing, but I'm waiting for the stable 1.6 release before making this switch.
https://pytorch.org/docs/stable/amp.html
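A minimal sketch of that native API (model, optimizer, dataloader, and compute_loss are placeholders here, not the repo's exact code):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for imgs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in mixed precision
        loss = compute_loss(model(imgs), targets)
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()
```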
Got it.
@MagicFrogSJTU , is your branch up to date with your current work? I seem to get zero mAP while testing.
Plus, I got this warning:
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Yes, it is up to date.
Can you paste more of the logs?
I reduced the batch size and the warning is gone now. I was doing a quick test with SyncBatchNorm on coco128 first, to make sure there weren't any code errors. I plan to remove apex.parallel and use torch.nn.parallel instead. Will see how it goes.
Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations).
If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.
Hello @MagicFrogSJTU , I am very curious about this. How accurate is this as a measurement? Can you tell me how your current one performs right now?
| Branch | Model | GPU | Batch size (per GPU) | GPU memory (GB each) | First epoch | Second epoch | Sync Batch | Time for 2 epochs (h) | Last epoch @ mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ultralytics default | 5s | 1 | 64 | 8-11 | 0.013 | 0.0536 | No | 0.698 | - |
| Ultralytics default | 5s | 2 | 256/2 | 19-25 | 0.00477 | 0.0414 | No | - | 73 @ 0.439 |
| Ultralytics default | 5m | 1 | 64 | 20 | 0.0203 | 0.0798 | No | 0.776 | - |
| Ultralytics default | 5l | 1 | 64 | 30 | 0.025 | 0.0963 | No | 1.088 | - |
| My ddp branch | 5s | 2 | 128 | 21 | 0.000625 | 0.0104 | No | - | 101 @ 0.493 |
| Magic (torch) post-merge | 5s | 1 | 64 | 12 | 0.014 | 0.0624 | Yes | 0.688 | - |
| Magic (torch) post-merge | 5s | 2 | 64/2 | 6 | 0.00362 | 0.0587 | Yes | 0.466 | - |
| Magic (torch) post-merge drop-last | 5s | 2 | 64/2 | 6 | 0.0124 / 0.0109 | 0.055 / 0.0673 | Yes | 0.45 / - | - |
| Magic (torch) post-merge drop-last | 5m | 2 | 64/2 | 6 | 0.0193 | 0.0872 | Yes | 0.663 | - |
| Magic (torch) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00499 | 0.0437 | Yes | - | - |
| Magic (apex) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00531 | 0.0368 | Yes | - | - |
The reason I chose a high batch size was to run them at the highest batch size for speed. I am not sure if it affects performance, since the optimizer goes by batch size 64.
Edit: Updated table. Cells with multiple values show data from separate runs.
@NanoCode012 Updated.
Mine is
| branch | model | gpu | total batch size | first epoch mAP@0.5 | second epoch mAP@0.5 |
| --- | --- | --- | --- | --- | --- |
| default | v5s |1 | 64 | 0.0122 | 0.0654 |
| MagicFrog (DDP) | v5s | 2 | 64 | 0.00979 | - |
| MagicFrog (DDP) +dropLastForTrain | v5s | 2 | 64 | 0.0105 | - |
| MagicFrog (DP) | v5s | 4 | 64 | 0.0129 | - |
| MagicFrog (DDP) | v5s | 4 | 64 | 0.00626 | 0.0402 |
| MagicFrog (DP) | v5m | 4 | 64 | 0.0206 | 0.113 |
| MagicFrog (DDP) | v5m | 4 | 64 | 0.00778 | 0.0598 |
@NanoCode012
In my implementation, the first-epoch mAP is around 0.005. It should be around 0.01, as with the default single GPU.
I have checked the code multiple times and found nothing more to fix.
This is frustrating.
@MagicFrogSJTU , should we focus only on the first epoch? Would a larger number of epochs be a better benchmark?
Since you said it should be the same as single GPU (0.01) for the first epoch, I'll split my third run's GPU usage into two and test different variations.
I do agree that it's frustrating. When I checked the documentation and others' implementations, it's just setting up the process group, launching, calling .to(device), and wrapping in DDP.
Can you try another model and see how it fares? Maybe a bigger model works better for DDP?
Also, wouldn't it be proper to test single-GPU DDP against the default at different batch sizes?
@MagicFrogSJTU , should we focus only on the first epoch? Should a larger epoch be a better benchmark?
Since you said that it should be the same as single gpu (0.01) for first epoch. I’ll split my third run’s gpu usage into two and test out different variations.
Theoretically, the first epoch should be enough.
However, our target is to reproduce performance. So as long as the final epoch reproduces equal or higher performance, I think more epochs are okay.
You could let it continue training and see what performance the final epoch reaches.
Yeah, this should be easy. Now I am questioning whether there is some special implementation in the network or loss functions.
Can you try for another model and see how they fare? Maybe a bigger model could be better for DDP?
Also, wouldn’t it also be proper to test for Single GPU in DDP against the default in different batch sizes?
I will try v5m.
DDP with batch size 64 on 4 GPUs is like a single GPU with batch size 16 and accumulation 4 (which is what you get when running batch size 64 with the default code). This is why I am testing DDP with batch size 64 and 4 GPUs.
I will try v5m.
Can you please add a model column too? It may be easier to see.
I am also curious why our implementations of SyncBatchNorm (apex) get different results. I think I ran your code without changes to the apex part.
I have pushed my newest code. I am now only using torch.nn.parallel.DistributedDataParallel.
Cool. I am thinking of removing amp.scaled loss, did you do that?
Cool. I am thinking of removing amp.scaled loss, did you do that?
Why?
I have tried not using mixed_precision, but performance remains the same.
Oh I see. I didn't test it, but since apex is going to be phased out, we should try without it. But since performance is the same, we can keep it for now.
If you are available, I suggest you take a look at the network structure. I suspect there is something that can't be broadcast in DDP.
I am currently occupied with other things.
If you are available, I suggest you take a look at the network structure. I suspect there is something that can't be broadcast in DDP.
I am not confident in my ability to do so, but I will see. If there were things that could not be broadcast, I believe they would be listed in the documentation for us.
@MagicFrogSJTU , I added drop_last to test, and it actually reached 0.01 for the first epoch but dropped for the second. It could have been a fluke, but it shows that we shouldn't take only the first epoch as the goal.
I'm setting up a test run of 5m on 2 GPUs, and one more on the default as a benchmark.
@MagicFrogSJTU @NanoCode012 hi guys, nice table!! Unfortunately the non-deterministic nature of training is showing up here, making comparisons very difficult. I would say you should ignore epoch 1 mAP; it is a very noisy metric, and even for the same model with everything else identical it may vary +/-50% from one training run to the next. Epoch 2 mAP is probably better, but may still vary up to +/-20% in my experience.
I'm not really sure if larger models produce more stable mAPs early on.
Yes I think bs64 should be used for everything here.
LR changes will dramatically affect mAP. The default repo does not modify the LR for different batch sizes; instead it accumulates gradients differently, always trying to reach an effective batch size of 64. If you use --batch 8, for example, it will accumulate the gradient 8 times before each optimizer update. If you use --batch 64 or higher, it will run an optimizer update every batch.
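The accumulation logic is roughly this (a simplified sketch, not the exact train.py code; compute_loss stands in for the repo's loss):

```python
nominal_batch_size = 64
accumulate = max(round(nominal_batch_size / batch_size), 1)  # e.g. --batch 8 -> accumulate 8x

optimizer.zero_grad()
for i, (imgs, targets) in enumerate(dataloader):
    loss = compute_loss(model(imgs), targets)
    loss.backward()
    if (i + 1) % accumulate == 0:  # step only once the nominal batch size is reached
        optimizer.step()
        optimizer.zero_grad()
```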
@MagicFrogSJTU , I reran the drop_last run for 5s, and it gave good results. Maybe there were duplicate samples in the last batch causing the accuracy loss.
~There’s a function to broadcast buffers.~
~https://github.com/facebookresearch/ClassyVision/commit/16a66a85f58dacf725e11b1a3643178b4616e48d~
@glenn-jocher , what should our benchmark be then? 2 epoch ? 3? 5? 10?
Do you mean drop_last for the dataloader class?
And drop_last for the test dataloader, but not for the train dataloader?
What is the purpose of broadcasting buffers, and how does it work?
As for benchmarking, I suggest using the default LR and the default batch size (64), for 2 epochs.
Do you mean drop_last for the dataloader class?
And drop_last for test dataloader, but not for train data loader?
@MagicFrogSJTU Yes. I set drop_last = (train_sampler is not None). I sent a PR to your branch.
As for broadcast buffers, it's just something I read; it's said to sync parameters across GPUs, but I'm not sure if it does the same thing as SyncBatchNorm.
Edit: Sorry, when reading, I thought it was within DDP.py, but it was in another of FB's repos.
I added benchmark for running 5m.
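Concretely, the drop_last change is just this (sketch):

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        sampler=train_sampler,
                        num_workers=num_workers,
                        pin_memory=True,
                        drop_last=(train_sampler is not None))  # drop the uneven last batch only in DDP
```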
I trained with your drop_last change but get the same 0.006. I wonder if you have made other changes to the code?
Plus, by setting drop_last = (train_sampler is not None) you are setting it to True during training but not during testing.
For that run, I changed two things:
- drop_last
- init process to use TCP instead. The reason was that I am running 2-3 runs simultaneously, so I had to change the port they communicate on. (Could TCP be more accurate?)
~The reason I set drop_last, to my understanding, is that if the data divided among the GPUs isn't split evenly, the optimizer may run a step when each GPU has processed only (for example) one extra image in the last batch. We don't need to drop it for testing because testing runs on only one GPU, but we can try.~
Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler, but I don't have a GPU available.
I trained with your drop_last change but get the same 0.006. I wonder if you have made other changes to the code?
The result was for 5s on 2 GPUs. I haven't tested on 4 GPUs yet because none are available.
From my test experiments, the performance should be:
1 GPU = DDP 1 GPU > DDP 2 GPUs > DDP 4 GPUs.
drop_last may not affect the performance.
I have tested the code. It didn't gain better performance.
Plus, use
python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
to allow parallel trainings.
python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
Thanks. I misunderstood this at first. You mean DP runs, right? Or do you mean two different trainings at the same time?
From my table, 1-GPU DDP and 2-GPU DDP are almost equal to each other and to the default repo's 1 GPU.
I have tested the code. It didn't gain better performance.
I would like to re-run without it to check, but I want my current runs to reach 300 epochs for once.
I mean DDP. You don't have to run in TCP mode. But anyway, as long as you can run in parallel now.
That's weird. You mean 1 GPU = DDP 1 GPU = DDP 2 GPUs > DDP 4 GPUs?
From the table above, the values are quite close. DDP on 1 GPU = 2 GPUs, but with reduced time (look at the first and third rows). I do not have DDP 4-GPU numbers yet.
I am taking a 2-gpu run to try reproducing your results.
I cloned your feature branch and changed the two things I noted above. That's all. I also checked with git status and git diff to be sure.
TCP is not needed; it is actually the same as the original env:// way.
I couldn't reproduce your results. See my table above.
The DDP 2-GPU run without drop_last also gets a high score.
Interesting. I won't be able to try reproducing my results today as I am still training (I want to see performance in the long run). I think drop_last is not bad either; I see many places using it.
Edit: I attached the results.txt for those two runs below: 5s running 2 epochs on 2 GPUs with coco2017. Can you please check if something is off?
# no drop
0/1 4.89G 0.09067 0.09974 0.1115 0.3019 61 640 0.02479 0.003097 0.003624 0.0009761 0.086 0.09304 0.08909
1/1 5.32G 0.07448 0.1009 0.0743 0.2497 95 640 0.1698 0.06755 0.06125 0.02524 0.07035 0.08714 0.06414
# drop first time
0/1 5.47G 0.09052 0.0998 0.1116 0.302 475 640 0.06563 0.01121 0.01243 0.003548 0.08014 0.09001 0.08033
1/1 5.35G 0.0744 0.1009 0.07411 0.2494 306 640 0.1528 0.08374 0.05756 0.02394 0.06993 0.08806 0.06423
# drop second time
0/299 5.47G 0.09078 0.09982 0.1124 0.303 475 640 0.04087 0.007801 0.01088 0.003195 0.08014 0.08866 0.08194
1/299 5.35G 0.07564 0.09939 0.07276 0.2478 306 640 0.1644 0.08547 0.06728 0.02641 0.06972 0.08669 0.06268
Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler.
Do you think this will make a difference?
The DDP 2gpu without droplast also get high score.
Does this mean the issue only occurs with 4 GPUs?
drop_last can be safely added.
It is weird that DDP works with 2 GPUs but not 4 GPUs. I tend to believe this is a fluctuation.
Let's see your long-run performance.
Use this:
single_gpu default code
| epoch | mAP |
| --- | --- |
|1 | 0.0115/0.00382 |
|5 | 0.245/0.126 |
|10 | 0.341/0.19 |
|50 | 0.458/0.275 |
Result as of now:
Magic 4 GPU Pre-merge Total Batch Size = 64
| epoch | mAP @0.5 |
| --- | --- |
| 0 | 0.00559 |
| 5 | 0.194 |
|10 | 0.259 |
| 50 | 0.392 |
| 100 | 0.437 |
| 150 | 0.464 |
| 170 | 0.473 |
Magic 2 GPU Post-merge Drop_last Total Batch Size = 64
| epoch | mAP @0.5 |
| --- | --- |
| 0 | 0.0109 |
| 5 | 0.27 |
|10 | 0.336 |
| 50 | 0.454 |
| 100 | 0.488 |
The 2-GPU run looks very similar to the 1-GPU default. Do you have numbers for higher epochs too?
Sorry, no.
The 2-GPU DDP looks good! I think epoch 50 is enough!
Please train a 4-GPU DDP run with drop_last!
Hmm, I think glenn can provide it for us. The chart in the README doesn't give the numbers, unfortunately. I will stop my current 4-GPU run because it doesn't look good any more.
Don't worry. Epoch 50 is long enough.
@NanoCode012
Just to keep you posted:
I am currently working on https://github.com/pytorch/pytorch/issues/41101
results_YOLOv5s.txt
results_YOLOv5m.txt
I've attached 5s and 5m results.txt for official weights.
50 epochs should be more than enough for comparison. Make sure to use python train.py --epochs 300 and CTRL+C at epoch 50 though rather than using python train.py --epochs 50. The second command will give you much better results at epoch 50, as the LR scheduler runs fully.
What is drop_last? We can't be dropping any batches from the training or testing (!).
Ok. I've reset it to run without it. Apologies for my incorrect assumption.
Edit: added tables.
Some git error caused the run to be paused, and I didn't notice till now.
| Model | Epoch 1 | Epoch 2 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
| --- | --- | --- | --- | --- | --- | --- |
| Default | 0.01011 | 0.05264 | 0.2201 | 0.3411 | 0.3907 | 0.4519 |
| Magic 5s 2 GPU | 0.0123 | 0.0696 | 0.239 | 0.334 | 0.397 | 0.455 |
| Magic 5s 4 GPU | 0.00703 | 0.0463 | 0.168 | 0.253 | 0.326 | 0.3922 |
| Magic 5s 4 GPU Torch 1.6 | 0.004 / 0.00613 | 0.0387 / 0.0428 | - / 0.165 | - / 0.252 | - | - |
| Magic 5s 8 GPU Torch 1.6 | 0.003761 | 0.02131 | 0.08683 | 0.1417 | 0.2052 | - |
We see that drop_last was not the reason for the 2-GPU accuracy; it was just like that anyway. However, what's confusing is that if it worked for 2, why didn't it work for 4?
A warning I got on first epoch.
0/299 3.64G 0.09153 0.1 0.1131 0.3047 51 640
python3.7/site-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
I fixed this in the latest branch.
It is weird that it didn't work!
My assumption is that there is an error growing exponentially with the GPU count.
I fixed this in the latest branch.
Thanks! I saw it.
My assumption is that there is an error growing exponentially with the GPU count.
But why does 2-GPU DDP slightly outperform the 1-GPU default if there is an error?
Edit: 2-GPU DDP is starting to even out with the 1-GPU default now.
I have a question. What is your running environment? In particular, your PyTorch version? Are you using PyTorch 1.6.0?
In particular, your PyTorch version?
I'm following the README, which says the minimum version is 1.5; mine is 1.5.1, running in a conda env with Python 3.7.
Waiting on epoch 50 for both runs.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.
There is a possibility that 4-GPU DDP will work with PyTorch 1.6.
I can't do that, because I can't change the NVIDIA driver on my machine, and pytorch:20.06 needs the latest NVIDIA driver.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.
Hi, I have never used Docker before, so may I ask a few questions?
1) I do not see a version for PyTorch 1.6 on Docker Hub. Do you mean to build it myself from source?
2) The PyTorch nightly build via conda is version 1.7.
Use the Dockerfile in the yolov5 repo. You can see that it is built from nvcr.io/nvidia/pytorch:20.06-py3.
@MagicFrogSJTU , hello, I built the Docker image and ran it under compatibility mode for the NVIDIA drivers. I updated the results above. I'm going to re-run again to be sure. (I checked that it's PyTorch 1.6.0a0+9907a3e.)
It is getting worse. Damn!
Could be the randomness. The second time was much better. Gonna get it to epoch 5 to see.
I don't have a clue now. Emm... What do you think?
I have no clue. Could be that 2 GPU is the limit?
Weeks ago, I trained a BERT model with 8-GPU DDP. Although I didn't train a 1-GPU model to verify its correctness, I don't think there is a limit of 2 GPUs. That would be too silly.
I have just trained a model with the nightly-built PyTorch (1.7). Got 0.00713 for epoch 1. Damn.
Yep, that's why I say I have no clue. If the 2-GPU run were also performing poorly, we could say that something went wrong, but it's not, which is what makes this confusing. Hmm, how about 8 GPUs? Can you try? How would it perform?
Edit: I added my test for 8.
I suggest we train a whole 300 epochs with DDP on 4 GPUs.
If the network converges to the same accuracy in the end, let's close this problem and leave the rest for the future. This is too tiring.
If not, we just leave it to Glenn to decide.
What do you think?
Sure. I updated the table for 8 GPUs. It seems that more GPUs decrease accuracy at the start. We should see how long it takes for them to converge (if they do, that is).
@MagicFrogSJTU , hello, my test for 4 GPUs is done. It took 41.485 hours for 300 epochs, which doesn't seem right, and the results did not converge at the end. See the graph below for a comparison between the official results and mine.
[plot: mAP comparison, official results vs my 4-GPU DDP run]
How did yours go?
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
EDIT: if a T4 takes 6 days by itself, and you are at less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there:
https://github.com/ultralytics/yolov3#speed
EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (0.6-0.7) vs the 1.0 in master, which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005), and is definitely not the source of the difference.
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
Yes, it is.
Order of init: Model > SyncBatch > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatch.)
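For reference, the order in my run looks roughly like this (Model and ModelEMA are this repo's classes; the import paths and local_rank handling are simplified/assumed):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from models.yolo import Model           # repo class (import path assumed)
from utils.torch_utils import ModelEMA  # repo class (import path assumed)

model = Model(opt.cfg).to(device)                                # 1. build model
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)     # 2. SyncBatchNorm
ema = ModelEMA(model) if local_rank in [-1, 0] else None         # 3. EMA (rank 0 only)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # 4. DDP wrap last
```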
41 hours sounds a bit high for 4 gpu training, depending on the gpus of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drap :(
EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4's to take less than 2 days. The latest multi-gpu benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there:
https://github.com/ultralytics/yolov3#speed
It was run on 4 V100s. That is the part that confused me. I ran without notest and nosave. ~I could not recall precisely the time per epoch. It was about 12-17 minutes for training and about 7-10 minutes for testing.~ Edit: Table below.
I am thinking whether I will re-run for 2 GPU since it produced good results for <50 epoch runs.
Yes, it is.
Order of init: Model>SyncBatch>Ema>DDP (If this is wrong, it might be cause. It was originally Ema>SyncBatch.)
I have tried training with both Model>SyncBatch>Ema>DDP and Model>Ema>SyncBatch>DDP, and got similar results.
By the way, I have even tried training without EMA, and got similar results. @NanoCode012 You may want to give it a try, because there is a chance that I did it incorrectly.
DDP is now working on 2 GPUs but not on 4 GPUs: 2 GPUs produce similar mAP, while 4 GPUs get worse results.
@NanoCode012 and I have done many tests to try to find the source of the difference, but failed.
This seems strange, and I don't have a clue now.
Oh boy, yes, if this was with 4 V100s then there is definitely a problem somewhere. I don't know what the yolov5s time should be on a single V100, since I use a T4 to train 5s (V100 for m, l, x), but I know testing time alone should only be about 1 minute per epoch (certainly no more than 2 minutes). I'll post a screenshot here; this is a current GCP VM with one V100 training yolov5m.yaml, all default settings (I'm retraining all models with a few tweaks this week).
[screenshot: single-V100 yolov5m training progress]
Hello, I just did a quick run to check speed. I was way off the mark; I think I was remembering a time from testing another model.
Now, I also set OMP_NUM_THREADS=1, as recommended by the launcher's warning:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Using Magic feature branch,
| Num GPU | Train min per epoch | Train iter per second | Test min per epoch | first epoch mAP |
| --- | --- | --- | --- | --- |
| 1 | 9-11 | 2.8-3 | 1:08 | 0.0125 |
| 2 | 8-9 | 3.4-3.6 | 1:14 | 0.00988 |
| 4 | 6-7 | 4.3-4.5 | 1:12 | - |
@MagicFrogSJTU , since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
@NanoCode012 oh, those are much faster. Ok, that makes more sense. So the problem is not the speed; the problem lies in reproducing the mAP.
EDIT: Also, the speed multiple is not as high as it could be, since I assume you are keeping the batch size fixed. In practice you'd probably want to increase your batch size linearly with your GPU count to take advantage of the extra GPU RAM.
@MagicFrogSJTU , since DataParallel worked quite well for 1-4 GPU, why don't we use that instead of DDP?
Because, theoretically, DDP is much faster.
Well, right now we have no idea where the issue lies.
Why don't we clean up the DP code and PR it into the main branch? This would be an improvement for the current repo. Then maybe we can close this issue until someone finds out why DDP fails and fixes it, with our issues as a guide.
@NanoCode012 @MagicFrogSJTU I started new 1x and 2x T4 GPU training runs for yolov5s using the current default DDP code (no SyncBatchNorm). Blue is 1x, orange is 2x. The epoch times are 29 min single GPU, 19 min double GPU, both using the same exact command (--batch 64). The current difference between the two is only 0.001 mAP around epoch 40.
[plot: yolov5s mAP, 1x vs 2x T4]
@glenn-jocher Cool, nice!
Should we clean up the code and make a pull request, as @NanoCode012 said?
We may leave the 4-GPU problems for the future, since they seem quite difficult to resolve in the near term.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU yes, if you can clean up the code and consolidate the changes into a PR, that would be good. Make sure you test the updated multi-GPU code against the current multi-GPU code to compare to the current baseline. I think 30 epochs out of 300 is probably enough (as in my example above); I cancelled that training after making the plot because it's obvious it's very close. I will try 4x T4 if I can.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU , from your past results with DP (https://github.com/ultralytics/yolov5/issues/264#issuecomment-654809508), we see that the accuracy is similar to the main branch, and you said it was faster as well. Since it is stable for 1-4 GPUs in your results, I feel it is better to use it.
Your DDP is quite experimental right now (only 1-2 GPUs), so I am not sure it is appropriate to add yet, as some people might be confused when using >= 4 GPUs. Of course, this is all up to glenn.
I have two runs going right now: one for DDP on the main repo, the other setting the main repo to use DP. I wanted to see if there are any benefits in accuracy and time for 2 GPUs. Right now, they perform similarly.
| Type | Epoch 1 | Time per epoch |
| --- | --- | --- |
| DP | 0.011 | 11:50 |
| DDP | 0.0124 | 11:55 |
I just removed init_process and changed torch.nn.parallel.DistributedDataParallel to torch.nn.DataParallel to enable DP in the main repo.
Please tell me what you've decided, and I can help add changes to your code/running it.
DP is already set up in the original code. No change needed, actually...
Ah, I see. I was just wondering because, since DP and DDP are implemented differently, I wanted to test if there were noticeable differences between DP and single-process DDP.
Or was I confused about what you meant? Were you calling single-process DDP "DP"? I was under the assumption they were different.
Sorry, my bad.
The original code doesn't implement DP.
DP is activated by model = torch.nn.DataParallel(model),
while DDP by model = DDP(model, device_ids=[local_rank], output_device=local_rank).
In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated across different GPUs) is implemented as DP internally in PyTorch.
@NanoCode012
I want to rebase the commits to merge them into one, because there are too many of them.
Let's make two commits, one for each of us, so that both of us have the honor of contributing! What do you think?
In fact, single process DDP with multiple gpus (if network are not explicitly allocated on difference gpus) is implemented as DP internally in pytorch.
Thanks for clarification.
I want to rebase the commits to merge into one, because there are too many of it.
Let's make two commits, each of us get one! So that all of us can have the honor of contributions! What do you think?
Sure! That'll be great. Can we first update Feature/DDP-fixed to the latest repo version, so we compare from the same level before we rebase? There are a lot of changes in the main repo since we last branched off. Then I'll run it to benchmark.
Yes. My plan is: I will do the merge with master!
Hello glenn, do you have an updated unit-test script? The one you gave before does not work with weights/last.pt, since it was moved to the runs directory.
@NanoCode012 ah yes, you are correct! This is the latest. The typical use case is to drop this into a single command, i.e. put it in unit_tests.sh and then run bash unit_tests.sh; an exit code of 0 means passing.
I'm trying to automate all of this as part of the CI pipeline, using GitHub Actions for example, but for now it's a bit of a manual nightmare; we just run this in Colab as often as possible to make sure recent commits haven't broken anything on single-GPU or CPU.
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py. files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
python train.py --weights $x.pt --cfg $x.yaml --epochs 3 --img 320 --device 0,1 # train
for di in 0,1 0 cpu # inference devices
do
python detect.py --weights $x.pt --device $di # detect official
python detect.py --weights runs/exp0/weights/last.pt --device $di # detect custom
python test.py --weights $x.pt --device $di # test official
python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
done
python models/yolo.py --cfg $x.yaml # inspect
python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
While running the default code on 4 GPUs, I am quite confused why only the first GPU uses a large amount of memory.
[screenshot: GPU memory usage across the 4 GPUs]
Furthermore, the speed of 4 GPUs vs 1 GPU is quite similar. I set both to run at the same time, and they are on the same epoch (34). It could be my CPU bottlenecking, though.
The 4-GPU DP should be faster. GPU 0 using more memory in DP is normal.
Fixed in #401