Hello, @glenn-jocher
Based on your advice, multi-process DistributedDataParallel should be better than the single-process setup we have now.
I have been trying to implement it on the ddp branch of my fork (apologies that it's messy). However, I've run into many issues.
Since this is still in testing, I haven't accounted for the device being cpu yet.
What I did so far
Added a setup method to call init_process_group and set the torch.cuda device (rough sketch after this list)
Called torch.multiprocessing.spawn on the modified train function
Added a new world_size argument to pass when running the script (we can change this to counting the number of devices later)
Added condition checks so that only one process downloads the weights file, removes batch.jpg, and saves checkpoints
~Added dist.barrier() so the other processes wait for the first process to finish its job~
~Replaced all .to(device) calls with .to(rank) for each process.~
~Changed map_location for loading weights.~
Added more parameters to the train function because the spawned processes cannot see the global variables
Added a DistributedSampler so each GPU's process gets a different subset of the dataset
~Turned off tensorboard, as I needed to pass tb_writer to train as an argument to be able to use it.~
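For reference, here is a rough, minimal sketch of that setup (the function names and env-var defaults are illustrative only, not the exact code in my branch):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Placeholder address/port; any free port works for single-node training.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)
    # ... build the model on this rank, wrap it in DDP, run the epoch loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # mp.spawn passes the process index (rank) as the first argument to train()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```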
Things to fix
Do not split the validation set across processes in create_dataloader
Remove the need to pass world_size as an argument just to request multi-process training
~Cleaning up~
Fix the inconsistent console output (all processes printing at once makes it hard to track)
~Enable tensorboard again~
Split batch_size / learning rate / epochs across multiple GPUs
Figure out why the module-level globals are re-executed (I disabled print(hyp) because of this)
Problems
Since I am still learning, it is very likely I broke the training. The information learned each epoch may not be shared between the processes, because when I tested training the mAP stayed at 0. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it.
Saving checkpoints is done only by the first process, as concurrent saving from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct approach.
I am testing it, and it is much slower than using just one GPU, but I figure that if the training is fixed, it can be a good boost for multi-GPU.
I also understand that this isn't high on your priority list, but some guidance would be appreciated. Thank you.
Hello @NanoCode012, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
Hello, I think I set it up properly according to many examples from PyTorch's official docs as well as others' implementations of DDP. However, it is a lot slower than running on a single GPU (I tried 2 GPUs for now). Also, the mAP stays at 0 throughout. I am not sure why specifically.
Furthermore, I notice the global variables being re-executed when we enumerate(dataloader). This could be the cause of the slowdown.
I'm just passing by, but DDP should be faster (and IS faster in my runs) than DP. You probably missed something; it also depends on how you launch it.
Check my working DDP train.py for classification; maybe you would notice the difference from yours. My implementation can be run on 1 GPU simply by calling python3 train.py. If you want DDP, the correct way to launch it is python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS. There is absolutely no need to pass anything else explicitly. launch sets the environment variables for you, and you can read them anywhere in the script using something like:
```python
import os

def env_world_size():
    # WORLD_SIZE is set by torch.distributed.launch; default to 1 for single-GPU runs
    return int(os.environ.get("WORLD_SIZE", 1))

def env_rank():
    # RANK is set by torch.distributed.launch; default to 0 for single-GPU runs
    return int(os.environ.get("RANK", 0))
```
Thanks for checking it out. In my main function, I use multiprocessing.spawn to create N processes (1 per GPU). I believe they are equivalent. I will look over your code.
Something weird I noticed is what I mentioned:
Furthermore, I notice the global variables being re-executed when we enumerate(dataloader).
I added a print("Test") outside of every function and noticed it being called 8 times per GPU/process. Do you know of any reason why that may happen? (I believe 8 is the number of workers passed to the DataLoader.)
Below is the line that causes the problem.
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/train.py#L240
where dataloader comes from,
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/utils/datasets.py#L44-L73
EDIT: I just ran a quick toy model and DDP easily beat DP as expected. I guess something in the code is messing it up.
@bonlime , something I've noticed is that you mentioned we should create the EMA before wrapping our model in DDP. However, yolov5 does the opposite: the EMA is created after the DDP wrap. Do you think it is related?
https://github.com/bonlime/sota_imagenet/blob/2fb3e46a82fbf9d767df75f3ba4fd6d8517cd567/train.py#L246-L249
https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L161
https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L196
Actually, one more thing: the single-process DDP that's already in the original code does not seem to be any faster in my tests than a single GPU.
Regarding the order of EMA and DDP - it's an implementation-specific issue. My version would probably fail if used after DDP. I don't think the order would cause any slowdown, but you could test by commenting it out.
Regarding single-process DDP - I don't really understand what you mean. Single-process DistributedDataParallel is just DataParallel, isn't it? Why would you expect it to be faster?
Uhm, I am not sure if they are the same, but they are listed under two different docs.
https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html
I think you're right that they work similarly, but maybe there's some difference in the implementation?
Btw, do the processes regulate GPU memory usage (i.e., should it be the same across GPUs)? My first GPU takes 25 GB of memory, whereas the second takes 13 GB. Then they randomly swap.
Also, it appears the training works, but performance is very poor.
Single GPU (original code): mAP starts rising from the 12th epoch.
Dual GPU (my code): mAP only starts to crawl up from the 30th epoch.
DP is very different from DDP as the docs clearly show.
I've trained a lot of models using DDP and never faced performance issues. It all depends on the implementation though. Try checking some other codebases to understand how to make DDP work and to avoid bugs. I'm pretty sure the issue is some silly bug somewhere 🙃
About GPU memory: for me, the rank 0 process usually has slightly larger memory consumption (by 1-2 GB). After 1 epoch, memory consumption doesn't really change.
@bonlime , so it would be weird that my memory usage is so different and swaps every epoch, right?
Also, I checked out multiple PyTorch examples from the official docs and others' GitHub repos.
The main things are:
Set up the init process group
Set the CUDA device: .to() for tensors and set_device()
Set map_location
Set up mp.spawn
Use DistributedSampler
Call set_epoch on the train sampler each epoch (rough sketch below)
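A rough sketch of the sampler-related steps, assuming the process group is already initialized (dataset, batch_size, and epochs are placeholders):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# DistributedSampler reads rank/world_size from the initialized process group,
# so each process sees a different shard of the dataset.
train_sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        sampler=train_sampler,  # mutually exclusive with shuffle=True
                        num_workers=8,
                        pin_memory=True)

for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
    for imgs, targets in dataloader:
        ...  # forward/backward as usual
```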
Please tell me if I missed anything
I tested my current branch against yours (taken after my ema patch) with COCO 2017 for 10 epochs, to test speed using two GPUs on yolov5s.
python train.py --weights "" --data coco.yaml --cfg "yolov5s.yaml" --epochs 10 --img 640 --device 0,1 --batch-size 128 --nosave
Here are the results.
My branch
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 21.4G 0.06038 0.09437 0.04652 0.2013 64 640
8/9 22.3G 0.06021 0.09389 0.04625 0.2004 64 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.238 0.321 0.234 0.116
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 22.3G 0.05955 0.09358 0.04501 0.1981 123 640
9/9 21.4G 0.05959 0.09371 0.0452 0.1985 55 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.245 0.334 0.248 0.126
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.134
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.254
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.128
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.152
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.168
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.171
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.175
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.402
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.487
Optimizer stripped from weights/last_.pt
10 epochs completed in 1.638 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5 mAP@.5:.95
all 128 929 0.258 0.359 0.315 0.176
Speed: 3.2/3.3/6.4 ms inference/NMS/total per 640x640 image at batch-size 32
My Patch-1 branch (Ema-patch)
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 14.9G 0.05715 0.09235 0.03983 0.1893 202 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.292 0.415 0.331 0.182
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 14.9G 0.05591 0.09165 0.03808 0.1856 204 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.291 0.439 0.348 0.194
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.355
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.209
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.230
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.214
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.379
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.429
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.228
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.475
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.568
Optimizer stripped from weights/last.pt
10 epochs completed in 2.302 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5 mAP@.5:.95
all 128 929 0.267 0.441 0.401 0.246
Speed: 2.1/2.3/4.3 ms inference/NMS/total per 640x640 image at batch-size 32
From these results, we can see that it's a lot faster than single-process DDP now. The main drawback, however, is accuracy. I am not sure what the problem is. From what I read, multi-process DDP automatically syncs gradients, so all the values should be the same at the end, and I should not need to modify the code significantly.
I also do not know why the load on the GPUs is so different. I am thinking it could be related to each GPU creating its own dataloader in multi-process mode, compared with all GPUs sharing one dataloader in single-process mode.
Do you have any opinions on this @glenn-jocher ? I am now running single GPU on both branches to benchmark it.
Should I run a full 300 epoch? Should I change model?
I have been working on DDP improvements since a week ago!
See issue #177!
There is a lot of original code that has to be revised to make DDP work, because the original codebase is complicated!
You have to make sure that everything is synchronized across the processes!
I will make a pull request very soon if my next experiment comes out well. See the code then!
By the way, I found that training for 10 epochs is not quite enough to analyze performance, unless you use SyncBN! BatchNorm will be a problem early in training.
In case you are in a hurry to use DDP, see my fork.
Tests are still running, but I think this will be the final version if SyncBN is not added.
If you have GPU resources, you can also help run the test!
```bash
python -m torch.distributed.launch --nproc_per_node 4 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1,2,3
```
@MagicFrogSJTU
Ohh, I saw your thread, but I mistook it for an issue in another repo. I will check it out. Can you tell me what you've changed so we can compare notes?
By the way, I found that training 10 epochs are not quite enough to analyze the performance. Unless you use SyncBN!
Yes, I want to do this, but I'm not sure where in the code it is.
So far, my small single-GPU experiment is done.
My branch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05582 0.0913 0.03765 0.1848 194 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.294 0.437 0.35 0.194
10 epochs completed in 2.456 hours.
For some reason, there were semaphore errors despite running in a single process.
Traceback (most recent call last):
File "python3.7/multiprocessing/util.py", line 277, in _run_finalizers
finalizer()
File "python3.7/multiprocessing/util.py", line 201, in __call__
res = self._callback(*self._args, **self._kwargs)
File "python3.7/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 33 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:156: UserWarning: semaphore_tracker: '/mp-b3j04ac7': [Errno 2] No such file or directory
warnings.warn('semaphore_tracker: %r: %s' % (name, e))
Branch from ema-patch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05592 0.09151 0.03777 0.1852 204 640
Class Images Targets P R mAP@.5 mAP@.5:.95
all 5e+03 3.63e+04 0.289 0.435 0.348 0.194
10 epochs completed in 2.200 hours.
This explains why my earlier multi-GPU single-process test took half the load: the load was shared between the two GPUs.
@MagicFrogSJTU , hello, I looked over your code. It is really nice. There are a few things I would like to add.
1) There was an update to torch.utils, so ema is cleaner now.
2) Do you have an article on where to initialize ema? ~I also see that you only allow 1 process to go through ema. That's something I was thinking of, because it was redundant to do multiple deep copies. You beat me right to it!~ Maybe I misread. Not sure if you do this now.
3) I think you can set local_rank to 0 for single GPU; it will clean your code up a bit.
4) I think it's more reasonable if batch_size is the batch size per GPU, as it's much easier for the user. I first planned to split the batch size, but chose not to later.
5) I think using spawn to create the processes is easier, as it's more user-friendly (no need to change the current commands). But you'll have to move the module-level variables into functions, because the dataloader workers re-execute them. This took a while to figure out.
I plan to update my code to be more legible, and use arguments to check whether it's distributed or not instead.
Edit: https://github.com/MagicFrogSJTU/yolov5/blob/96fa40a3a925e4ffd815fe329e1b5181ec92adc8/train.py#L432
I don't think this is very friendly.
Cleaned my code up, but after reading @MagicFrogSJTU 's fork, I see that you have done most of the heavy lifting already, so maybe I should close my issue and send a PR instead.
Could you enable Issues for your fork?
@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.
@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.
Cool!
I was just trying to add SyncBatchNorm. Now that you have begun the work, you can take this job!
Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations).
If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.
Let me know when you have done the job!
I thought it would work, so I made the PR. However, SyncBatchNorm conversion only works with torch.nn.parallel.DistributedDataParallel, not with apex.parallel. I think I'll have to change this, so I'm looking at how to deal with mixed_precision.
Edit: I just found out there's a SyncBatchNorm in apex as well. However, when I read around, someone suggested using torch.nn.parallel.DistributedDataParallel as it's more forward-compatible.
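For reference, this is roughly the conversion I mean with the native API (a sketch; local_rank is assumed to be set already):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Convert all BatchNorm layers to SyncBatchNorm *before* wrapping the model in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(local_rank)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```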
- I will take a look at your code!
- https://github.com/rwightman/pytorch-image-models/blob/master/train.py
- I agree with you, but let's keep the old way until DDP is correctly set up!
3 and 5. I am following others' best practices. spawn introduces heavy extra burden.
Keep in mind that apex will be removed from the code soon, as PyTorch is introducing native mixed-precision support in torch 1.6. It's already available in the nightly builds for testing, but I'm waiting for the stable 1.6 release before making this switch.
https://pytorch.org/docs/stable/amp.html
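A minimal sketch of that native API (model, optimizer, dataloader, and compute_loss are placeholders here, not the repo's exact code):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for imgs, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in mixed precision
        loss = compute_loss(model(imgs), targets)
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscales gradients, then steps
    scaler.update()
```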
Got it.
@MagicFrogSJTU , is your branch up to date with your current work? I seem to get zero mAP while testing.
Plus, I got this warning:
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Yes, it is up to date.
Can you paste more of the logs?
I reduced the batch size and the warning is gone now. I was doing a quick test with SyncBatchNorm on coco128 first, to make sure there weren't any code errors. I plan to remove apex.parallel and use torch.nn.parallel instead. Will see how it goes.
Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations).
If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.
Hello @MagicFrogSJTU , I am very curious about this. How accurate is this as a measurement? Can you tell me how your current one performs right now?
| Branch | Model | GPU | Batch size (per GPU) | GPU memory (GB each) | First epoch | Second epoch | Sync Batch | Time for 2 epochs (h) | Last epoch @ mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ultralytics default | 5s | 1 | 64 | 8-11 | 0.013 | 0.0536 | No | 0.698 | - |
| Ultralytics default | 5s | 2 | 256/2 | 19-25 | 0.00477 | 0.0414 | No | - | 73 @ 0.439 |
| Ultralytics default | 5m | 1 | 64 | 20 | 0.0203 | 0.0798 | No | 0.776 | - |
| Ultralytics default | 5l | 1 | 64 | 30 | 0.025 | 0.0963 | No | 1.088 | - |
| My ddp branch | 5s | 2 | 128 | 21 | 0.000625 | 0.0104 | No | - | 101 @ 0.493 |
| Magic (torch) post-merge | 5s | 1 | 64 | 12 | 0.014 | 0.0624 | Yes | 0.688 | - |
| Magic (torch) post-merge | 5s | 2 | 64/2 | 6 | 0.00362 | 0.0587 | Yes | 0.466 | - |
| Magic (torch) post-merge drop-last | 5s | 2 | 64/2 | 6 | 0.0124 / 0.0109 | 0.055 / 0.0673 | Yes | 0.45 / - | - |
| Magic (torch) post-merge drop-last | 5m | 2 | 64/2 | 6 | 0.0193 | 0.0872 | Yes | 0.663 | - |
| Magic (torch) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00499 | 0.0437 | Yes | - | - |
| Magic (apex) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00531 | 0.0368 | Yes | - | - |
The reason I chose a high batch size was to run them at the highest batch size for speed. I am not sure if it affects performance, since the optimizer goes by batch size 64.
Edit: Updated table. Cells with multiple values show data from separate runs.
@NanoCode012 Updated.
Mine is
| branch | model | gpu | total batch size | first epoch mAP@0.5 | second epoch mAP@0.5 |
| --- | --- | --- | --- | --- | --- |
| default | v5s |1 | 64 | 0.0122 | 0.0654 |
| MagicFrog (DDP) | v5s | 2 | 64 | 0.00979 | - |
| MagicFrog (DDP) +dropLastForTrain | v5s | 2 | 64 | 0.0105 | - |
| MagicFrog (DP) | v5s | 4 | 64 | 0.0129 | - |
| MagicFrog (DDP) | v5s | 4 | 64 | 0.00626 | 0.0402 |
| MagicFrog (DP) | v5m | 4 | 64 | 0.0206 | 0.113 |
| MagicFrog (DDP) | v5m | 4 | 64 | 0.00778 | 0.0598 |
@NanoCode012
In my implementation, the first-epoch mAP is around 0.005. It should be around 0.01, as with the default single GPU.
I have checked the code multiple times and found nothing more to fix.
This is frustrating.
@MagicFrogSJTU , should we focus only on the first epoch? Would a larger number of epochs be a better benchmark?
Since you said it should be the same as single GPU (0.01) for the first epoch, I'll split my third run's GPU usage into two and test different variations.
I do agree that it's frustrating. When I checked the documentation and others' implementations, it's just setting up the process group, launching, calling .to(device), and wrapping in DDP.
Can you try another model and see how it fares? Maybe a bigger model works better for DDP?
Also, wouldn't it be proper to test single-GPU DDP against the default at different batch sizes?
@MagicFrogSJTU , should we focus only on the first epoch? Should a larger epoch be a better benchmark?
Since you said that it should be the same as single gpu (0.01) for first epoch. I’ll split my third run’s gpu usage into two and test out different variations.
Theoretically, the first epoch should be enough.
However, our target is to reproduce performance. So as long as the final epoch reproduces equal or higher performance, I think more epochs are okay.
You could let it continue training and see what performance the final epoch reaches.
Yeah, this should be easy. Now I am questioning whether there is some special implementation in the network or loss functions.
Can you try for another model and see how they fare? Maybe a bigger model could be better for DDP?
Also, wouldn’t it also be proper to test for Single GPU in DDP against the default in different batch sizes?
I will try v5m.
DDP with batch size 64 on 4 GPUs is like a single GPU with batch size 16 and accumulation 4 (which is what you get when running batch size 64 with the default code). This is why I am testing DDP with batch size 64 and 4 GPUs.
I will try v5m.
Can you please add a model column too? It may be easier to see.
I am also curious why our implementations of SyncBatchNorm (apex) get different results. I think I ran your code without changes to the apex part.
I have pushed my newest code. I am now only using torch.nn.parallel.DistributedDataParallel.
Cool. I am thinking of removing amp.scaled loss, did you do that?
Cool. I am thinking of removing amp.scaled loss, did you do that?
Why?
I have tried not using mixed_precision, but performance remains the same.
Oh I see. I didn't test it, but since apex is going to be phased out, we should try without it. But since performance is the same, we can keep it for now.
If you are available, I suggest you take a look at the network structure. I suspect there is something that can't be broadcast in DDP.
I am currently occupied with other things.
If you are available, I suggest you take a look at the network structure. I suspect there is something that can't be broadcast in DDP.
I am not confident in my ability to do so, but I will see. If there were things that could not be broadcast, I believe they would be listed in the documentation for us.
@MagicFrogSJTU , I added drop_last to test, and it actually reached 0.01 for the first epoch but dropped for the second. It could have been a fluke, but it shows that we shouldn't take only the first epoch as the goal.
I'm setting up a test run of 5m on 2 GPUs, and one more on the default as a benchmark.
@MagicFrogSJTU @NanoCode012 hi guys, nice table!! Unfortunately the non-deterministic nature of training is showing up here, making comparisons very difficult. I would say you should ignore epoch 1 mAP; it is a very noisy metric, and even for the same model with everything else identical it may vary +/-50% from one training run to the next. Epoch 2 mAP is probably better, but may still vary up to +/-20% in my experience.
I'm not really sure if larger models produce more stable mAPs early on.
Yes I think bs64 should be used for everything here.
LR changes will dramatically affect mAP. The default repo does not modify the LR for different batch sizes; instead it accumulates gradients differently, always trying to reach an effective batch size of 64. If you use --batch 8, for example, it will accumulate the gradient 8 times before each optimizer update. If you use --batch 64 or higher, it will run an optimizer update every batch.
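The accumulation logic is roughly this (a simplified sketch, not the exact train.py code; compute_loss stands in for the repo's loss):

```python
nominal_batch_size = 64
accumulate = max(round(nominal_batch_size / batch_size), 1)  # e.g. --batch 8 -> accumulate 8x

optimizer.zero_grad()
for i, (imgs, targets) in enumerate(dataloader):
    loss = compute_loss(model(imgs), targets)
    loss.backward()
    if (i + 1) % accumulate == 0:  # step only once the nominal batch size is reached
        optimizer.step()
        optimizer.zero_grad()
```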
@MagicFrogSJTU , I reran the drop_last run for 5s, and it gave good results. Maybe there were duplicate samples in the last batch causing the accuracy loss.
~There’s a function to broadcast buffers.~
~https://github.com/facebookresearch/ClassyVision/commit/16a66a85f58dacf725e11b1a3643178b4616e48d~
@glenn-jocher , what should our benchmark be then? 2 epoch ? 3? 5? 10?
Do you mean drop_last for the dataloader class?
And drop_last for the test dataloader, but not for the train dataloader?
What is the purpose of broadcasting buffers, and how does it work?
As for benchmarking, I suggest using the default LR and the default batch size (64), for 2 epochs.
Do you mean drop_last for the dataloader class?
And drop_last for test dataloader, but not for train data loader?
@MagicFrogSJTU Yes. I set drop_last = (train_sampler is not None). I sent a PR to your branch.
As for broadcast buffers, it's just something I read; it's said to sync parameters across GPUs, but I'm not sure if it does the same thing as SyncBatchNorm.
Edit: Sorry, when reading, I thought it was within DDP.py, but it was in another of FB's repos.
I added benchmark for running 5m.
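Concretely, the drop_last change is just this (sketch):

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        sampler=train_sampler,
                        num_workers=num_workers,
                        pin_memory=True,
                        drop_last=(train_sampler is not None))  # drop the uneven last batch only in DDP
```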
I trained with your drop_last change but get the same 0.006. I wonder if you have made other changes to the code?
Plus, by setting drop_last = (train_sampler is not None) you are setting it to True during training but not during testing.
For that run, I changed two things:
- drop_last
- init process to use TCP instead. The reason was that I am running 2-3 runs simultaneously, so I had to change the port they communicate on. (Could TCP be more accurate?)
~The reason I set drop_last, to my understanding, is that if the data divided among the GPUs isn't split evenly, the optimizer may run a step when each GPU has processed only (for example) one extra image in the last batch. We don't need to drop it for testing because testing runs on only one GPU, but we can try.~
Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler, but I don't have a GPU available.
I trained with your drop_last change but get the same 0.006. I wonder if you have made other changes to the code?
The result was for 5s on 2 GPUs. I haven't tested on 4 GPUs yet because none are available.
From my test experiments, the performance should be:
1 GPU = DDP 1 GPU > DDP 2 GPUs > DDP 4 GPUs.
drop_last may not affect the performance.
I have tested the code. It didn't gain better performance.
Plus, use
python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
to allow parallel trainings.
python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
Thanks. I misunderstood this at first. You mean DP runs, right? Or do you mean two different trainings at the same time?
From my table, 1-GPU DDP and 2-GPU DDP are almost equal to each other and to the default repo's 1 GPU.
I have tested the code. It didn't gain better performance.
I would like to re-run without it to check, but I want my current runs to reach 300 epochs for once.
I mean DDP. You don't have to run in TCP mode. But anyway, as long as you can run in parallel now.
That's weird. You mean 1 GPU = DDP 1 GPU = DDP 2 GPUs > DDP 4 GPUs?
From the table above, the values are quite close. DDP on 1 GPU = 2 GPUs, but with reduced time (look at the first and third rows). I do not have DDP 4-GPU numbers yet.
I am taking a 2-gpu run to try reproducing your results.
I cloned your feature branch and changed the two things I noted above. That's all. I also checked with git status and git diff to be sure.
TCP is not needed; it is actually the same as the original env:// way.
I couldn't reproduce your results. See my table above.
The DDP 2-GPU run without drop_last also gets a high score.
Interesting. I won't be able to try reproducing my results today as I am still training (I want to see performance in the long run). I think drop_last is not bad either; I see many places using it.
Edit: I attached the results.txt for those two runs below: 5s running 2 epochs on 2 GPUs with coco2017. Can you please check if something is off?
# no drop
0/1 4.89G 0.09067 0.09974 0.1115 0.3019 61 640 0.02479 0.003097 0.003624 0.0009761 0.086 0.09304 0.08909
1/1 5.32G 0.07448 0.1009 0.0743 0.2497 95 640 0.1698 0.06755 0.06125 0.02524 0.07035 0.08714 0.06414
# drop first time
0/1 5.47G 0.09052 0.0998 0.1116 0.302 475 640 0.06563 0.01121 0.01243 0.003548 0.08014 0.09001 0.08033
1/1 5.35G 0.0744 0.1009 0.07411 0.2494 306 640 0.1528 0.08374 0.05756 0.02394 0.06993 0.08806 0.06423
# drop second time
0/299 5.47G 0.09078 0.09982 0.1124 0.303 475 640 0.04087 0.007801 0.01088 0.003195 0.08014 0.08866 0.08194
1/299 5.35G 0.07564 0.09939 0.07276 0.2478 306 640 0.1644 0.08547 0.06728 0.02641 0.06972 0.08669 0.06268
Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler.
Do you think this will make a difference?
The DDP 2gpu without droplast also get high score.
Does this mean the issue only occurs with 4 GPUs?
drop_last can be safely added.
It is weird that DDP works with 2 GPUs but not 4 GPUs. I tend to believe this is a fluctuation.
Let's see your long-run performance.
Use this:
single_gpu default code
| epoch | mAP |
| --- | --- |
|1 | 0.0115/0.00382 |
|5 | 0.245/0.126 |
|10 | 0.341/0.19 |
|50 | 0.458/0.275 |
Result as of now:
Magic 4 GPU Pre-merge Total Batch Size = 64
| epoch | mAP @0.5 |
| --- | --- |
| 0 | 0.00559 |
| 5 | 0.194 |
|10 | 0.259 |
| 50 | 0.392 |
| 100 | 0.437 |
| 150 | 0.464 |
| 170 | 0.473 |
Magic 2 GPU Post-merge Drop_last Total Batch Size = 64
| epoch | mAP @0.5 |
| --- | --- |
| 0 | 0.0109 |
| 5 | 0.27 |
|10 | 0.336 |
| 50 | 0.454 |
| 100 | 0.488 |
The 2-GPU run looks very similar to the 1-GPU default. Do you have numbers for higher epochs too?
Sorry, no.
The 2-GPU DDP looks good! I think epoch 50 is enough!
Please train a 4-GPU DDP run with drop_last!
Hmm, I think glenn can provide it for us. The chart in the README doesn't give the numbers, unfortunately. I will stop my current 4-GPU run because it doesn't look good any more.
Don't worry. Epoch 50 is long enough.
@NanoCode012
Just to keep you posted:
I am currently working on https://github.com/pytorch/pytorch/issues/41101
results_YOLOv5s.txt
results_YOLOv5m.txt
I've attached 5s and 5m results.txt for official weights.
50 epochs should be more than enough for comparison. Make sure to use python train.py --epochs 300 and CTRL+C at epoch 50 though rather than using python train.py --epochs 50. The second command will give you much better results at epoch 50, as the LR scheduler runs fully.
What is drop_last? We can't be dropping any batches from the training or testing (!).
Ok. I've reset it to run without it. Apologies for my incorrect assumption.
Edit: added tables.
Some git error caused the run to be paused, and I didn't notice till now.
| Model | Epoch 1 | Epoch 2 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
| --- | --- | --- | --- | --- | --- | --- |
| Default | 0.01011 | 0.05264 | 0.2201 | 0.3411 | 0.3907 | 0.4519 |
| Magic 5s 2 GPU | 0.0123 | 0.0696 | 0.239 | 0.334 | 0.397 | 0.455 |
| Magic 5s 4 GPU | 0.00703 | 0.0463 | 0.168 | 0.253 | 0.326 | 0.3922 |
| Magic 5s 4 GPU Torch 1.6 | 0.004 / 0.00613 | 0.0387 / 0.0428 | - / 0.165 | - / 0.252 | - | - |
| Magic 5s 8 GPU Torch 1.6 | 0.003761 | 0.02131 | 0.08683 | 0.1417 | 0.2052 | - |
We see that drop_last was not the reason for the 2-GPU accuracy; it was just like that anyway. However, what's confusing is that if it worked for 2, why didn't it work for 4?
A warning I got on first epoch.
0/299 3.64G 0.09153 0.1 0.1131 0.3047 51 640
python3.7/site-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
I fixed this in the latest branch.
It is weird that it didn't work!
My assumption is that there is an error growing exponentially with the GPU count.
I fixed this in the latest branch.
Thanks! I saw it.
My assumption is that there is an error growing exponentially with the GPU count.
But why does 2-GPU DDP slightly outperform the 1-GPU default if there is an error?
Edit: 2-GPU DDP is starting to even out with the 1-GPU default now.
I have a question. What is your running environment? In particular, your PyTorch version? Are you using PyTorch 1.6.0?
In particular, your PyTorch version?
I'm following the README, which says the minimum version is 1.5; mine is 1.5.1, running in a conda env with Python 3.7.
Waiting on epoch 50 for both runs.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.
There is a possibility that 4-GPU DDP will work with PyTorch 1.6.
I can't do that, because I can't change the NVIDIA driver on my machine, and pytorch:20.06 needs the latest NVIDIA driver.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.
Hi, I have never used Docker before, so may I ask a few questions?
1) I do not see a version for PyTorch 1.6 on Docker Hub. Do you mean to build it myself from source?
2) The PyTorch nightly build via conda is version 1.7.
Use the Dockerfile in the yolov5 repo. You can see that it is built from nvcr.io/nvidia/pytorch:20.06-py3.
@MagicFrogSJTU , hello, I built the Docker image and ran it under compatibility mode for the NVIDIA drivers. I updated the results above. I'm going to re-run again to be sure. (I checked that it's PyTorch 1.6.0a0+9907a3e.)
It is getting worse. Damn!
Could be the randomness. The second time was much better. Gonna get it to epoch 5 to see.
I don't have a clue now. Emm... What do you think?
I have no clue. Could be that 2 GPU is the limit?
Weeks ago, I trained a BERT model with 8-GPU DDP. Although I didn't train a 1-GPU model to verify its correctness, I don't think there is a limit of 2 GPUs. That would be too silly.
I have just trained a model with the nightly-built PyTorch (1.7). Got 0.00713 for epoch 1. Damn.
Yep, that's why I say I have no clue. If the 2-GPU run were also performing poorly, we could say that something went wrong, but it's not, which is what makes this confusing. Hmm, how about 8 GPUs? Can you try? How would it perform?
Edit: I added my test for 8.
I suggest we train a whole 300 epochs with DDP on 4 GPUs.
If the network converges to the same accuracy in the end, let's close this problem and leave the rest for the future. This is too tiring.
If not, we just leave it to Glenn to decide.
What do you think?
Sure. I updated the table for 8 GPUs. It seems that more GPUs decrease accuracy at the start. We should see how long it takes for them to converge (if they do, that is).
@MagicFrogSJTU , hello, my test for 4 GPUs is done. It took 41.485 hours for 300 epochs, which doesn't seem right, and the results did not converge at the end. See the graph below for a comparison between the official results and mine.
[plot: mAP comparison, official results vs my 4-GPU DDP run]
How did yours go?
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
EDIT: if a T4 takes 6 days by itself, and you are at less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there:
https://github.com/ultralytics/yolov3#speed
EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (0.6-0.7) vs the 1.0 in master, which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005), and is definitely not the source of the difference.
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
Yes, it is.
Order of init: Model > SyncBatch > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatch.)
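For reference, the order in my run looks roughly like this (Model and ModelEMA are this repo's classes; the import paths and local_rank handling are simplified/assumed):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from models.yolo import Model           # repo class (import path assumed)
from utils.torch_utils import ModelEMA  # repo class (import path assumed)

model = Model(opt.cfg).to(device)                                # 1. build model
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)     # 2. SyncBatchNorm
ema = ModelEMA(model) if local_rank in [-1, 0] else None         # 3. EMA (rank 0 only)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # 4. DDP wrap last
```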
41 hours sounds a bit high for 4 gpu training, depending on the gpus of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drap :(
EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4's to take less than 2 days. The latest multi-gpu benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there:
https://github.com/ultralytics/yolov3#speed
It was run on 4 V100s. That is the part that confused me. I ran without notest and nosave. ~I could not recall precisely the time per epoch. It was about 12-17 minutes for training and about 7-10 minutes for testing.~ Edit: Table below.
I am thinking whether I will re-run for 2 GPU since it produced good results for <50 epoch runs.
Yes, it is.
Order of init: Model>SyncBatch>Ema>DDP (If this is wrong, it might be cause. It was originally Ema>SyncBatch.)
I have tried training with both Model>SyncBatch>Ema>DDP and Model>Ema>SyncBatch>DDP, and got similar results.
By the way, I have even tried training without EMA, and got similar results. @NanoCode012 You may want to give it a try, because there is a chance that I did it incorrectly.
DDP is now working on 2 GPUs but not on 4 GPUs: 2 GPUs produce similar mAP, while 4 GPUs get worse results.
@NanoCode012 and I have done many tests to try to find the source of the difference, but failed.
This seems strange, and I don't have a clue now.
Oh boy, yes, if this was with 4 V100s then there is definitely a problem somewhere. I don't know what the yolov5s time should be on a single V100, since I use a T4 to train 5s (V100 for m, l, x), but I know testing time alone should only be about 1 minute per epoch (certainly no more than 2 minutes). I'll post a screenshot here; this is a current GCP VM with one V100 training yolov5m.yaml, all default settings (I'm retraining all models with a few tweaks this week).
[screenshot: single-V100 yolov5m training progress]
Hello, I just did a quick run to check speed. I was way off the mark; I think I was remembering a time from testing another model.
Now, I also set OMP_NUM_THREADS=1, as recommended by the launcher's warning:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Using Magic feature branch,
| Num GPU | Train min per epoch | Train iter per second | Test min per epoch | first epoch mAP |
| --- | --- | --- | --- | --- |
| 1 | 9-11 | 2.8-3 | 1:08 | 0.0125 |
| 2 | 8-9 | 3.4-3.6 | 1:14 | 0.00988 |
| 4 | 6-7 | 4.3-4.5 | 1:12 | - |
@MagicFrogSJTU , since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
@NanoCode012 oh, those are much faster. Ok, that makes more sense. So the problem is not the speed; the problem lies in reproducing the mAP.
EDIT: Also, the speed multiple is not as high as it could be, since I assume you are keeping the batch size fixed. In practice you'd probably want to increase your batch size linearly with your GPU count to take advantage of the extra GPU RAM.
@MagicFrogSJTU , since DataParallel worked quite well for 1-4 GPU, why don't we use that instead of DDP?
Because, theoretically, DDP is much faster.
Well, right now we have no idea where the issue lies.
Why don't we clean up the DP code and PR it into the main branch? This would be an improvement for the current repo. Then maybe we can close this issue until someone finds out why DDP fails and fixes it, with our issues as a guide.
@NanoCode012 @MagicFrogSJTU I started new 1x and 2x T4 GPU training runs for yolov5s using the current default DDP code (no SyncBatchNorm). Blue is 1x, orange is 2x. The epoch times are 29 min single GPU, 19 min double GPU, both using the same exact command (--batch 64). The current difference between the two is only 0.001 mAP around epoch 40.
[plot: yolov5s mAP, 1x vs 2x T4]
@glenn-jocher Cool, nice!
Should we clean up the code and make a pull request, as @NanoCode012 said?
We may leave the 4-GPU problems for the future, since they seem quite difficult to resolve in the near term.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU yes, if you can clean up the code and consolidate the changes into a PR, that would be good. Make sure you test the updated multi-GPU code against the current multi-GPU code to compare to the current baseline. I think 30 epochs out of 300 is probably enough (as in my example above); I cancelled that training after making the plot because it's obvious it's very close. I will try 4x T4 if I can.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU , from your past results with DP (https://github.com/ultralytics/yolov5/issues/264#issuecomment-654809508), we see that the accuracy is similar to the main branch, and you said it was faster as well. Since it is stable for 1-4 GPUs in your results, I feel it is better to use it.
Your DDP is quite experimental right now (only 1-2 GPUs), so I am not sure it is appropriate to add yet, as some people might be confused when using >= 4 GPUs. Of course, this is all up to glenn.
I have two runs going right now: one for DDP on the main repo, the other setting the main repo to use DP. I wanted to see if there are any benefits in accuracy and time for 2 GPUs. Right now, they perform similarly.
| Type | Epoch 1 | Time per epoch |
| --- | --- | --- |
| DP | 0.011 | 11:50 |
| DDP | 0.0124 | 11:55 |
I just removed init_process and changed torch.nn.parallel.DistributedDataParallel to torch.nn.DataParallel to enable DP in the main repo.
Please tell me what you've decided, and I can help add changes to your code/running it.
DP is already set up in the original code. No change needed, actually...
Ah, I see. I was just wondering because, since DP and DDP are implemented differently, I wanted to test if there were noticeable differences between DP and single-process DDP.
Or was I confused about what you meant? Were you calling single-process DDP "DP"? I was under the assumption they were different.
Sorry, my bad.
The original code doesn't implement DP.
DP is activated by model = torch.nn.DataParallel(model),
while DDP by model = DDP(model, device_ids=[local_rank], output_device=local_rank).
In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated across different GPUs) is implemented as DP internally in PyTorch.
@NanoCode012
I want to rebase the commits to merge them into one, because there are too many of them.
Let's make two commits, one for each of us, so that both of us have the honor of contributing! What do you think?
In fact, single process DDP with multiple gpus (if network are not explicitly allocated on difference gpus) is implemented as DP internally in pytorch.
Thanks for clarification.
I want to rebase the commits to merge into one, because there are too many of it.
Let's make two commits, each of us get one! So that all of us can have the honor of contributions! What do you think?
Sure! That'll be great. Can we first update Feature/DDP-fixed to the latest repo version, so we compare from the same level before we rebase? There are a lot of changes in the main repo since we last branched off. Then I'll run it to benchmark.
Yes. My plan is: I will do the merge with master!
Hello glenn, do you have an updated unit-test script? The one you gave before does not work with weights/last.pt, since it was moved to the runs directory.
@NanoCode012 ah yes, you are correct! This is the latest. The typical use case is to drop this into a single command, i.e. put it in unit_tests.sh and then run bash unit_tests.sh; an exit code of 0 means passing.
I'm trying to automate all of this as part of the CI pipeline, using GitHub Actions for example, but for now it's a bit of a manual nightmare; we just run this in Colab as often as possible to make sure recent commits haven't broken anything on single-GPU or CPU.
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
export PYTHONPATH="$PWD" # to run *.py. files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
python train.py --weights $x.pt --cfg $x.yaml --epochs 3 --img 320 --device 0,1 # train
for di in 0,1 0 cpu # inference devices
do
python detect.py --weights $x.pt --device $di # detect official
python detect.py --weights runs/exp0/weights/last.pt --device $di # detect custom
python test.py --weights $x.pt --device $di # test official
python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
done
python models/yolo.py --cfg $x.yaml # inspect
python models/export.py --weights $x.pt --img 640 --batch 1 # export
done
While running the default code on 4 GPUs, I am quite confused why only the first GPU uses a large amount of memory.
[screenshot: GPU memory usage across the 4 GPUs]
Furthermore, the speed of 4 GPUs vs 1 GPU is quite similar. I set both to run at the same time, and they are on the same epoch (34). It could be my CPU bottlenecking, though.
The 4-GPU DP should be faster. GPU 0 using more memory in DP is normal.
Fixed in #401