Yolov5: Bug: MultiGPU model saving size is 4X single GPU size

Created on 1 Jul 2020  ยท  24Comments  ยท  Source: ultralytics/yolov5

@glenn-jocher
YOLOV5S on the official website is 7.5MB๏ผŒWhy the model I trained is 58.4M

bug

Most helpful comment

Hi all. I've met the same problem and solved by myself.
Actually @jh-park-mindslab commented resoanably, and #124 has the answer, however it also does not explain the issue completely.
During training a ckpt is saved with the name of "last.pt" or "best.pt" as fp32 model + fp32 optimizer, so its size is 56M.
At the last epoch the last checkpoint is saved with the name of "last.pt" as fp32 model without fp32 optimizer, so its size is 28M.

        if save:
            with open(results_file, 'r') as f:  # create checkpoint
                ckpt = {'epoch': epoch,
                        'best_fitness': best_fitness,
                        'training_results': f.read(),
                        'model': ema.ema.module if hasattr(model, 'module') else ema.ema,
                        'optimizer': None if final_epoch else optimizer.state_dict()}

            # Save last, best and delete
            torch.save(ckpt, last)
            if (best_fitness == fi) and not final_epoch:
                torch.save(ckpt, best)
            del ckpt

After the end of training these "best.pt" of which size is 56M and "last.pt" of which size is 28M are renamed as "best_[name].pt" and "last_[name].pt" before they are reduced to fp16 models which means their sizes are 14M respectively.
BUT HERE IS THE POINT. It is only if you set the parameter --name in your command line. If not, "best.pt"(56M) and "last.pt"(28M) remain intact, no 14M models of the above comment. For example,

train.py --name meter --img-size 640 --batch 16 --data ./meter.yaml --cfg ./models/yolov5s_meter.yaml --epochs 2

Below is the code in train.py which explains the issue:

    n = opt.name
    if len(n):
        n = '_' + n if not n.isnumeric() else n
        fresults, flast, fbest = 'results%s.txt' % n, wdir + 'last%s.pt' % n, wdir + 'best%s.pt' % n
        for f1, f2 in zip([wdir + 'last.pt', wdir + 'best.pt', 'results.txt'], [flast, fbest, fresults]):
            if os.path.exists(f1):
                os.rename(f1, f2)  # rename
                ispt = f2.endswith('.pt')  # is *.pt
                strip_optimizer(f2) if ispt else None  # strip optimizer
                os.system('gsutil cp %s gs://%s/weights' % (f2, opt.bucket)) if opt.bucket and ispt else None  # upload

All 24 comments

Here is the same problem.
I trained with the YOLOv5s model, but the weight file is 56 MB in size.

the default saved model contains optimizer params also whole module define, if only weights it should smaller.

How did you do the epochs option when learning?
Strip_optimizer() is called after all learning is done.
The weight file is large before training is complete, but the weight file is reduced when training is complete.

When I finished learning, the capacity was halved. 56M -> 28M
However, the weight file is still larger than 7 MB.

me too

@glenn-jocher
YOLOV5S on the official website is 7.5MB๏ผŒWhy the model I trained is 58.4M

You are incorrect. The model contains 7.5M parameters.

Current file sizes (may change in the future) are here:
Screen Shot 2020-07-01 at 12 55 54 PM

Hi all. I've met the same problem and solved by myself.
Actually @jh-park-mindslab commented resoanably, and #124 has the answer, however it also does not explain the issue completely.
During training a ckpt is saved with the name of "last.pt" or "best.pt" as fp32 model + fp32 optimizer, so its size is 56M.
At the last epoch the last checkpoint is saved with the name of "last.pt" as fp32 model without fp32 optimizer, so its size is 28M.

        if save:
            with open(results_file, 'r') as f:  # create checkpoint
                ckpt = {'epoch': epoch,
                        'best_fitness': best_fitness,
                        'training_results': f.read(),
                        'model': ema.ema.module if hasattr(model, 'module') else ema.ema,
                        'optimizer': None if final_epoch else optimizer.state_dict()}

            # Save last, best and delete
            torch.save(ckpt, last)
            if (best_fitness == fi) and not final_epoch:
                torch.save(ckpt, best)
            del ckpt

After the end of training these "best.pt" of which size is 56M and "last.pt" of which size is 28M are renamed as "best_[name].pt" and "last_[name].pt" before they are reduced to fp16 models which means their sizes are 14M respectively.
BUT HERE IS THE POINT. It is only if you set the parameter --name in your command line. If not, "best.pt"(56M) and "last.pt"(28M) remain intact, no 14M models of the above comment. For example,

train.py --name meter --img-size 640 --batch 16 --data ./meter.yaml --cfg ./models/yolov5s_meter.yaml --epochs 2

Below is the code in train.py which explains the issue:

    n = opt.name
    if len(n):
        n = '_' + n if not n.isnumeric() else n
        fresults, flast, fbest = 'results%s.txt' % n, wdir + 'last%s.pt' % n, wdir + 'best%s.pt' % n
        for f1, f2 in zip([wdir + 'last.pt', wdir + 'best.pt', 'results.txt'], [flast, fbest, fresults]):
            if os.path.exists(f1):
                os.rename(f1, f2)  # rename
                ispt = f2.endswith('.pt')  # is *.pt
                strip_optimizer(f2) if ispt else None  # strip optimizer
                os.system('gsutil cp %s gs://%s/weights' % (f2, opt.bucket)) if opt.bucket and ispt else None  # upload

@rcg12387 ah, good job! Yes I think you are right, we had not noticed this since we always --name our training runs. I will try to push a fix for this so that last.pt and best.pt are both stripped of optimizer and quantized to FP16 before saving at the end of the run.

This fix will only change the final checkpoint size at the very end of training. It's important to note that checkpoints saved during training will always be 4X larger (FP32 model, FP32 optimizer) than the official saved checkpoints (FP16 model only).

Ok, fix has been pushed. Final checkpoints will now be saved as FP16 sizes when training is complete, regardless of--nameusage. Checkpoints saved during training will be 4X the final size, as they carry FP32 models and FP32 optimizers.

I trained with the YOLOv5x model using Multi-GPU , but the weight file is 886 MB in size.
I trained with the YOLOv5x model using Single GPU , the weight file is 177 MB in size.

@ZJU-lishuang yes, I can reproduce your same numbers. That's no good. Something is definitely not correct with multigpu saving.

@ZJU-lishuang I can't find the source of the problem at the moment. If you manage to debug this or find the cause, please let us know!

@ZJU-lishuang @rcg12387 @LVROBOT this issue was traced down to the EMA updating extraneous attributes during the EMA.update_attr() call. This should be resolved now in https://github.com/ultralytics/yolov5/commit/a586751904e0f439d40b1a98ad1cbef5fa856761

Please git pull or clone a new copy of the repo and try again. Thank you!

Fix verified with 2x T4 VM below. Note filesize is now displayed after training completes as an added feature.

$ python train.py --epochs 3 
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
Namespace(batch_size=16, bucket='', cache_images=False, cfg='models/yolov5s.yaml', data='data/coco128.yaml', device='', epochs=3, evolve=False, hyp='', img_size=[640, 640], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
           device1 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Hyperparameters {'optimizer': 'SGD', 'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    641792  models.common.BottleneckCSP             [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    378624  models.common.BottleneckCSP             [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     95104  models.common.BottleneckCSP             [256, 128, 1, False]          
 18                -1  1     32895  torch.nn.modules.conv.Conv2d            [128, 255, 1, 1]              
 19                -2  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 20          [-1, 14]  1         0  models.common.Concat                    [1]                           
 21                -1  1    313088  models.common.BottleneckCSP             [256, 256, 1, False]          
 22                -1  1     65535  torch.nn.modules.conv.Conv2d            [256, 255, 1, 1]              
 23                -2  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 24          [-1, 10]  1         0  models.common.Concat                    [1]                           
 25                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 26                -1  1    130815  torch.nn.modules.conv.Conv2d            [512, 255, 1, 1]              
 27      [-1, 22, 18]  1         0  models.yolo.Detect                      [80, [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]]
Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients

Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 128/128 [00:00<00:00, 14557.24it/s]
Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 128/128 [00:00<00:00, 15628.52it/s]

Analyzing anchors... Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 test
Using 8 dataloader workers
Starting training for 3 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       0/2     2.92G    0.1183    0.1104    0.2131    0.4418       169       640: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:10<00:00,  1.36s/it]
               Class      Images     Targets           P           R      [email protected]  [email protected]:.95: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:07<00:00,  1.01it/s]
                 all         128         929           0           0           0           0

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       1/2     2.91G    0.1149    0.1121    0.2125    0.4394       187       640: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:03<00:00,  2.51it/s]
               Class      Images     Targets           P           R      [email protected]  [email protected]:.95: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:01<00:00,  5.74it/s]
                 all         128         929           0           0           0           0

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       2/2     3.94G    0.1124    0.1184    0.2118    0.4426       187       640: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:03<00:00,  2.54it/s]
               Class      Images     Targets           P           R      [email protected]  [email protected]:.95: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 8/8 [00:01<00:00,  5.46it/s]
                 all         128         929           0           0           0           0
Optimizer stripped from runs/exp7/weights/last.pt, 15.2MB
Optimizer stripped from runs/exp7/weights/best.pt, 15.2MB
3 epochs completed in 0.009 hours.

@glenn-jocher Hi~~~
I found out a workaroud :

in utils/torch_utils: line 220


    def update_attr(self, model):
        # Update EMA attributes
        for k, v in model.module.__dict__.items() if is_parallel(model) else model.__dict__.items():

            self.ema.names= model.names
            self.ema.hyp=model.hyp
            self.ema.gr=model.gr
            self.ema.nc=model.nc
            if not k.startswith('_') and k not in ["process_group", "reducer"]:
                setattr(self.ema, k, v)


in test.py: line 67:

names = model.names #if hasattr(model, 'names') else model.module.names
Analyse:
ema.update_attr() set all models' parameter in ema.ema which is supposed to hold only one model's parameters.
when training, it calls ema.update(), which is fine as ema.ema was initialized with one model, and update with one model.
And later when calling test and save , after ema.update_attr() called when each epoch ends , ema.ema holds 2 or more model's parameters.

I test it under 2 cards DistributedDataParallel mode, for yolo5x, it is 170~ mb as normal as single card situation

I just found out it was solved just 8 hours ago, LOL.

@glenn-jocher Hi~~~
I tested a586751 locally
it throw out:


File "train.py", line 396, in <module>
    train(hyp)
  File "train.py", line 295, in train
    results, maps, times = test.test(opt.data,
  File "/home/wangying/Projects/project_wheat_detect/own/yolov5-master_DEV/test.py", line 91, in test
    loss += compute_loss([x.float() for x in train_out], targets, model)[1][:3]  # GIoU, obj, cls
  File "/home/wangying/Projects/project_wheat_detect/own/yolov5-master_DEV/utils/utils.py", line 461, in compute_loss
    tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype)  # giou ratio
  File "/home/wangying/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 593, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Model' object has no attribute 'gr'

in train.py it needs to add 'gr':
ema.update_attr(model, include=['md', 'nc', 'hyp','gr', 'names', 'stride'])

@AlexWang1900 oh, thanks buddy. I just patched this problem today though, so if you git pull you should no longer see the effect.

Yes, I saw the gr problem after the first commit, so there is a second one right after it with the gr fix!
:https://github.com/ultralytics/yolov5/commit/01a73ec08e67479b03376dc24e6a271d6c470db6

There is a much more robust attribute handling now for the ema. The new scheme is an opt-in scheme, so only attributes that are in include are actually updated. I discovered a whole lot of useless attributes were being added, as you found out.
https://github.com/ultralytics/yolov5/blob/1b1681bac9a97f9ce17a77c8c60dbb2c3e1e00d0/train.py#L290

But definitely we need more contributors!! While we don't have any major bugs that come to mind, there may be ways to improve multi-gpu. One major change coming up also is that we want to do away with Apex and integrate torch 1.6 native mixed precision training. If you'd like to try that out that would help a lot.

I'm not sure when 1.6 stable is being released, but you can already start this work with 1.6 nightly.

@AlexWang1900 another TODO item was to improve the dataset visualization. Right now we have a few different scattered images, such as labels.png, results.png etc, which are done with matploblib, but which might benefit from a move to a more advanced plotting package like seaborn, bokeh or plotly. It's not urgent but its something I wanted to explore in the future, because the first step in training well is understanding your data well, and to understand your data well we need better introspection tools. In this respect roboflow actually has some very good dataset introspection tools.

great!I will try to contribute something!!! 
I will start with pytorch1.6 and data visualization
Thanks!!!

---Original---
From: "Glenn Jocher"<[email protected]>
Date: Sun, Jul 12, 2020 11:46 AM
To: "ultralytics/yolov5"<[email protected]>;
Cc: "Mention"<[email protected]>;"AlexWang1900"<[email protected]>;
Subject: Re: [ultralytics/yolov5] Bug: MultiGPU model saving size is 4X single GPU size (#253)

There is a much more robust attribute handling now for the ema. The new scheme is an opt-in scheme, so only attributes that are in include are actually updated. I discovered a whole lot of useless attributes were being added, as you found out.
https://github.com/ultralytics/yolov5/blob/1b1681bac9a97f9ce17a77c8c60dbb2c3e1e00d0/train.py#L290

But definitely we need more contributors!! While we don't have any major bugs that come to mind, there may be ways to improve multi-gpu. One major change coming up also is that we want to do away with Apex and integrate torch 1.6 native mixed precision training. If you'd like to try that out that would help a lot.

I'm not sure when 1.6 stable is being released, but you can already start this work with 1.6 nightly.

โ€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

the train in Multi-GPU is ok,but I meet the "out of memory" problem when eval the mAP.
Can you help me?

@ZJU-lishuang please git pull to make sure you are using the latest code, or pull the latest docker image, as there have been many recent changes.

You are right.
Thanks a lot.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

milind-soni picture milind-soni  ยท  3Comments

Single430 picture Single430  ยท  4Comments

cswwp picture cswwp  ยท  4Comments

abhiksark picture abhiksark  ยท  3Comments

xinxin342 picture xinxin342  ยท  3Comments