@glenn-jocher
YOLOv5s on the official website is 7.5 MB. Why is the model I trained 58.4 MB?
I have the same problem.
I trained with the YOLOv5s model, but the weight file is 56 MB in size.
The default saved checkpoint also contains the optimizer state and the whole module definition; with weights only it would be smaller.
Have you seen this issue?
https://github.com/ultralytics/yolov5/issues/124
How did you set the epochs option when training?
strip_optimizer() is called after all training is done.
The weight file is large before training completes, but it shrinks once training finishes.
When training finished, the size was halved: 56 MB -> 28 MB.
However, the weight file is still larger than 7 MB.
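The observed sizes line up with back-of-the-envelope arithmetic. This is a sketch under assumptions: SGD with one FP32 momentum buffer per parameter, and the small overhead of the pickled module definition is ignored. (The advertised 7.5M figure refers to parameter count, not megabytes.)

```python
# Rough checkpoint sizes for YOLOv5s, using the parameter count
# from the model summary (7.46816e+06 parameters).
params = 7_468_160

fp32_model = params * 4 / 1e6  # FP32 weights, MB
fp32_optim = params * 4 / 1e6  # assumed SGD momentum buffers, MB
fp16_model = params * 2 / 1e6  # FP16 weights, MB

print(f"model + optimizer: ~{fp32_model + fp32_optim:.0f} MB")  # ~60 MB -> the ~56 MB checkpoint
print(f"model only (FP32): ~{fp32_model:.0f} MB")               # ~30 MB -> the ~28 MB checkpoint
print(f"model only (FP16): ~{fp16_model:.0f} MB")               # ~15 MB -> the ~14 MB checkpoint
```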
me too
@glenn-jocher
YOLOv5s on the official website is 7.5 MB. Why is the model I trained 58.4 MB?
You are incorrect. The model contains 7.5M parameters.
Current file sizes (may change in the future) are here:

Hi all. I've met the same problem and solved by myself.
Actually @jh-park-mindslab commented reasonably, and #124 has the answer; however, it does not explain the issue completely either.
During training a ckpt is saved as "last.pt" or "best.pt" containing the FP32 model plus the FP32 optimizer, so its size is 56 MB.
At the last epoch the final checkpoint is saved as "last.pt" with the FP32 model but without the optimizer, so its size is 28 MB.
if save:
    with open(results_file, 'r') as f:  # create checkpoint
        ckpt = {'epoch': epoch,
                'best_fitness': best_fitness,
                'training_results': f.read(),
                'model': ema.ema.module if hasattr(model, 'module') else ema.ema,
                'optimizer': None if final_epoch else optimizer.state_dict()}

    # Save last, best and delete
    torch.save(ckpt, last)
    if (best_fitness == fi) and not final_epoch:
        torch.save(ckpt, best)
    del ckpt
After training ends, "best.pt" (56 MB) and "last.pt" (28 MB) are renamed to "best_[name].pt" and "last_[name].pt" before being reduced to FP16 models, which brings each down to 14 MB.
BUT HERE IS THE POINT: this only happens if you set the --name parameter on your command line. If not, "best.pt" (56 MB) and "last.pt" (28 MB) remain intact, and the 14 MB models of the above comment never appear. For example,
train.py --name meter --img-size 640 --batch 16 --data ./meter.yaml --cfg ./models/yolov5s_meter.yaml --epochs 2
Below is the code in train.py which explains the issue:
n = opt.name
if len(n):
    n = '_' + n if not n.isnumeric() else n
    fresults, flast, fbest = 'results%s.txt' % n, wdir + 'last%s.pt' % n, wdir + 'best%s.pt' % n
    for f1, f2 in zip([wdir + 'last.pt', wdir + 'best.pt', 'results.txt'], [flast, fbest, fresults]):
        if os.path.exists(f1):
            os.rename(f1, f2)  # rename
            ispt = f2.endswith('.pt')  # is *.pt
            strip_optimizer(f2) if ispt else None  # strip optimizer
            os.system('gsutil cp %s gs://%s/weights' % (f2, opt.bucket)) if opt.bucket and ispt else None  # upload
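For intuition, here is a minimal dict-based sketch of what strip_optimizer() does to a checkpoint. This is a hypothetical stand-in, not the repo's function: the real one operates on a saved torch checkpoint and also casts the model weights to FP16.

```python
def strip_optimizer_sketch(ckpt):
    # Drop training-only state so only the inference model remains.
    ckpt['optimizer'] = None  # the optimizer state is roughly as large as the model itself
    ckpt['epoch'] = -1        # mark the checkpoint as final
    # The real strip_optimizer() would additionally halve the weights,
    # roughly: ckpt['model'].half()
    return ckpt

ckpt = {'epoch': 2, 'best_fitness': 0.5, 'model': '<weights>', 'optimizer': '<state>'}
stripped = strip_optimizer_sketch(ckpt)
print(stripped['optimizer'])  # -> None
```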
@rcg12387 ah, good job! Yes I think you are right, we had not noticed this since we always --name our training runs. I will try to push a fix for this so that last.pt and best.pt are both stripped of optimizer and quantized to FP16 before saving at the end of the run.
This fix will only change the final checkpoint size at the very end of training. It's important to note that checkpoints saved during training will always be 4X larger (FP32 model, FP32 optimizer) than the official saved checkpoints (FP16 model only).
Ok, the fix has been pushed. Final checkpoints will now be saved at FP16 sizes when training is complete, regardless of --name usage. Checkpoints saved during training will be 4X the final size, as they carry FP32 models and FP32 optimizers.
I trained with the YOLOv5x model using Multi-GPU, and the weight file is 886 MB in size.
I trained with the YOLOv5x model using a single GPU, and the weight file is 177 MB in size.
@ZJU-lishuang yes, I can reproduce your same numbers. That's no good. Something is definitely not correct with multigpu saving.
@ZJU-lishuang I can't find the source of the problem at the moment. If you manage to debug this or find the cause, please let us know!
@ZJU-lishuang @rcg12387 @LVROBOT this issue was traced down to the EMA updating extraneous attributes during the EMA.update_attr() call. This should be resolved now in https://github.com/ultralytics/yolov5/commit/a586751904e0f439d40b1a98ad1cbef5fa856761
Please git pull or clone a new copy of the repo and try again. Thank you!
Fix verified with 2x T4 VM below. Note filesize is now displayed after training completes as an added feature.
$ python train.py --epochs 3
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
Namespace(batch_size=16, bucket='', cache_images=False, cfg='models/yolov5s.yaml', data='data/coco128.yaml', device='', epochs=3, evolve=False, hyp='', img_size=[640, 640], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='')
Using CUDA device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
device1 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Hyperparameters {'optimizer': 'SGD', 'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
from n params module arguments
0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 19904 models.common.BottleneckCSP [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 161152 models.common.BottleneckCSP [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 641792 models.common.BottleneckCSP [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 378624 models.common.BottleneckCSP [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 95104 models.common.BottleneckCSP [256, 128, 1, False]
18 -1 1 32895 torch.nn.modules.conv.Conv2d [128, 255, 1, 1]
19 -2 1 147712 models.common.Conv [128, 128, 3, 2]
20 [-1, 14] 1 0 models.common.Concat [1]
21 -1 1 313088 models.common.BottleneckCSP [256, 256, 1, False]
22 -1 1 65535 torch.nn.modules.conv.Conv2d [256, 255, 1, 1]
23 -2 1 590336 models.common.Conv [256, 256, 3, 2]
24 [-1, 10] 1 0 models.common.Concat [1]
25 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
26 -1 1 130815 torch.nn.modules.conv.Conv2d [512, 255, 1, 1]
27 [-1, 22, 18] 1 0 models.yolo.Detect [80, [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]]
Model Summary: 191 layers, 7.46816e+06 parameters, 7.46816e+06 gradients
Optimizer groups: 62 .bias, 70 conv.weight, 59 other
Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100%|███████████████| 128/128 [00:00<00:00, 14557.24it/s]
Scanning labels ../coco128/labels/train2017.cache (126 found, 0 missing, 2 empty, 0 duplicate, for 128 images): 100%|███████████████| 128/128 [00:00<00:00, 15628.52it/s]
Analyzing anchors... Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 test
Using 8 dataloader workers
Starting training for 3 epochs...
Epoch gpu_mem GIoU obj cls total targets img_size
0/2 2.92G 0.1183 0.1104 0.2131 0.4418 169 640: 100%|████████████████████████████████████████████████████| 8/8 [00:10<00:00, 1.36s/it]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████| 8/8 [00:07<00:00, 1.01it/s]
all 128 929 0 0 0 0
Epoch gpu_mem GIoU obj cls total targets img_size
1/2 2.91G 0.1149 0.1121 0.2125 0.4394 187 640: 100%|████████████████████████████████████████████████████| 8/8 [00:03<00:00, 2.51it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████| 8/8 [00:01<00:00, 5.74it/s]
all 128 929 0 0 0 0
Epoch gpu_mem GIoU obj cls total targets img_size
2/2 3.94G 0.1124 0.1184 0.2118 0.4426 187 640: 100%|████████████████████████████████████████████████████| 8/8 [00:03<00:00, 2.54it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|████████████████████████████████████████| 8/8 [00:01<00:00, 5.46it/s]
all 128 929 0 0 0 0
Optimizer stripped from runs/exp7/weights/last.pt, 15.2MB
Optimizer stripped from runs/exp7/weights/best.pt, 15.2MB
3 epochs completed in 0.009 hours.
@glenn-jocher Hi~~~
I found a workaround:
in utils/torch_utils: line 220
def update_attr(self, model):
    # Update EMA attributes
    # for k, v in (model.module.__dict__.items() if is_parallel(model) else model.__dict__.items()):
    #     if not k.startswith('_') and k not in ["process_group", "reducer"]:
    #         setattr(self.ema, k, v)
    self.ema.names = model.names
    self.ema.hyp = model.hyp
    self.ema.gr = model.gr
    self.ema.nc = model.nc
in test.py: line 67:
names = model.names  # if hasattr(model, 'names') else model.module.names
Analysis:
ema.update_attr() copies every model's parameters into ema.ema, which is supposed to hold only one model's parameters.
During training it calls ema.update(), which is fine: ema.ema was initialized from one model and is updated from one model.
But later, when test and save are called after ema.update_attr() runs at the end of each epoch, ema.ema holds two or more models' parameters.
I tested this under 2-card DistributedDataParallel mode; for YOLOv5x the file is ~170 MB, the same as in the single-card case.
I just found out it was solved just 8 hours ago, LOL.
@glenn-jocher Hi~~~
I tested a586751 locally.
It throws:
  File "train.py", line 396, in <module>
    train(hyp)
  File "train.py", line 295, in train
    results, maps, times = test.test(opt.data,
  File "/home/wangying/Projects/project_wheat_detect/own/yolov5-master_DEV/test.py", line 91, in test
    loss += compute_loss([x.float() for x in train_out], targets, model)[1][:3]  # GIoU, obj, cls
  File "/home/wangying/Projects/project_wheat_detect/own/yolov5-master_DEV/utils/utils.py", line 461, in compute_loss
    tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype)  # giou ratio
  File "/home/wangying/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 593, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Model' object has no attribute 'gr'
In train.py, 'gr' needs to be added:
ema.update_attr(model, include=['md', 'nc', 'hyp','gr', 'names', 'stride'])
@AlexWang1900 oh, thanks buddy. I just patched this problem today though, so if you git pull you should no longer see the effect.
Yes, I saw the gr problem after the first commit, so there is a second one right after it with the gr fix!
https://github.com/ultralytics/yolov5/commit/01a73ec08e67479b03376dc24e6a271d6c470db6
There is much more robust attribute handling for the EMA now. The new scheme is opt-in, so only attributes listed in include are actually updated. I discovered a whole lot of useless attributes were being added, as you found out.
https://github.com/ultralytics/yolov5/blob/1b1681bac9a97f9ce17a77c8c60dbb2c3e1e00d0/train.py#L290
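The opt-in idea can be sketched like this. This is a simplified stand-in, not the real ModelEMA class; the class and attribute names are illustrative.

```python
import copy

class Model:
    """Toy model: a few public attributes plus state that must not leak into the EMA."""
    def __init__(self):
        self.names = ['person', 'car']
        self.gr = 1.0
        self._internal = object()  # private state, should never be copied
        self.reducer = object()    # DDP machinery, copying it caused the bloated checkpoints

class EMASketch:
    def __init__(self, model):
        self.ema = copy.deepcopy(model)  # EMA keeps its own copy of the model

    def update_attr(self, model, include=()):
        # Opt-in: copy only explicitly whitelisted attributes to the EMA copy,
        # instead of blanket-copying everything in model.__dict__.
        for k in include:
            setattr(self.ema, k, getattr(model, k))

model = Model()
ema = EMASketch(model)
model.gr = 0.5
ema.update_attr(model, include=['names', 'gr'])
print(ema.ema.gr)  # -> 0.5; ema.ema.reducer is still its own object, not model.reducer
```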
But definitely we need more contributors!! While we don't have any major bugs that come to mind, there may be ways to improve multi-gpu. One major change coming up also is that we want to do away with Apex and integrate torch 1.6 native mixed precision training. If you'd like to try that out that would help a lot.
I'm not sure when 1.6 stable is being released, but you can already start this work with 1.6 nightly.
@AlexWang1900 another TODO item was to improve the dataset visualization. Right now we have a few different scattered images, such as labels.png, results.png etc., which are done with matplotlib, but which might benefit from a move to a more advanced plotting package like seaborn, bokeh or plotly. It's not urgent, but it's something I wanted to explore in the future, because the first step in training well is understanding your data well, and to understand your data well we need better introspection tools. In this respect roboflow actually has some very good dataset introspection tools.
Great! I will try to contribute something!!!
I will start with PyTorch 1.6 and data visualization.
Thanks!!!
Training on Multi-GPU is OK, but I hit an "out of memory" problem when evaluating the mAP.
Can you help me?
@ZJU-lishuang please git pull to make sure you are using the latest code, or pull the latest docker image, as there have been many recent changes.
You are right.
Thanks a lot.