Hi, there is something wrong when training the model with pretrained weights. I have tried several times, but it always throws the same error.
When I use the following command:
!python train.py --img 1024 --batch 2 --epochs 5 --data ./yoloconfig/wheat0.yaml --cfg ./yoloconfig/yolov5x.yaml --weights ./yolo_weight/wheat/fold0.pt
Error:
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
{'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./yoloconfig/yolov5x.yaml', data='./yoloconfig/wheat0.yaml', device='', epochs=5, evolve=False, img_size=[1024], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='./yolo_weight/wheat/fold0.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)
2020-06-30 20:17:08.025764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
from n params module arguments
0 -1 1 8800 models.common.Focus [3, 80, 3]
1 -1 1 115520 models.common.Conv [80, 160, 3, 2]
2 -1 1 315680 models.common.BottleneckCSP [160, 160, 4]
3 -1 1 461440 models.common.Conv [160, 320, 3, 2]
4 -1 1 3311680 models.common.BottleneckCSP [320, 320, 12]
5 -1 1 1844480 models.common.Conv [320, 640, 3, 2]
6 -1 1 13228160 models.common.BottleneckCSP [640, 640, 12]
7 -1 1 7375360 models.common.Conv [640, 1280, 3, 2]
8 -1 1 4099840 models.common.SPP [1280, 1280, [5, 9, 13]]
9 -1 1 20087040 models.common.BottleneckCSP [1280, 1280, 4, False]
10 -1 1 820480 models.common.Conv [1280, 640, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 5435520 models.common.BottleneckCSP [1280, 640, 4, False]
14 -1 1 205440 models.common.Conv [640, 320, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 1360960 models.common.BottleneckCSP [640, 320, 4, False]
18 -1 1 5778 torch.nn.modules.conv.Conv2d [320, 18, 1, 1]
19 -2 1 922240 models.common.Conv [320, 320, 3, 2]
20 [-1, 14] 1 0 models.common.Concat [1]
21 -1 1 5025920 models.common.BottleneckCSP [640, 640, 4, False]
22 -1 1 11538 torch.nn.modules.conv.Conv2d [640, 18, 1, 1]
23 -2 1 3687680 models.common.Conv [640, 640, 3, 2]
24 [-1, 10] 1 0 models.common.Concat [1]
25 -1 1 20087040 models.common.BottleneckCSP [1280, 1280, 4, False]
26 -1 1 23058 torch.nn.modules.conv.Conv2d [1280, 18, 1, 1]
27 [-1, 22, 18] 1 0 models.yolo.Detect [1, [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]]
Model Summary: 407 layers, 8.84337e+07 parameters, 8.84337e+07 gradients
Optimizer groups: 134 .bias, 142 conv.weight, 131 other
Caching labels wheat_data/fold0/labels/wheat_train.npy (2708 found, 0 missing, 0 empty, 0 duplicate, for 2708 images): 100% 2708/2708 [00:00<00:00, 17776.98it/s]
Caching labels wheat_data/fold0/labels/wheat_val (675 found, 0 missing, 0 empty, 0 duplicate, for 675 images): 100% 675/675 [00:00<00:00, 5661.93it/s]
Analyzing anchors... Best Possible Recall (BPR) = 0.9998
Image sizes 1024 train, 1024 test
Using 2 dataloader workers
Starting training for 5 epochs...
Traceback (most recent call last):
File "train.py", line 388, in
train(hyp)
File "train.py", line 346, in train
print('%g epochs completed in %.3f hours.\n' % (epoch - start_epoch + 1, (time.time() - t0) / 3600))
UnboundLocalError: local variable 'epoch' referenced before assignment
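For context, this crash happens when the epoch counter is never assigned: if the checkpoint's saved epoch already reaches or exceeds `--epochs`, the training loop body never runs, so `epoch` is unbound when the completion message is printed. A minimal sketch of the failure pattern (not the actual `train.py`; `start_epoch` here stands in for the value restored from the checkpoint):

```python
def train(start_epoch, epochs):
    # Simplified stand-in for the YOLOv5 training loop.
    for epoch in range(start_epoch, epochs):
        pass  # training work would happen here
    # If start_epoch >= epochs, the loop body never executed,
    # so `epoch` was never assigned in this scope:
    return '%g epochs completed' % (epoch - start_epoch + 1)

# Resuming a checkpoint already trained for >= 5 epochs with --epochs 5:
try:
    train(start_epoch=5, epochs=5)
except UnboundLocalError:
    print("UnboundLocalError: 'epoch' was never assigned")
```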
But if I use the same command without `--weights` (training from scratch, note `weights=''` in the Namespace below), it starts training:
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
{'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./yoloconfig/yolov5x.yaml', data='./yoloconfig/wheat0.yaml', device='', epochs=5, evolve=False, img_size=[1024], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='')
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)
2020-06-30 20:20:39.550777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 407 layers, 8.84337e+07 parameters, 8.84337e+07 gradients
Optimizer groups: 134 .bias, 142 conv.weight, 131 other
Caching labels wheat_data/fold0/labels/wheat_train.npy (2708 found, 0 missing, 0 empty, 0 duplicate, for 2708 images): 100% 2708/2708 [00:00<00:00, 16168.77it/s]
Caching labels wheat_data/fold0/labels/wheat_val (675 found, 0 missing, 0 empty, 0 duplicate, for 675 images): 100% 675/675 [00:00<00:00, 5414.57it/s]
Analyzing anchors... Best Possible Recall (BPR) = 0.9998
Image sizes 1024 train, 1024 test
Using 2 dataloader workers
Starting training for 5 epochs...
Epoch gpu_mem GIoU obj cls total targets img_size
0/4 11.3G 0.1128 0.1595 0 0.2723 155 1024: 1% 8/1354 [00:07<17:50, 1.26it/s]
So do you have any idea about this? Thanks.
Hello @CHC278Cao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
The value you pass to --epochs should be greater than the number of epochs your last weights were already trained for.
For example: with --weights last_yolov5s_visdrone_100.pt, I trained last_yolov5s_visdrone_100.pt for 100 epochs, so next time, to train it for 100 more epochs, I have to pass 200, i.e. 100 + 100.
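The arithmetic above can be sketched as a trivial helper (illustration only, not part of YOLOv5):

```python
def next_epochs(already_trained, additional):
    """--epochs is absolute, so the value to pass when resuming is
    (epochs already trained) + (additional epochs wanted)."""
    return already_trained + additional

# last_yolov5s_visdrone_100.pt was trained for 100 epochs;
# to train 100 more, pass --epochs 200:
print(next_epochs(100, 100))  # → 200
```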
So you mean I have to set a larger epoch count to retrain the model? I already trained the model to get the weights, and after that I only added ten more images to fine-tune it. Do I really need to re-train the model for more epochs with the pretrained weights? I also tried to set a smaller learning rate, but it seems there is no option for that...
Say you trained your previous weights for 500 epochs. Now, to retrain, you set --epochs 505; you'll see that training resumes from epoch 501 and runs up to 505. So in total (previous + now) you've trained for 505 epochs, but this run covers only 5 epochs and takes about 20 minutes (in Google Colab).
@CHC278Cao --epochs is an absolute value, not a relative one. Your model has trained for 500 epochs. If you want to train 5 more (which I would strongly advise against), you train to --epochs 505, not --epochs 5.
I can already tell you, however, that you will end up with a worse result after this process, as the 500 epochs are carefully managed in terms of warmup and LR schedule.
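A sketch of why the total epoch count matters for the schedule: assuming a cosine learning-rate decay spread over the full --epochs range (the exact formula here is an assumption, not YOLOv5's verbatim code), stretching --epochs from 500 to 505 places the entire follow-on run at the tail of the decay curve, near the minimum learning rate:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.01, lrf_frac=0.2):
    # Hypothetical cosine decay from lr0 down to lr0 * lrf_frac
    # over total_epochs; the shape depends on total_epochs.
    return lr0 * (((1 + math.cos(epoch * math.pi / total_epochs)) / 2)
                  * (1 - lrf_frac) + lrf_frac)

# Resuming at epoch 500 with --epochs 505: every remaining epoch
# sits at the flat end of the curve, so little learning happens.
for e in (500, 502, 504):
    print(e, round(cosine_lr(e, 505), 5))
```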
OK, got it. Thanks.