Hi, there is something wrong when training the model with pretrained weights. I have tried several times, but it always throws the same error.
When I use the following command:
!python train.py --img 1024 --batch 2 --epochs 5 --data ./yoloconfig/wheat0.yaml --cfg ./yoloconfig/yolov5x.yaml --weights ./yolo_weight/wheat/fold0.pt
Error:
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
{'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./yoloconfig/yolov5x.yaml', data='./yoloconfig/wheat0.yaml', device='', epochs=5, evolve=False, img_size=[1024], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='./yolo_weight/wheat/fold0.pt')
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)
2020-06-30 20:17:08.025764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
from n params module arguments
0 -1 1 8800 models.common.Focus [3, 80, 3]
1 -1 1 115520 models.common.Conv [80, 160, 3, 2]
2 -1 1 315680 models.common.BottleneckCSP [160, 160, 4]
3 -1 1 461440 models.common.Conv [160, 320, 3, 2]
4 -1 1 3311680 models.common.BottleneckCSP [320, 320, 12]
5 -1 1 1844480 models.common.Conv [320, 640, 3, 2]
6 -1 1 13228160 models.common.BottleneckCSP [640, 640, 12]
7 -1 1 7375360 models.common.Conv [640, 1280, 3, 2]
8 -1 1 4099840 models.common.SPP [1280, 1280, [5, 9, 13]]
9 -1 1 20087040 models.common.BottleneckCSP [1280, 1280, 4, False]
10 -1 1 820480 models.common.Conv [1280, 640, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 5435520 models.common.BottleneckCSP [1280, 640, 4, False]
14 -1 1 205440 models.common.Conv [640, 320, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 1360960 models.common.BottleneckCSP [640, 320, 4, False]
18 -1 1 5778 torch.nn.modules.conv.Conv2d [320, 18, 1, 1]
19 -2 1 922240 models.common.Conv [320, 320, 3, 2]
20 [-1, 14] 1 0 models.common.Concat [1]
21 -1 1 5025920 models.common.BottleneckCSP [640, 640, 4, False]
22 -1 1 11538 torch.nn.modules.conv.Conv2d [640, 18, 1, 1]
23 -2 1 3687680 models.common.Conv [640, 640, 3, 2]
24 [-1, 10] 1 0 models.common.Concat [1]
25 -1 1 20087040 models.common.BottleneckCSP [1280, 1280, 4, False]
26 -1 1 23058 torch.nn.modules.conv.Conv2d [1280, 18, 1, 1]
27 [-1, 22, 18] 1 0 models.yolo.Detect [1, [[116, 90, 156, 198, 373, 326], [30, 61, 62, 45, 59, 119], [10, 13, 16, 30, 33, 23]]]
Model Summary: 407 layers, 8.84337e+07 parameters, 8.84337e+07 gradients
Optimizer groups: 134 .bias, 142 conv.weight, 131 other
Caching labels wheat_data/fold0/labels/wheat_train.npy (2708 found, 0 missing, 0 empty, 0 duplicate, for 2708 images): 100% 2708/2708 [00:00<00:00, 17776.98it/s]
Caching labels wheat_data/fold0/labels/wheat_val (675 found, 0 missing, 0 empty, 0 duplicate, for 675 images): 100% 675/675 [00:00<00:00, 5661.93it/s]
Analyzing anchors... Best Possible Recall (BPR) = 0.9998
Image sizes 1024 train, 1024 test
Using 2 dataloader workers
Starting training for 5 epochs...
Traceback (most recent call last):
File "train.py", line 388, in
train(hyp)
File "train.py", line 346, in train
print('%g epochs completed in %.3f hours.\n' % (epoch - start_epoch + 1, (time.time() - t0) / 3600))
UnboundLocalError: local variable 'epoch' referenced before assignment
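For context, this crash happens when the epoch counter is never assigned: if the checkpoint's saved epoch already reaches or exceeds `--epochs`, the training loop body never runs, so `epoch` is unbound when the completion message is printed. A minimal sketch of the failure pattern (not the actual `train.py`; `start_epoch` here stands in for the value restored from the checkpoint):

```python
def train(start_epoch, epochs):
    # Simplified stand-in for the YOLOv5 training loop.
    for epoch in range(start_epoch, epochs):
        pass  # training work would happen here
    # If start_epoch >= epochs, the loop body never executed,
    # so `epoch` was never assigned in this scope:
    return '%g epochs completed' % (epoch - start_epoch + 1)

# Resuming a checkpoint already trained for >= 5 epochs with --epochs 5:
try:
    train(start_epoch=5, epochs=5)
except UnboundLocalError:
    print("UnboundLocalError: 'epoch' was never assigned")
```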
But if I use the same command without `--weights` (training from scratch, note `weights=''` in the Namespace below), it starts training:
Apex recommended for faster mixed precision training: https://github.com/NVIDIA/apex
{'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
Namespace(adam=False, batch_size=2, bucket='', cache_images=False, cfg='./yoloconfig/yolov5x.yaml', data='./yoloconfig/wheat0.yaml', device='', epochs=5, evolve=False, img_size=[1024], multi_scale=False, name='', noautoanchor=False, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='')
Using CUDA device0 _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', total_memory=16280MB)
2020-06-30 20:20:39.550777: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 407 layers, 8.84337e+07 parameters, 8.84337e+07 gradients
Optimizer groups: 134 .bias, 142 conv.weight, 131 other
Caching labels wheat_data/fold0/labels/wheat_train.npy (2708 found, 0 missing, 0 empty, 0 duplicate, for 2708 images): 100% 2708/2708 [00:00<00:00, 16168.77it/s]
Caching labels wheat_data/fold0/labels/wheat_val (675 found, 0 missing, 0 empty, 0 duplicate, for 675 images): 100% 675/675 [00:00<00:00, 5414.57it/s]
Analyzing anchors... Best Possible Recall (BPR) = 0.9998
Image sizes 1024 train, 1024 test
Using 2 dataloader workers
Starting training for 5 epochs...
Epoch gpu_mem GIoU obj cls total targets img_size
0/4 11.3G 0.1128 0.1595 0 0.2723 155 1024: 1% 8/1354 [00:07<17:50, 1.26it/s]
So do you have any idea about this? Thanks.
Hello @CHC278Cao, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
The value you pass to --epochs should be greater than the number of epochs your last weights were already trained for.
For example: with --weights last_yolov5s_visdrone_100.pt, I trained last_yolov5s_visdrone_100.pt for 100 epochs, so next time, to train it for 100 more epochs, I have to pass 200, i.e. 100 + 100.
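The arithmetic above can be sketched as a trivial helper (illustration only, not part of YOLOv5):

```python
def next_epochs(already_trained, additional):
    """--epochs is absolute, so the value to pass when resuming is
    (epochs already trained) + (additional epochs wanted)."""
    return already_trained + additional

# last_yolov5s_visdrone_100.pt was trained for 100 epochs;
# to train 100 more, pass --epochs 200:
print(next_epochs(100, 100))  # → 200
```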
So you mean I have to set a larger epoch count to retrain the model? I already trained the model to get the weights, and after that I only added ten more images to fine-tune it. Do I really need to re-train the model for more epochs with the pretrained weights? I also tried to set a smaller learning rate, but it seems there is no option for that...
Say you trained your previous weights for 500 epochs. Now, to retrain, you set --epochs 505; you'll see that training resumes from epoch 501 and runs up to 505. So in total (previous + now) you've trained for 505 epochs, but this run covers only 5 epochs and takes about 20 minutes (in Google Colab).
@CHC278Cao --epochs is an absolute value, not a relative one. Your model has trained for 500 epochs. If you want to train 5 more (which I would strongly advise against), you train to --epochs 505, not --epochs 5.
I can already tell you, however, that you will end up with a worse result after this process, as the 500 epochs are carefully managed in terms of warmup and LR schedule.
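A sketch of why the total epoch count matters for the schedule: assuming a cosine learning-rate decay spread over the full --epochs range (the exact formula here is an assumption, not YOLOv5's verbatim code), stretching --epochs from 500 to 505 places the entire follow-on run at the tail of the decay curve, near the minimum learning rate:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.01, lrf_frac=0.2):
    # Hypothetical cosine decay from lr0 down to lr0 * lrf_frac
    # over total_epochs; the shape depends on total_epochs.
    return lr0 * (((1 + math.cos(epoch * math.pi / total_epochs)) / 2)
                  * (1 - lrf_frac) + lrf_frac)

# Resuming at epoch 500 with --epochs 505: every remaining epoch
# sits at the flat end of the curve, so little learning happens.
for e in (500, 502, 504):
    print(e, round(cosine_lr(e, 505), 5))
```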
OK, got it. Thanks.