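insightface 🚀 - After training is interrupted, how do I resume training from a previously saved ckpt?

Every time I restart training it starts from scratch. How can I continue training from a previously saved ckpt? Thanks.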

clhne on 27 Feb 2019

see 'pretrained' option in config.py

nttstar on 1 Mar 2019

see 'pretrained' option in config.py

I encountered the same problem.

First, I did not change config.py, whose default settings are
default.pretrained = ' '
default.pretrained_epoch = 1
I trained to epoch 7 and then, for some reason, stopped training manually. The saved model files are named
model-0001.params
model-symbol.json
At that point the training accuracy was 0.61 and the loss value was 3.37:
INFO:root:Epoch[7] Batch [16600-16620] Speed: 214.54 samples/sec acc=0.610938 lossvalue=3.376553

Then I wanted to continue training from the ckpt, so I used the following command:
CUDA_VISIBLE_DEVICES='0,1' python -u train.py --network r100 --loss arcface --dataset emore --pretrained ./models/r100-arcface-emore/model --pretrained-epoch 1

It does load the model.
However, the loss value and training accuracy look just like training from scratch:
INFO:root:Epoch[0] Batch [0-20] Speed: 205.57 samples/sec acc=0.000000 lossvalue=50.857266
INFO:root:Epoch[0] Batch [20-40] Speed: 214.24 samples/sec acc=0.000000 lossvalue=57.416350

I thought the command-line options
--pretrained ./models/r100-arcface-emore/model --pretrained-epoch 1
were equivalent to changing the settings in config.py.

I think this is an important problem, because training the ArcFace model is really time-consuming. How can it be solved? I hope to hear from you at your earliest convenience!

Best regards!

jake221 on 20 Apr 2019

Every time I restart training it starts from scratch. How can I continue training from a previously saved ckpt? Thanks.

Have you solved this problem?

jake221 on 20 Apr 2019

Solved.
Set --pretrained to the model path prefix, and it will load the pretrained model from the checkpoint (the .params and .json files).
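
A quick way to confirm that the prefix and epoch passed on the command line point at real files is sketched below. It assumes the usual MXNet checkpoint naming, where a (prefix, epoch) pair maps to `prefix-symbol.json` and `prefix-%04d.params`, which matches the file names reported earlier in this thread. The helper names are hypothetical and are not part of train.py.

```python
import os


def checkpoint_files(prefix, epoch):
    """Return the two file paths expected for a (prefix, epoch) checkpoint.

    Assumes MXNet-style naming: <prefix>-symbol.json and <prefix>-<epoch:04d>.params.
    """
    return ["%s-symbol.json" % prefix, "%s-%04d.params" % (prefix, epoch)]


def missing_checkpoint_files(prefix, epoch):
    """List the expected checkpoint files that are not present on disk."""
    return [p for p in checkpoint_files(prefix, epoch) if not os.path.isfile(p)]


if __name__ == "__main__":
    # Prefix and epoch taken from the command quoted earlier in the thread.
    missing = missing_checkpoint_files("./models/r100-arcface-emore/model", 1)
    if missing:
        print("missing:", missing)
    else:
        print("both checkpoint files found")
```

Note that the value given to --pretrained is the prefix (ending in `model`), not the full .params file name.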

clhne on 24 Apr 2019

@clhne Hello!
I am training with insightface-master/recognition/train.py. After an interruption I wanted to continue training from the saved model, using this command:
python -u train.py --network r100 --loss arcface --dataset emore --per-batch-size 60 --pretrained D:\insightface\models\r100-arcface-emore\model --pretrained-epoch 1 --models-root D:\insightface\models --lr 0.05
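It seems to load successfully, but whether I use the downloaded pretrained model or a model saved during training, it always starts from the state below. Is this normal?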

C:\Users\wangting\iCloudDrive\Pycharm\DeepLearning\face\insightface-master\recognition>python -u train.py --network r100 --loss arcface --dataset emore --per-batch-size 50 --pretrained D:\insightface\models\r100-arcface-emore\model --pretrained-epoch 1 --models-root D:\insightface\models --lr 0.05
C:\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
gpu num: 4
prefix D:\insightface\models\r100-arcface-emore\model
image_size [112, 112]
num_classes 85742
Called with argument: Namespace(batch_size=200, ckpt=3, ctx_num=4, dataset='emore', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.05, lr_steps='100000,160000,220000', models_root='D:\insightface\models', mom=0.9, network='r100', per_batch_size=50, pretrained='D:\insightface\models\r100-arcface-emore\model', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'E', 'net_multiplier': 1.0, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'fresnet', 'num_layers': 100, 'dataset': 'emore', 'dataset_path': 'F:\facedataset\faces_ms1m_112x112', 'num_classes': 85742, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'r100', 'num_workers': 1, 'batch_size': 200, 'per_batch_size': 50}
loading D:\insightface\models\r100-arcface-emore\model 1
0 1 E 3 prelu False
Network FLOPs: 24.2G
INFO:root:loading recordio F:\facedataset\faces_ms1m_112x112\train.rec...
header0 label [3804847. 3890011.]
id2range 85164
3804846
rand_mirror True
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
C:\Anaconda3\lib\site-packages\mxnet\module\base_module.py:505: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.005). Is this intended?
optimizer_params=optimizer_params)
[17:30:50] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\nn\cudnn./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20] Speed: 157.76 samples/sec acc=0.000000 lossvalue=46.753732
INFO:root:Epoch[0] Batch [20-40] Speed: 159.36 samples/sec acc=0.000000 lossvalue=47.684572
INFO:root:Epoch[0] Batch [40-60] Speed: 158.34 samples/sec acc=0.000000 lossvalue=47.008412

wting861006 on 8 May 2019

Set ckpt_embedding=False to save the fc7 weight matrix.

nttstar on 9 May 2019

Set ckpt_embedding=False to save the fc7 weight matrix.

@nttstar Hi, thanks for your reply. I set config.ckpt_embedding = False in config.py, but with --pretrained, training still starts from scratch.

wting861006 on 9 May 2019

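@wting861006
If you are running train.py under recognition, you need to change that setting.
When resuming, make sure the lr you restart with matches the lr of the last iteration of the previous run; then you should get roughly the same acc.
If you set a new lr when resuming, the acc will be different.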
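
To work out which lr the previous run had reached, a step-decay schedule can be evaluated by hand. The sketch below is an illustration only: it assumes the lr is multiplied by a fixed factor each time the global step passes a boundary in lr_steps, and the factor of 0.1 is an assumption rather than something stated in this thread. The boundaries are the lr_steps values printed in the log above, and `lr_at_step` is a hypothetical helper.

```python
def lr_at_step(base_lr, lr_steps, global_step, factor=0.1):
    """Learning rate in effect at global_step under a step-decay schedule.

    One decay (multiplication by `factor`) is applied for every boundary in
    lr_steps that global_step has already reached. `factor` is an assumed value.
    """
    passed = sum(1 for boundary in lr_steps if global_step >= boundary)
    return base_lr * (factor ** passed)


if __name__ == "__main__":
    lr_steps = [100000, 160000, 220000]  # from the training log in this thread
    for step in (0, 120000, 230000):
        print(step, lr_at_step(0.05, lr_steps, step))
```

The value computed for the step where training stopped is what would be passed as --lr when restarting, under the assumptions above.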

SueeH on 9 May 2019

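@SueeH Hello, I changed the lr to match the lr of the last iteration of the previous run, but the acc is still 0 and the loss is 45. I found that the result is the same with or without --pretrained.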

wting861006 on 10 May 2019

Set ckpt_embedding=False when training the model you intend to resume from, so that the fc7 weight matrix is saved.

nttstar on 11 May 2019

@wting861006
As nttstar says, you should retrain with ckpt_embedding = False. Try it and let me know the result, thanks.

clhne on 13 May 2019

@wting861006
Check whether the .symbol file you are loading contains the fc7 layer. If it does not, then, as the comment above says, fc7 was not saved.
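
One way to run that check without loading the network is to read the saved symbol file directly. This is a minimal sketch, assuming the symbol file is the usual MXNet JSON graph with a top-level "nodes" list whose entries carry a "name" field. The function names are hypothetical, and matching on the substring "fc7" is a simplification.

```python
import json


def graph_has_layer(graph, layer_name="fc7"):
    """Return True if any node name in a parsed symbol graph contains layer_name."""
    return any(layer_name in node.get("name", "") for node in graph.get("nodes", []))


def symbol_file_has_layer(symbol_json_path, layer_name="fc7"):
    """Parse a <prefix>-symbol.json file and look for layer_name among its nodes."""
    with open(symbol_json_path) as f:
        return graph_has_layer(json.load(f), layer_name)


if __name__ == "__main__":
    # Tiny hand-written graphs standing in for real symbol files.
    embedding_only = {"nodes": [{"op": "null", "name": "data"},
                                {"op": "null", "name": "fc1_gamma"}]}
    with_fc7 = {"nodes": [{"op": "null", "name": "data"},
                          {"op": "null", "name": "fc7_weight"}]}
    print(graph_has_layer(embedding_only), graph_has_layer(with_fc7))
```

If the check returns False for the saved model-symbol.json, the checkpoint holds only the embedding network, which is consistent with training appearing to restart from scratch.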

SueeH on 13 May 2019

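@SueeH @clhne @nttstar Thank you for your helpful replies. The problem is solved, and it was exactly as you said. I suggest adding small details like these to the documentation so that newcomers can avoid detours.
Thanks again!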

wting861006 on 16 May 2019

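@SueeH @clhne @nttstar Thank you for your helpful replies. The problem is solved, and it was exactly as you said. I suggest adding small details like these to the documentation so that newcomers can avoid detours.
Thanks again!

I couldn't agree more with the poster above. In my case, after 100k+ steps of training the loss suddenly blew up to nan, so I needed to change the hyperparameters and resume from a checkpoint. I used pretrained to load the checkpoint, but training seemed to start again from zero. If there were official documentation explaining how to resume, it would make things easier for beginners.

Also, there are a few details I would like to ask about:
① Above, @wting861006 mentioned setting the lr to the lr at the checkpoint. Is that really necessary? My guess was that the checkpoint already stores the lr.
② I noticed that no matter how many times a ckpt (param + symbol) is saved during training, the param file is always -0001.params, unlike a TensorFlow ckpt, which keeps several checkpoints. Does this affect loading the checkpoint?

Thanks for your attention!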

JackWolftoMc on 17 Jul 2019

To set ckpt_embedding=False while training your pre-trained model, so that fc7 weight matrix can be saved.

Hi, sorry to bother you. I noticed that you are one of the authors of InsightFace. It may be silly to ask you so directly, but as a newcomer I have trouble understanding some of the variables in train.py, and you may be able to help me out.

In the training log I noticed these three lines:
[sc_val][120000]XNorm: 30.679875
[sc_val][120000]Accuracy-Flip: 0.00037+-0.00111
[120000]Accuracy-Highest: 0.41181

I'm confused about XNorm, Accuracy-Flip, and Accuracy-Highest. What is "XNorm"? What does "flip" mean? Why "Accuracy-Highest" rather than plain "acc"? I believe these could confuse others like me too. I hope this won't take too much of your time.

Thanks for your attention, any reply will be appreciated!

JackWolftoMc on 17 Jul 2019

❶ I think the .params file stores the lr.
❷ You can change the value of default.ckpt to change the checkpoint-saving strategy:
0: discard saving. 1: save when necessary. 2: always save. 3: save one model, prefix-0001.params
So I guess you set default.ckpt==3.
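
The effect of the four modes on the file that ends up on disk can be sketched as follows. The mode descriptions are taken from the list above. The numbering of files in modes 1 and 2 by an increasing save index is an assumption made for illustration, and `params_filename` is a hypothetical helper, not a function from train.py.

```python
CKPT_MODES = {
    0: "discard saving",
    1: "save when necessary",
    2: "always save",
    3: "save one model, prefix-0001.params",
}


def params_filename(prefix, ckpt_mode, save_index):
    """Name of the .params file a save would write, or None if nothing is saved.

    Mode 3 always reuses index 1, so every save overwrites prefix-0001.params.
    Modes 1 and 2 are assumed (for illustration) to use the given save_index.
    """
    if ckpt_mode not in CKPT_MODES:
        raise ValueError("unknown ckpt mode: %r" % (ckpt_mode,))
    if ckpt_mode == 0:
        return None
    index = 1 if ckpt_mode == 3 else save_index
    return "%s-%04d.params" % (prefix, index)


if __name__ == "__main__":
    for mode in sorted(CKPT_MODES):
        print(mode, CKPT_MODES[mode], params_filename("model", mode, 7))
```

With mode 3 the same file is overwritten on each save, which is why only -0001.params appears however many times the model is saved, and why --pretrained-epoch 1 is the matching value when resuming from it.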

clhne on 17 Jul 2019

You're right, thanks for your advice. I was stubborn: I assumed ckpt had to be set to 3 because sample.config.py sets it that way, and the help text in train.py describes 0, 1, and 2 but not 3. I really appreciate your kind reply.

JackWolftoMc on 17 Jul 2019

@jake221
Hi jake221, I have run into the same problem. Did you solve it? Can you help me?

chenchaohui on 11 Oct 2019

You're right, thanks for your advice. I was stubborn: I assumed ckpt had to be set to 3 because sample.config.py sets it that way, and the help text in train.py describes 0, 1, and 2 but not 3. I really appreciate your kind reply.

How did you solve it? I also set ckpt_embedding to False and set pretrained and pretrained_epoch, but it still trains from scratch. Does .ckpt need to be set?

EdwardVincentMa on 12 Nov 2019

Thanks for the help.

EdwardVincentMa on 12 Nov 2019

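I misspoke: it does not train from scratch. Rather, the loss appears to be nan. When the model was saved the loss was about 20, but when I continue training from the params it becomes nan, and I don't know why.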

EdwardVincentMa on 12 Nov 2019

@EdwardVincentMa
Decrease the learning rate.

clhne on 12 Nov 2019
