Insightface: vargfacenet + arcloss infinite loss

Created on 16 Nov 2019 · 9 Comments · Source: deepinsight/insightface

I am trying to retrain vargfacenet as in the LFR Challenge, but training diverges immediately: the log reports lossvalue=nan from the very first batches.
Can someone please help?
For reference, I am training with a smaller batch_size and only one GPU, but I doubt that is the problem.
Log:

CUDA_VISIBLE_DEVICES='0' python -u train.py --network vargfacenet --loss arcface --dataset retina
gpu num: 1
prefix ./models/vargfacenet-arcface-retina/model
image_size [112, 112]
num_classes 93431
Called with argument: Namespace(batch_size=32, ckpt=3, ctx_num=1, dataset='retina', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='vargfacenet', per_batch_size=32, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'J', 'net_multiplier': 1.25, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'vargfacenet', 'dataset': 'retina', 'dataset_path': '../datasets/ms1m-retinaface-t1', 'num_classes': 93431, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'vargfacenet', 'num_workers': 1, 'batch_size': 32, 'per_batch_size': 32}
Network FLOPs: 1.0G
INFO:root:loading recordio ../datasets/ms1m-retinaface-t1/train.rec...
header0 label [5179511. 5272942.]
id2range 93431
5179510
rand_mirror True
[13:21:19] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
/home/vdx/csenv/lib/python3.7/site-packages/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.03125). Is this intended?
  optimizer_params=optimizer_params)
[13:22:01] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20] Speed: 62.45 samples/sec    acc=0.000000    lossvalue=nan
INFO:root:Epoch[0] Batch [20-40]    Speed: 61.56 samples/sec    acc=0.000000    lossvalue=nan
INFO:root:Epoch[0] Batch [40-60]    Speed: 58.71 samples/sec    acc=0.000000    lossvalue=nan
INFO:root:Epoch[0] Batch [60-80]    Speed: 44.77 samples/sec    acc=0.000000    lossvalue=nan
INFO:root:Epoch[0] Batch [80-100]   Speed: 74.33 samples/sec    acc=0.000000    lossvalue=nan
INFO:root:Epoch[0] Batch [100-120]  Speed: 26.56 samples/sec    acc=0.000000    lossvalue=nan

doxuanviet1996
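
When a run diverges like this, it can help to locate the first batch range whose loss is not finite. The following is a minimal sketch that scans log lines in the format shown above; the function name `first_non_finite_loss` and the regular expression are illustrative assumptions, not part of insightface or MXNet.

```python
import math
import re

# Matches lines such as:
# INFO:root:Epoch[0] Batch [0-20] Speed: ... acc=0.000000    lossvalue=nan
LOSS_PATTERN = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)-(\d+)\].*?lossvalue=(\S+)"
)


def first_non_finite_loss(log_lines):
    """Return (epoch, batch_start, batch_end) for the first NaN/inf loss, or None."""
    for line in log_lines:
        match = LOSS_PATTERN.search(line)
        if match is None:
            continue  # not a training-progress line
        epoch, start, end, loss_text = match.groups()
        try:
            loss = float(loss_text)  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # unparsable loss field, skip it
        if not math.isfinite(loss):
            return int(epoch), int(start), int(end)
    return None


sample_log = [
    "INFO:root:loading recordio ../datasets/ms1m-retinaface-t1/train.rec...",
    "INFO:root:Epoch[0] Batch [0-20] Speed: 62.45 samples/sec    acc=0.000000    lossvalue=nan",
    "INFO:root:Epoch[0] Batch [20-40]    Speed: 61.56 samples/sec    acc=0.000000    lossvalue=nan",
]
print(first_non_finite_loss(sample_log))
```

For the log in this post, the first reported range (batches 0-20 of epoch 0) is already NaN, which suggests the loss diverged within the first few updates rather than later in training.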

All 9 comments

@doxuanviet1996
Maybe you can decrease the learning rate, e.g. to 0.01.

clhne on 18 Nov 2019

Update: Decreasing the learning rate does solve the issue. Thanks!

doxuanviet1996 on 24 Nov 2019
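
For context, the log shows batch_size=32 with lr=0.1, and MXNet warns that rescale_grad is 1.0 rather than 1.0/batch_size/num_workers (0.03125). A common heuristic, which is not stated anywhere in this thread, is to scale the learning rate linearly with the total batch size. The sketch below only illustrates that arithmetic. The helper names are hypothetical, and the reference batch size of 512 is an assumed value for illustration, not a documented default of the repository.

```python
def total_batch_size(per_batch_size, ctx_num):
    """Total batch size across devices: per-device batch size times device count."""
    return per_batch_size * ctx_num


def normalized_rescale_grad(batch_size, num_workers=1):
    """The value the MXNet warning compares against: 1.0 / batch_size / num_workers."""
    return 1.0 / batch_size / num_workers


def linearly_scaled_lr(reference_lr, reference_batch_size, batch_size):
    """Linear scaling heuristic: keep lr / batch_size constant."""
    return reference_lr * batch_size / reference_batch_size


# Values taken from the log: per_batch_size=32, ctx_num=1, lr=0.1.
batch_size = total_batch_size(per_batch_size=32, ctx_num=1)
print(normalized_rescale_grad(batch_size))  # 0.03125, as in the warning

# ASSUMPTION: lr=0.1 was tuned for a total batch size of 512 (hypothetical).
print(linearly_scaled_lr(reference_lr=0.1, reference_batch_size=512,
                         batch_size=batch_size))
```

Under that assumed reference batch size, the scaled rate comes out somewhat below 0.01, which is at least consistent with the lower learning rate being stable here. Treat it as a starting point for experimentation rather than a rule.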

@doxuanviet1996 Did you successfully train vargfacenet? The accuracy of my model is very low. I would like to know your settings.

chenghan1995 on 3 Dec 2019

@doxuanviet1996 Did you successfully train vargfacenet? The accuracy of my model is very low. I would like to know your settings.

The training accuracy is always low, but the validation/testing accuracy is similar to the one reported in the paper. I ended up with ~0.97 after 1 week.
I used the default settings, except that I set per_batch_size to 64. I have 2 GPUs, so the total batch size is 128.

doxuanviet1996 on 3 Dec 2019

Thank you for your quick reply. I will try it right away.

chenghan1995 on 3 Dec 2019

@doxuanviet1996 Another question: how can I get the FR dataset ms1m-retinaface-t1? I only have faces_emore and faces_ms1m_112x112.

chenghan1995 on 3 Dec 2019

Go to the LFR Challenge page; it links to the retina dataset.

doxuanviet1996 on 3 Dec 2019

The training accuracy is always low, but the validation/testing accuracy is similar to the one reported in the paper. I ended up with ~0.97 after 1 week.
I used the default settings, except that I set per_batch_size to 64. I have 2 GPUs, so the total batch size is 128.

Hello! Could you possibly share your final model weights for vargfacenet?