I am trying to retrain vargfacenet as in the LFR Challenge, but the loss becomes NaN immediately (see the log below).
Can someone please help?
For information, I am training with a smaller batch_size and only one GPU, but I doubt that is the problem.
Log:
CUDA_VISIBLE_DEVICES='0' python -u train.py --network vargfacenet --loss arcface --dataset retina
gpu num: 1
prefix ./models/vargfacenet-arcface-retina/model
image_size [112, 112]
num_classes 93431
Called with argument: Namespace(batch_size=32, ckpt=3, ctx_num=1, dataset='retina', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='vargfacenet', per_batch_size=32, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'J', 'net_multiplier': 1.25, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'vargfacenet', 'dataset': 'retina', 'dataset_path': '../datasets/ms1m-retinaface-t1', 'num_classes': 93431, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'vargfacenet', 'num_workers': 1, 'batch_size': 32, 'per_batch_size': 32}
Network FLOPs: 1.0G
INFO:root:loading recordio ../datasets/ms1m-retinaface-t1/train.rec...
header0 label [5179511. 5272942.]
id2range 93431
5179510
rand_mirror True
[13:21:19] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
/home/vdx/csenv/lib/python3.7/site-packages/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.03125). Is this intended?
optimizer_params=optimizer_params)
[13:22:01] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20] Speed: 62.45 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[0] Batch [20-40] Speed: 61.56 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[0] Batch [40-60] Speed: 58.71 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[0] Batch [60-80] Speed: 44.77 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[0] Batch [80-100] Speed: 74.33 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[0] Batch [100-120] Speed: 26.56 samples/sec acc=0.000000 lossvalue=nan
@doxuanviet1996
Maybe you can decrease the learning rate, e.g. to 0.01.
Update: Decreasing the learning rate does solve the issue. Thanks!
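For readers hitting the same NaN with a small batch: one common heuristic (not something confirmed in this thread) is to scale the learning rate linearly with the total batch size. The sketch below is only an illustration. The reference values `REF_LR` and `REF_BATCH` are hypothetical assumptions, not the repository's defaults, and `scaled_lr` is a helper name made up for this example.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: learning rate proportional to total batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Hypothetical reference configuration (assumption, not taken from the repo).
REF_LR = 0.1
REF_BATCH = 512

# Total batch size 32 on a single GPU, as in the log above.
lr_single_gpu = scaled_lr(REF_LR, REF_BATCH, 32)
print(lr_single_gpu)
```

Under these assumed reference values the heuristic suggests a learning rate well below the 0.1 shown in the log, which is at least consistent with a lower value such as 0.01 resolving the NaN here.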
@doxuanviet1996 Did you successfully train vargfacenet? The accuracy of my model is very low. I would like to know your settings.
The training accuracy stays low throughout, but the validation/test accuracy is similar to the one reported in the paper. I ended up with ~0.97 after one week of training.
I used the default settings, except for per_batch_size, which I set to 64. I have 2 GPUs, so the total batch size is 128.
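The batch-size arithmetic in this thread can be sketched as follows. The relation total batch size = per_batch_size x number of GPUs matches the logged values (per_batch_size=32, ctx_num=1, batch_size=32), and the normalized rescale_grad follows the formula quoted in the MXNet warning in the log (1.0/batch_size/num_workers). Whether train.py computes these values in exactly this way is an assumption, and the function names are made up for illustration.

```python
def total_batch_size(per_batch_size, ctx_num):
    """Total batch size across all GPUs (ctx_num is the GPU count)."""
    return per_batch_size * ctx_num


def normalized_rescale_grad(batch_size, num_workers=1):
    """Gradient rescale factor as quoted in the MXNet warning in the log."""
    return 1.0 / batch_size / num_workers


# Setup from the original log: 1 GPU, per_batch_size=32.
print(total_batch_size(32, 1), normalized_rescale_grad(32))

# Setup from the reply: 2 GPUs, per_batch_size=64.
print(total_batch_size(64, 2), normalized_rescale_grad(128))
```

The first case reproduces the 0.03125 value reported in the warning.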
Thank you for your instant reply. I will try immediately.
@doxuanviet1996 Another question: how can I get the FR dataset ms1m-retinaface-t1? I only have faces_emore and faces_ms1m_112x112.
Go to the LFR Challenge page; it links to the retina dataset.
Hello! Could you possibly share your final model weights for vargfacenet?
@doxuanviet1996
Can you share the final models?