Facenet: How to train NN3

Created on 7 May 2017  ·  9 Comments  ·  Source: davidsandberg/facenet

Two problems:
1) The loss becomes NaN. What should I set in the learning rate schedule?
I even tried this:

Learning rate schedule

Maps an epoch number to a learning rate

0: 0.0001
100: 0.00001
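
For reference, a schedule in this format can be read as a step function: the rate in effect at a given epoch is the one on the last line whose epoch number is less than or equal to it. The following is a minimal sketch of that lookup, not the repository's own parser. The function name `lookup_learning_rate` and the handling of `#` comments are assumptions made for illustration.

```python
def lookup_learning_rate(schedule_text, epoch):
    """Return the learning rate in effect at `epoch`, or None if no entry applies.

    `schedule_text` holds lines of the form "<epoch>: <learning rate>".
    Hypothetical helper for illustration; not part of facenet.
    """
    entries = []
    for raw_line in schedule_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        start_epoch, value = line.split(":", 1)
        entries.append((int(start_epoch), float(value)))

    rate = None
    for start_epoch, value in sorted(entries):
        if start_epoch <= epoch:
            rate = value  # keep the latest entry that has already started
    return rate


schedule = """
0: 0.0001
100: 0.00001
"""
print(lookup_learning_rate(schedule, 0))
print(lookup_learning_rate(schedule, 150))
```

Read this way, a run limited by --max_nrof_epochs 80 would stay on the first rate for its whole length, because the entry for epoch 100 is never reached.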

I still get:
Epoch: [0][94/1000] Time 0.875 Loss 11.982 RegLoss 1.697
Epoch: [0][95/1000] Time 0.861 Loss 11.341 RegLoss 1.698
Epoch: [0][96/1000] Time 0.859 Loss 11.224 RegLoss 1.700
Epoch: [0][97/1000] Time 0.859 Loss nan RegLoss nan
Epoch: [0][98/1000] Time 0.875 Loss nan RegLoss nan
Epoch: [0][99/1000] Time 0.860 Loss nan RegLoss nan
Epoch: [0][100/1000] Time 0.859 Loss nan RegLoss nan
Epoch: [0][101/1000] Time 0.859 Loss nan RegLoss nan
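
When a log is long, it can help to locate the first batch where the loss turns into NaN, so the last good checkpoint and the offending step are known. A small sketch that scans log text in the format shown above; the helper name `first_nan_step` is hypothetical and the regular expression only assumes the `Epoch: [e][b/n] Time t Loss x RegLoss y` layout of these lines.

```python
import math
import re

# Matches lines such as "Epoch: [0][97/1000] Time 0.859 Loss nan RegLoss nan"
LOG_LINE = re.compile(
    r"Epoch:\s*\[(\d+)\]\[(\d+)/(\d+)\]\s+Time\s+(\S+)\s+Loss\s+(\S+)\s+RegLoss\s+(\S+)"
)


def first_nan_step(log_text):
    """Return (epoch, batch) of the first line whose Loss is NaN, or None."""
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a training progress line
        epoch, batch = int(match.group(1)), int(match.group(2))
        loss = float(match.group(5))  # float("nan") parses to NaN
        if math.isnan(loss):
            return epoch, batch
    return None


log = """
Epoch: [0][96/1000] Time 0.859 Loss 11.224 RegLoss 1.700
Epoch: [0][97/1000] Time 0.859 Loss nan RegLoss nan
Epoch: [0][98/1000] Time 0.875 Loss nan RegLoss nan
"""
print(first_nan_step(log))
```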

My shell command:
python facenet_master/src/facenet_train_classifier.py --logs_base_dir logs/ --models_base_dir models/ --data_dir dataset/CASIA-WebFace/casia_maxpy_mtcnnpy_182 --image_size 160 --model_def models.nn3 --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 80 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file facenet_master/data/learning_rate_schedule_nn3.txt --weight_decay 5e-5 --center_loss_factor 1e-4 --center_loss_alfa 0.9 --batch_size 10

2) ValueError: Cannot have number of splits n_splits=10 greater than the number of samples: 0

When I run facenet_train_classifier.py with "--lfw_dir" such as:
python facenet_master/src/facenet_train_classifier.py --logs_base_dir logs/ --models_base_dir models/ --data_dir dataset/CASIA-WebFace/casia_maxpy_mtcnnpy_182 --image_size 160 --model_def models.nn3 --lfw_dir dataset/lfw/lfw_mtcnnalign_160 --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 80 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file facenet_master/data/learning_rate_schedule_nn3.txt --weight_decay 5e-5 --center_loss_factor 1e-4 --center_loss_alfa 0.9 --batch_size 10

I get this:
Saving variables
Variables saved in 3.32 seconds
Saving metagraph
Metagraph saved in 27.87 seconds
Runnning forward pass on LFW images
Traceback (most recent call last):
  File "facenet_master/src/facenet_train_classifier.py", line 468, in <module>
    main(parse_arguments(sys.argv[1:]))
  File "facenet_master/src/facenet_train_classifier.py", line 244, in main
    embeddings, label_batch, lfw_paths, actual_issame, args.lfw_batch_size, args.lfw_nrof_folds, log_dir, step, summary_writer)
  File "facenet_master/src/facenet_train_classifier.py", line 344, in evaluate
    _, _, accuracy, val, val_std, far = lfw.evaluate(emb_array, actual_issame, nrof_folds=nrof_folds)
  File "FaceNetfacenet_mastersrclfw.py", line 40, in evaluate
    np.asarray(actual_issame), nrof_folds=nrof_folds)
  File "FaceNetfacenet_mastersrcfacenet.py", line 405, in calculate_roc
    for fold_idx, (train_set, test_set) in enumerate(k_fold.split(indices)):
  File "env-35libsite-packagessklearnmodel_selection_split.py", line 320, in split
    n_samples))
ValueError: Cannot have number of splits n_splits=10 greater than the number of samples: 0.

I think this is related to validate_on_lfw.py.
But when I try this:
python facenet_master/src/validate_on_lfw.py dataset/lfw/lfw_mtcnnpy_160 models/20170507-111919
(the model directory holds the files saved by facenet_train_classifier.py), it runs fine.

All 9 comments

For the second question, probably because the lfw images are not loaded correctly. Please check your lfw path.

Yes! It should be
--lfw_dir dataset/lfw/lfw_mtcnnpy_160
not
--lfw_dir dataset/lfw/lfw_mtcnnalign_160

Could davidsandberg please show how to train NN2/NN3/NN4/NNS1/NNS2? It does not work! They all end up with a NaN loss.

Hi,
Since I don't have time to support these models, they will most likely be removed in the not too distant future.
However, I tried to run NN3 and NN4 for a short time and I didn't get any NaNs for the loss.
For NN3 I ran
--data_dir ~/datasets/casia/casia_maxpy_mtcnnalign_182_160 --image_size 160 --model_def models.nn3 --weight_decay 1e-4 --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 80 --keep_probability 0.8 --random_crop --random_flip --random_rotate --learning_rate_schedule_file ../data/learning_rate_schedule_classifier_casia.txt --center_loss_factor 2e-4

The output I got was

Epoch: [0][1/1000]  Time 4.514  Loss 15.115 RegLoss 5.505
Epoch: [0][2/1000]  Time 2.076  Loss 15.045 RegLoss 5.471
Epoch: [0][3/1000]  Time 0.646  Loss 15.040 RegLoss 5.439
Epoch: [0][4/1000]  Time 0.628  Loss 15.058 RegLoss 5.413
Epoch: [0][5/1000]  Time 0.288  Loss 15.215 RegLoss 5.383
Epoch: [0][6/1000]  Time 0.353  Loss 15.068 RegLoss 5.309
Epoch: [0][7/1000]  Time 0.672  Loss 14.896 RegLoss 5.314
Epoch: [0][8/1000]  Time 0.360  Loss 14.830 RegLoss 5.203
Epoch: [0][9/1000]  Time 0.638  Loss 14.776 RegLoss 5.164

And this is what it looked like after the first epoch:

Epoch: [1][293/1000]    Time 0.362  Loss 11.131 RegLoss 2.576
Epoch: [1][294/1000]    Time 0.371  Loss 11.157 RegLoss 2.583
Epoch: [1][295/1000]    Time 0.369  Loss 11.109 RegLoss 2.576

Did you manage to run e.g. inception_resnet_v1 successfully?

It works!

The output I got was:
...
Epoch: [2][201/1000] Time 0.860 Loss 14.678 RegLoss 2.978
Epoch: [2][202/1000] Time 0.875 Loss 14.678 RegLoss 2.964
Epoch: [2][203/1000] Time 0.891 Loss 14.296 RegLoss 2.969
Epoch: [2][204/1000] Time 0.875 Loss 15.092 RegLoss 2.968
Epoch: [2][205/1000] Time 0.906 Loss 15.252 RegLoss 2.979
Epoch: [2][206/1000] Time 0.876 Loss 15.805 RegLoss 2.971
Epoch: [2][207/1000] Time 0.891 Loss 15.612 RegLoss 2.974
Epoch: [2][208/1000] Time 0.877 Loss 16.021 RegLoss 2.987
Epoch: [2][209/1000] Time 0.891 Loss 15.348 RegLoss 2.960
Epoch: [2][210/1000] Time 0.875 Loss 14.565 RegLoss 2.969
Epoch: [2][211/1000] Time 0.907 Loss 14.791 RegLoss 2.972
Epoch: [2][212/1000] Time 0.922 Loss 14.451 RegLoss 2.974
Epoch: [2][213/1000] Time 0.928 Loss 14.207 RegLoss 2.982
Epoch: [2][214/1000] Time 0.922 Loss 15.852 RegLoss 2.972
Epoch: [2][215/1000] Time 0.891 Loss 15.093 RegLoss 2.979
Epoch: [2][216/1000] Time 0.923 Loss 15.296 RegLoss 2.973
Epoch: [2][217/1000] Time 0.891 Loss 13.584 RegLoss 2.973
Epoch: [2][218/1000] Time 0.876 Loss 13.881 RegLoss 2.968
Epoch: [2][219/1000] Time 0.891 Loss 14.799 RegLoss 2.965
Epoch: [2][220/1000] Time 0.891 Loss 15.481 RegLoss 2.973
Epoch: [2][221/1000] Time 0.891 Loss 14.960 RegLoss 2.971

Great!

NaN again!
Epoch: [27][359/1000] Time 0.938 Loss 8.829 RegLoss 0.213
Epoch: [27][360/1000] Time 0.937 Loss 8.820 RegLoss 0.213
Epoch: [27][361/1000] Time 0.938 Loss 9.171 RegLoss 0.213
Epoch: [27][362/1000] Time 0.955 Loss 8.713 RegLoss 0.213
Epoch: [27][363/1000] Time 0.922 Loss 8.978 RegLoss 0.213
Epoch: [27][364/1000] Time 0.923 Loss 9.253 RegLoss 0.213
Epoch: [27][365/1000] Time 0.906 Loss nan RegLoss nan
Epoch: [27][366/1000] Time 0.916 Loss nan RegLoss nan
Epoch: [27][367/1000] Time 0.906 Loss nan RegLoss nan
Epoch: [27][368/1000] Time 0.922 Loss nan RegLoss nan
Epoch: [27][369/1000] Time 0.892 Loss nan RegLoss nan
Epoch: [27][370/1000] Time 0.875 Loss nan RegLoss nan
Epoch: [27][371/1000] Time 0.922 Loss nan RegLoss nan
Epoch: [27][372/1000] Time 0.929 Loss nan RegLoss nan
Epoch: [27][373/1000] Time 0.906 Loss nan RegLoss nan
Epoch: [27][374/1000] Time 0.988 Loss nan RegLoss nan
Epoch: [27][375/1000] Time 0.938 Loss nan RegLoss nan
Epoch: [27][376/1000] Time 0.922 Loss nan RegLoss nan
Epoch: [27][377/1000] Time 0.922 Loss nan RegLoss nan
Epoch: [27][378/1000] Time 0.906 Loss nan RegLoss nan

I edited "learning_rate_schedule_classifier_casia.txt",
but it does not seem to help.

According to:
https://q-a-assistant.info/computer-internet-technology/valueerror-cannot-have-number-of-splits-n-splits-10-greater-than-the-number-of-samples-0/1299682

It could be a png/jpg mismatch. If validate_on_lfw.py expects png and your dataset is jpg, it will fail to load the images and cause this error.
Adding the parameter "--lfw_file_ext jpg" can solve the issue.
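
Both explanations in this thread (a wrong LFW path and a wrong file extension) would show the same symptom: no image files are found, so the evaluation ends up with zero samples. A quick way to check before training is to count the files with the expected extension under the directory. This is a minimal sketch using only the standard library; `count_images` is a hypothetical helper, not a facenet function, and the directory and file names in the demonstration are made up.

```python
import os
import tempfile


def count_images(lfw_dir, file_ext):
    """Count files under `lfw_dir` whose extension matches `file_ext` (case-insensitive)."""
    suffix = "." + file_ext.lower().lstrip(".")
    total = 0
    # os.walk yields nothing for a missing directory, so a wrong path counts as 0.
    for _root, _dirs, files in os.walk(lfw_dir):
        total += sum(1 for name in files if name.lower().endswith(suffix))
    return total


# Self-contained demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    person_dir = os.path.join(demo_dir, "Some_Person")
    os.makedirs(person_dir)
    for name in ("Some_Person_0001.jpg", "Some_Person_0002.jpg"):
        open(os.path.join(person_dir, name), "wb").close()
    png_count = count_images(demo_dir, "png")
    jpg_count = count_images(demo_dir, "jpg")
    missing_count = count_images(os.path.join(demo_dir, "no_such_dir"), "png")

print(png_count, jpg_count, missing_count)
```

A count of zero for the extension the script is told to use, or for the directory passed as --lfw_dir, points at the configuration rather than at the model.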

Dear All,
Thanks David for sharing this great work!

I also encountered loss = nan.

There were two such scenarios:

  1. I did not specify an LFW data dir
  2. I specified an LFW data dir that was not aligned the same way as the training data

After I found and corrected these LFW-related problems (mismatched or unspecified directory), everything went back to normal.

But I think I may still need to trace the code to see what exactly caused loss = nan.

BR,
JimmyYS
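
One way to start tracing a NaN loss is to stop at the first non-finite value instead of letting training run on, so the batch and the most recent checkpoint can be inspected. The sketch below shows such a guard in plain Python. The names `NonFiniteLossError` and `check_finite` are hypothetical and not part of facenet, and where to call the guard inside the training loop is left open.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when a monitored value is NaN or infinite (hypothetical helper)."""


def check_finite(epoch, batch, **named_values):
    """Raise NonFiniteLossError if any of the named values is NaN or infinite."""
    bad = sorted(name for name, value in named_values.items()
                 if not math.isfinite(value))
    if bad:
        raise NonFiniteLossError(
            "Non-finite %s at epoch %d, batch %d" % (", ".join(bad), epoch, batch)
        )


# Values taken from the log lines quoted earlier in this thread.
check_finite(27, 364, loss=9.253, reg_loss=0.213)  # passes silently

try:
    check_finite(27, 365, loss=float("nan"), reg_loss=float("nan"))
except NonFiniteLossError as err:
    message = str(err)
    print(message)
```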
