Hi,
I've downloaded the vggface images from the URLs and collected a total of 2.04 million faces after the MTCNN alignment step. I used the vggface images as the training set and trained the network with the method described on the page Classifier training of Inception-ResNet-v1, with one small parameter modification (center_loss_factor set to 5e-5, all other parameters as described on the page).
The training works well in the beginning. The accuracy increases, especially when the learning rate drops. The performance reaches Acc: 0.98, Val: 0.87 at Epoch=76 (the last epoch for lr=0.01). However, the performance didn't improve as training continued. Instead, it gradually declined:
Epoch=80: Acc: 0.98, Val: 0.85
Epoch=499: Acc: 0.97, Val: 0.63
Epoch=799: Acc: 0.95, Val: 0.54
Epoch=999: Acc: 0.94, Val: 0.39
At Epoch=1000, with learning rate=0.0001, the loss suddenly became nan, with Acc: 0.5 and Val: 0.
I wonder why this happened. Is it caused by the regularization or the center loss term?
BTW, I didn't change the epoch_size according to the vggface dataset size. Would that be a problem?
Hi @ugtony!
Nice to see some results on the VGG Face dataset!
But I must say I'm a bit confused by the problem you are having. I don't see it as very likely that the model should start overfitting with this large dataset, so the only explanation would be some bug causing the training to do worse with the lower learning rates. But I don't see what problem would cause that behaviour either.
It would be interesting to see whether the performance starts to drop when the learning rate changes, or whether something else causes the degradation to start at epoch 76.
I changed the learning_rate_schedule file and ran the training again.
The learning rate is scheduled as
0: 0.1
65: 0.01
77: 0.001
101:0.0005
121:0.00025
141:0.000125
161:0.0000625
The performance remained good (Acc: 0.981+-0.006, Val: 0.91479+-0.02202 @ FAR=0.00100) through the end of epoch 160, then abruptly dropped to Acc: 0.485 at epoch 161.
Epoch: [160][999/1000] Time 1.140 Loss 3.087 RegLoss 1.384
Epoch: [160][1000/1000] Time 1.136 Loss 2.806 RegLoss 1.387
Epoch: [161][1/1000] Time 1.528 Loss 2.640 RegLoss 1.382
Epoch: [161][2/1000] Time 1.097 Loss nan RegLoss nan
So I think there are two issues here:
1) Why does the performance degrade steadily (though slowly) when the learning rate equals 0.001?
2) Why does the performance crash immediately when the learning rate drops below a certain value (somewhere between 0.0001 and 0.000125)?
Hi @ugtony,
Did you figure this one out? In #124 @mgy89 trained for 1.4M time steps without degrading performance.
Could the problem be related to the learning rate schedule?
Sorry, I didn't figure that out. Maybe the problem doesn't exist in the current version. You could temporarily close the issue.
Closing this for now
The problem still exists: the accuracy crashes soon after the learning rate becomes 0.0001.
There is a coding error in the facenet.get_learning_rate_from_file function: it won't return a value for epochs equal to or larger than the last entry in the schedule file.
I use this code and it works well:
def get_learning_rate_from_file(filename, epoch):
    learning_rate = 0.1
    with open(filename, 'r') as f:
        for line in f.readlines():
            line = line.split('#', 1)[0]
            if line:
                par = line.strip().split(':')
                e = int(par[0])
                lr = float(par[1])
                if e <= epoch:
                    learning_rate = lr
    return learning_rate
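For anyone who wants to check this locally, here is a minimal sketch that runs the schedule from the second experiment above through two lookups. The first function, lr_returning_only_inside_loop, is a hypothetical reconstruction of the failure mode described here (a return that is only reached when a later schedule entry exists); it is an assumption for illustration, not a copy of the library code. The second follows the same logic as the fix above. The helper write_schedule and both function names are made up for this example.

```python
import os
import tempfile

# Schedule from the second run in this thread ("epoch: learning_rate").
SCHEDULE = """\
0: 0.1
65: 0.01
77: 0.001
101:0.0005
121:0.00025
141:0.000125
161:0.0000625
"""


def lr_returning_only_inside_loop(filename, epoch):
    # ASSUMPTION: hypothetical reconstruction of the described bug.
    # The only return sits in the else branch, so once epoch reaches the
    # last entry the loop finishes and the function returns None.
    with open(filename, 'r') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if line:
                par = line.split(':')
                if int(par[0]) <= epoch:
                    learning_rate = float(par[1])
                else:
                    return learning_rate


def lr_with_final_return(filename, epoch):
    # Same idea as the fix above: keep the last entry whose epoch is
    # <= the current epoch and return it after the whole file is read.
    learning_rate = 0.1
    with open(filename, 'r') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if line:
                par = line.split(':')
                if int(par[0]) <= epoch:
                    learning_rate = float(par[1])
    return learning_rate


def write_schedule(text):
    # Hypothetical helper: write the schedule to a temporary file.
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as f:
        f.write(text)
    return path


path = write_schedule(SCHEDULE)
for epoch in (0, 76, 160, 161, 500):
    print(epoch,
          lr_returning_only_inside_loop(path, epoch),
          lr_with_final_return(path, epoch))
os.remove(path)
```

With this schedule both functions agree up to epoch 160. From epoch 161 on, the first one returns None while the second keeps returning 6.25e-05, which lines up with the crash appearing exactly at the last schedule entry in the second run.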
I think @antoniosimunovic is right, thanks!