Facenet: The accuracy/validation_rate keeps decreasing and drops to 0.5/0 when the learning rate equals 0.0001

Created on 4 Jan 2017 · 8 Comments · Source: davidsandberg/facenet

Hi,
I downloaded the vggface images from their URLs and collected a total of 2.04 million faces after the MTCNN alignment step. I used these images as the training set and trained the network with the method described on the page "Classifier training of Inception-ResNet-v1", with one small parameter change (center_loss_factor set to 5e-5; all other parameters as described on the page).

The training works well in the beginning. The accuracy increases, especially when the learning rate drops. The performance reaches Acc: 0.98, Val: 0.87 at Epoch=76 (the last epoch for lr=0.01). However, the performance did not improve as training continued. Instead, it gradually declined:
Epoch=80: Acc: 0.98, Val: 0.85
Epoch=499: Acc: 0.97, Val: 0.63
Epoch=799: Acc: 0.95, Val: 0.54
Epoch=999: Acc: 0.94, Val: 0.39
At Epoch=1000, when the learning rate became 0.0001, the loss suddenly became nan, with Acc: 0.5, Val: 0.

I wonder why this happened. Is it caused by the regularization or the center loss term?

BTW, I didn't change the epoch_size according to the vggface dataset size. Would that be a problem?
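To make the epoch_size question concrete, here is a rough sketch of how many training "epochs" correspond to one full pass over the data. The 2.04 million image count comes from this post and the 1000 steps per epoch from the training logs later in the thread; the batch size of 90 is only an assumption, since it is not stated anywhere here, and `epochs_per_full_pass` is a hypothetical helper name.

```python
def epochs_per_full_pass(num_images, epoch_size, batch_size):
    """Number of training 'epochs' needed to see every image once."""
    images_per_epoch = epoch_size * batch_size
    return num_images / images_per_epoch

# 2.04 million aligned faces (from the post), 1000 steps per epoch (from the logs).
# batch_size=90 is an assumed value, not stated in this thread.
passes = epochs_per_full_pass(num_images=2_040_000, epoch_size=1000, batch_size=90)
print(round(passes, 1))  # 22.7
```

Under that assumption an "epoch" in the logs is only a fraction of the dataset, so the epoch numbers in the learning rate schedule count fixed-size chunks of steps rather than full passes over the data.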

ugtony

All 8 comments

Hi @ugtony!
Nice to see some results on the VGG Face dataset!
But I must say I'm a bit confused by the problem you are having. I don't think it is very likely that the model would start overfitting with a dataset this large, so the only explanation would be some bug that makes the training do worse at the lower learning rates. But I don't see what problem would cause that behaviour either.
It would be interesting to see whether the performance starts to drop when the learning rate changes, or whether something else causes the degradation to start at epoch 76.

davidsandberg on 8 Jan 2017

I changed the learning_rate_schedule file and ran the training again.
The learning rate is scheduled as

   0: 0.1
   65: 0.01
   77: 0.001
   101: 0.0005
   121: 0.00025
   141: 0.000125
   161: 0.0000625

The performance remained good (Acc: 0.981+-0.006, Val: 0.91479+-0.02202 @ FAR=0.00100) until the 160th epoch finished, then quickly dropped to Acc: 0.485 at the 161st epoch.

Epoch: [160][999/1000]  Time 1.140      Loss 3.087      RegLoss 1.384
Epoch: [160][1000/1000] Time 1.136      Loss 2.806      RegLoss 1.387

Epoch: [161][1/1000]    Time 1.528      Loss 2.640      RegLoss 1.382
Epoch: [161][2/1000]    Time 1.097      Loss nan        RegLoss nan

So I think there are two issues here:
1) Why the performance degrades steadily (though slowly) when the learning rate equals 0.001.
2) Why the performance crashes immediately when the learning rate falls below a certain value (somewhere between 0.0001 and 0.000125).
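To pin down exactly where the loss becomes non-finite, log lines in the format shown above can be scanned automatically. This is only a sketch: the regular expression is written against the log lines quoted in this comment, and `first_non_finite_step` is a hypothetical helper, not part of facenet.

```python
import math
import re

# Pattern matching the log format quoted in this thread, e.g.
# "Epoch: [161][2/1000]    Time 1.097      Loss nan        RegLoss nan"
LOG_LINE = re.compile(r"Epoch: \[(\d+)\]\[(\d+)/(\d+)\]\s+Time\s+\S+\s+Loss\s+(\S+)")

def first_non_finite_step(lines):
    """Return (epoch, step) of the first line whose loss is nan or inf, else None."""
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not training-step logs
        loss = float(match.group(4))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return int(match.group(1)), int(match.group(2))
    return None

log = [
    "Epoch: [160][1000/1000] Time 1.136      Loss 2.806      RegLoss 1.387",
    "Epoch: [161][1/1000]    Time 1.528      Loss 2.640      RegLoss 1.382",
    "Epoch: [161][2/1000]    Time 1.097      Loss nan        RegLoss nan",
]
print(first_non_finite_step(log))  # (161, 2)
```

Comparing the reported epoch with the learning rate schedule shows whether the crash lines up with a schedule boundary, as it does here with the entry for epoch 161.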

ugtony on 10 Jan 2017

Hi @ugtony,
Did you figure this one out? In #124 @mgy89 trained for 1.4M time steps without degrading performance.
Could the problem be related to the learning rate schedule?

davidsandberg on 12 Feb 2017

Sorry, I didn't figure that out. Maybe the problem doesn't exist in the current version. You could temporarily close the issue.

ugtony on 13 Feb 2017

Closing this for now

davidsandberg on 14 Feb 2017

The problem still exists: the accuracy crashes soon after the learning rate becomes 0.0001.

ugtony on 20 Mar 2017

There is a coding error in the facenet.get_learning_rate_from_file function: it won't return a value for epochs equal to or larger than the last entry in the schedule file.

I use this code and it works well:

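def get_learning_rate_from_file(filename, epoch):
    # Fallback used when no schedule entry applies to this epoch.
    learning_rate = 0.1
    with open(filename, 'r') as f:
        for line in f:
            # Drop trailing comments and surrounding whitespace,
            # so that blank lines are skipped instead of failing in int().
            line = line.split('#', 1)[0].strip()
            if line:
                par = line.split(':')
                e = int(par[0])
                lr = float(par[1])
                # Keep the rate of the last entry whose epoch has been reached.
                if e <= epoch:
                    learning_rate = lr
    # Always return a value, even past the last scheduled epoch.
    return learning_rate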
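As a quick check, the sketch below runs a lightly condensed copy of this function against the schedule posted earlier in the thread. For contrast it also includes `lr_returning_only_inside_loop`, a hypothetical variant that only returns from inside the loop. That variant is an illustration of the failure mode described here and is not copied from the facenet source.

```python
import os
import tempfile

def get_learning_rate_from_file(filename, epoch):
    learning_rate = 0.1  # fallback when no entry applies
    with open(filename, 'r') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if line:
                par = line.split(':')
                if int(par[0]) <= epoch:
                    learning_rate = float(par[1])
    return learning_rate

def lr_returning_only_inside_loop(filename, epoch):
    # Hypothetical variant illustrating the described bug; NOT the facenet source.
    learning_rate = None
    with open(filename, 'r') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()
            if line:
                par = line.split(':')
                if int(par[0]) <= epoch:
                    learning_rate = float(par[1])
                else:
                    return learning_rate
    # Falls off the end here: implicitly returns None past the last entry.

# Schedule taken from the second training run in this thread.
schedule = ("0: 0.1\n65: 0.01\n77: 0.001\n101: 0.0005\n"
            "121: 0.00025\n141: 0.000125\n161: 0.0000625\n")
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(schedule)
path = tmp.name

results = {}
for epoch in (0, 100, 160, 161, 500):
    results[epoch] = (get_learning_rate_from_file(path, epoch),
                      lr_returning_only_inside_loop(path, epoch))
    print(epoch, *results[epoch])
os.remove(path)
```

With this schedule the two functions agree up to epoch 160, and the hypothetical variant yields None from epoch 161 onward, which is the epoch at which the loss became nan in the second run.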

antoniosimunovic on 3 May 2017

👍4

I think @antoniosimunovic is right, thanks!

ugtony on 3 May 2017
