Facenet: PCA, hierarchical clustering and k nearest neighbors

Created on 1 Feb 2017 · 10 Comments · Source: davidsandberg/facenet

Hi @davidsandberg,

First of all congratulations for your excellent results. The last model you've uploaded is simply amazing. Thanks for sharing it!!

I wanted to share with you how I am using the features generated by the model to build datasets, cluster them and cross-check new images.

First I am using PCA to reduce from 1792 features to 128, as those seem to be sufficient to compare images accurately and all subsequent calculations are faster:
from sklearn.decomposition import PCA
pcafacenet_ = PCA(n_components=128)
pcafacenet_.fit(embfacenet_)
embfacenet = pcafacenet_.transform(embfacenet_)
The fitted PCA object can be persisted using sklearn.externals.joblib.dump and reloaded later, so the construction and fit steps can be skipped in subsequent runs.
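
A minimal sketch of that persist-and-reload step, under some assumptions: the embeddings are random stand-in data, the file name is hypothetical, and the standalone joblib package is used because recent scikit-learn releases no longer ship sklearn.externals.joblib.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real 1792-d embeddings (random data, for illustration only).
rng = np.random.RandomState(0)
emb_full = rng.randn(500, 1792)

# First run: fit the PCA once and save the fitted object.
pca = PCA(n_components=128)
pca.fit(emb_full)
pca_path = os.path.join(tempfile.mkdtemp(), "pca_facenet.joblib")  # hypothetical file name
joblib.dump(pca, pca_path)

# Later runs: reload the fitted object and only call transform.
pca_loaded = joblib.load(pca_path)
emb_reduced = pca_loaded.transform(emb_full)
print(emb_reduced.shape)
```

New images have to go through the same reloaded PCA object before they are compared with the stored 128-d embeddings.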

Then for clustering I use:
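from scipy.spatial.distance import pdist
import fastcluster
import scipy.cluster.hierarchy as hcluster
# Pairwise distances on the PCA-reduced embeddings, then single-linkage clustering.
Y = pdist(embfacenet, 'euclidean')
Z = fastcluster.single(Y)
# threshold: maximum distance at which clusters are still merged (chosen by hand).
labels = hcluster.fcluster(Z, threshold, criterion="distance")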
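
To see what the threshold does, here is a small self-contained sketch on synthetic data. It uses scipy's own single-linkage routine in place of fastcluster.single, and the three blobs and the threshold value of 1.0 are made up for illustration, not taken from the thread.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)

# Three synthetic "identities": five tightly grouped 128-d points per centre.
centres = rng.randn(3, 128)
points = np.vstack([c + 0.01 * rng.randn(5, 128) for c in centres])

Y = pdist(points, 'euclidean')            # condensed pairwise distance matrix
Z = hcluster.linkage(Y, method='single')  # single-linkage dendrogram
threshold = 1.0                           # hypothetical cut-off distance
labels = hcluster.fcluster(Z, threshold, criterion='distance')

print(labels)
```

With criterion='distance', points end up in the same cluster when they are connected through a chain of neighbours that are each closer than the threshold, so the value needs tuning on real embeddings.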

Finally, for comparing a new image with the dataset, I use:
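from sklearn.neighbors import NearestNeighbors
# nn: number of neighbours to return for each new image.
nnfacenet_ = NearestNeighbors(n_neighbors=nn)
nnfacenet_.fit(embfacenet)
# newembeddings must be transformed with the same fitted PCA as the dataset.
dist, kneighbors = nnfacenet_.kneighbors(newembeddings)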
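
One way to turn the returned distances and indices into a decision is sketched below. The gallery, the identity names and the max_dist cut-off are all hypothetical; the thread does not say how matches are accepted or rejected.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Synthetic gallery: four 128-d embeddings for each of three made-up identities.
centres = rng.randn(3, 128)
gallery = np.vstack([c + 0.01 * rng.randn(4, 128) for c in centres])
gallery_ids = np.repeat(np.array(["alice", "bob", "carol"]), 4)

nn_index = NearestNeighbors(n_neighbors=3)
nn_index.fit(gallery)

# Two queries: one close to "bob", one far away from every gallery face.
queries = np.vstack([centres[1] + 0.01 * rng.randn(128),
                     10.0 + rng.randn(128)])
dist, idx = nn_index.kneighbors(queries)  # each row is sorted by increasing distance

max_dist = 1.0  # hypothetical acceptance threshold
predicted = []
for d, i in zip(dist, idx):
    # Accept the closest gallery identity only if it is near enough.
    nearest_id = str(gallery_ids[i[0]]) if d[0] <= max_dist else "unknown"
    predicted.append(nearest_id)

print(predicted)
```

Without some distance cut-off, a face that is not in the dataset would still be assigned its nearest neighbour's identity.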

Does this make sense to you?

I know that you are also investing quite some effort in training a classifier. What is the advantage of doing that compared to using k nearest neighbors to look for similar faces?

dleonsanchez

Most helpful comment

Hi @dleonsanchez, @ugtony!
I added a 128-d bottleneck before the classification layer and the results look good.
I'm still running training on CASIA to tune weight decay etc., but with the same hyperparameters I get similar performance to the 1792-d embeddings.
This will simplify the flow since it will make the PCA unnecessary.
I will publish new models when training on the MS-Celeb-1M dataset is done.

        # Build the inference graph
        prelogits, _ = network.inference(image_batch, args.keep_probability, 
            phase_train=phase_train_placeholder, weight_decay=args.weight_decay)
        bottleneck = slim.fully_connected(prelogits, args.embedding_size, activation_fn=None, 
                weights_initializer=tf.truncated_normal_initializer(stddev=0.1), 
                weights_regularizer=slim.l2_regularizer(args.weight_decay),
                normalizer_fn=slim.batch_norm,
                normalizer_params=batch_norm_params,
                scope='Bottleneck', reuse=False)
        logits = slim.fully_connected(bottleneck, len(train_set), activation_fn=None, 
                weights_initializer=tf.truncated_normal_initializer(stddev=0.1), 
                weights_regularizer=slim.l2_regularizer(args.weight_decay),
                scope='Logits', reuse=False)

        embeddings = tf.nn.l2_normalize(bottleneck, 1, 1e-10, name='embeddings')
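
The last line scales every bottleneck vector to unit length. A NumPy sketch of that normalization, assuming the formula x / sqrt(max(sum(x**2), epsilon)) that TensorFlow documents for tf.nn.l2_normalize (the function below is illustrative, not part of the repository):

```python
import numpy as np


def l2_normalize(x, axis=1, epsilon=1e-10):
    """Scale each row to unit L2 norm; epsilon guards against division by zero."""
    square_sum = np.sum(np.square(x), axis=axis, keepdims=True)
    return x / np.sqrt(np.maximum(square_sum, epsilon))


bottleneck = np.array([[3.0, 4.0],
                       [0.0, 0.0]])
embeddings = l2_normalize(bottleneck)
print(embeddings)
```

Because every embedding then lies on the unit sphere, the Euclidean distance between two embeddings depends only on the angle between them.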

davidsandberg on 14 Feb 2017

All 10 comments

I'm doing something similar and finding my results to be not great, i.e., the top nearest-neighbor matches are false positives. The PCA doesn't seem to help much. Which models and datasets are you having success with?

haydenth on 2 Feb 2017

Hi @dleonsanchez,
I don't quite understand the intention behind the clustering. Is it to clean a noisy dataset or to use the cluster centers for classification (by NN)? Could you give more details? Thanks.

Hi @haydenth,
I was also planning to implement the "Detailed setting in testing" part of the center loss paper (A Discriminative Feature...), but it seems that you've already done it and did not get very good results.
Did you train PCA on the whole training dataset (e.g., casia) or on the validation dataset (LFW pairs, 9 out of the 10 folds)? Did you also use horizontal flip and cosine distance? Thanks.

ugtony on 7 Feb 2017

I did PCA only on the validation dataset and on my own validation datasets. Did not try other distance functions yet.

haydenth on 7 Feb 2017

@haydenth I am testing with the LFW dataset. PCA does not help in the sense of increasing accuracy. The advantage, from my point of view, is that we can reduce the embedding dimensionality (to 128 in my example) without losing accuracy, so the files where I store the embeddings are much smaller and all calculations (loading the file into a pandas object, kNN, clustering, etc.) are much faster.

@ugtony, for me clustering helps when the use case includes processing faces with unknown identities. To tag them, we can either go through them one by one, displaying each face's nearest neighbours to see whether it shares an identity with another picture, or we can cluster them (with hierarchical clustering in my example) and display them cluster by cluster, which makes tagging much easier.

But I am not too sure about this, so that's also why I was asking in the first place.

dleonsanchez on 8 Feb 2017

I have a similar issue. I am building a classifier similar to openface's, but the accuracy is extremely low, around 1-5%.

hudvin on 8 Feb 2017

@haydenth,
I've tested horizontal flip, PCA (learnt on all casia-webface faces), and cosine distance with a model trained on casia-webface. The results:

Flip+PCA+cosine > Flip ~= PCA+cosine > original > PCA ~= cosine

The performance improves when using PCA and cosine together, but does not when using PCA/cosine alone. Interesting phenomenon.

However, when the model trained on msceleb is used, even Flip+PCA+cosine does not help.
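
A sketch of the two ingredients compared above. The thread does not say how the flipped embedding was combined with the original one, so the concatenation below is an assumption, and both helper names are hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores the lengths of the vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))


def flip_augmented(emb_original, emb_flipped):
    """Assumed combination: concatenate the embeddings of an image and its mirror."""
    return np.concatenate([emb_original, emb_flipped])


rng = np.random.RandomState(0)
u = rng.randn(128)
v = rng.randn(128)
u /= np.linalg.norm(u)  # unit length, like the L2-normalized embeddings
v /= np.linalg.norm(v)

# For unit vectors: squared Euclidean distance == 2 * cosine distance.
sq_euclidean = float(np.sum((u - v) ** 2))
print(sq_euclidean, 2.0 * cosine_distance(u, v))

combined = flip_augmented(u, v)
print(combined.shape)
```

Since the identity holds for unit-length vectors, switching to cosine distance alone cannot change the ranking of L2-normalized embeddings. PCA output is generally no longer unit length, which may be part of why the two only help in combination, though the thread does not establish this.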

ugtony on 13 Feb 2017

Having compared PCA with a bottleneck FC layer in earlier experiments (done on vggface), I agree with davidsandberg that PCA would be unnecessary. Good to know that the bottleneck layer will be added to the master branch.

ugtony on 15 Feb 2017

FYI, I have included PCA transformations at the end of validate_on_lfw.py:

from sklearn.decomposition import PCA  # add to the imports of validate_on_lfw.py

# Re-evaluate LFW after projecting the embeddings onto i principal components.
for i in [32, 64, 128, 256, 512, 1024, 1792]:
    pcafacenet = PCA(n_components=i)
    pcafacenet.fit(emb_array)
    pcaemb_array = pcafacenet.transform(emb_array)
    print('Number of features: {}'.format(i))
    tpr, fpr, accuracy, val, val_std, far = lfw.evaluate(
        pcaemb_array, args.seed, actual_issame, nrof_folds=args.lfw_nrof_folds)
    print('Accuracy: %1.3f+-%1.3f' % (np.mean(accuracy), np.std(accuracy)))
    print('Validation rate: %2.5f+-%2.5f @ FAR=%2.5f' % (val, val_std, far))
    print('')

These are the results I get:

Number of features: 32
Accuracy: 0.959+-0.007
Validation rate: 0.66702+-0.01962 @ FAR=0.00103

Number of features: 64
Accuracy: 0.987+-0.003
Validation rate: 0.91714+-0.01633 @ FAR=0.00103

Number of features: 128
Accuracy: 0.992+-0.003
Validation rate: 0.97604+-0.00678 @ FAR=0.00068

Number of features: 256
Accuracy: 0.992+-0.003
Validation rate: 0.97500+-0.00732 @ FAR=0.00100

Number of features: 512
Accuracy: 0.992+-0.003
Validation rate: 0.97401+-0.00706 @ FAR=0.00100

Number of features: 1024
Accuracy: 0.993+-0.003
Validation rate: 0.97404+-0.00785 @ FAR=0.00100

Number of features: 1792
Accuracy: 0.993+-0.003
Validation rate: 0.97436+-0.00755 @ FAR=0.00100

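Accuracy is slightly lower with 512, 256 and 128 features (0.992 vs. 0.993). Below 128 the drop is significant. Interestingly, the validation rate is also slightly better with 128 features, although it is reported at a lower FAR (0.00068) than the others.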
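
For readers unfamiliar with the metric: the validation rate is the fraction of same-identity pairs accepted at a distance threshold whose false accept rate (FAR) stays within a target. The sketch below is a simplified single-fold illustration with made-up distances. It is not the repository's lfw.evaluate, and the function name is hypothetical.

```python
import numpy as np


def val_at_far(dist, is_same, far_target):
    """Return (validation rate, FAR) at the largest threshold with FAR <= far_target."""
    dist = np.asarray(dist, dtype=float)
    is_same = np.asarray(is_same, dtype=bool)
    best_val, best_far = 0.0, 0.0
    # Both rates can only grow with the threshold, so stop at the first violation.
    for t in np.sort(dist):
        accept = dist <= t
        far = float(np.mean(accept[~is_same]))  # different pairs wrongly accepted
        if far > far_target:
            break
        best_val, best_far = float(np.mean(accept[is_same])), far
    return best_val, best_far


# Toy pair distances: four same-identity pairs, then four different-identity pairs.
pair_dist = [0.2, 0.3, 0.4, 0.9, 0.35, 1.0, 1.2, 1.4]
pair_same = [True, True, True, True, False, False, False, False]

strict = val_at_far(pair_dist, pair_same, far_target=0.0)
loose = val_at_far(pair_dist, pair_same, far_target=0.25)
print(strict, loose)
```

A lower achieved FAR means a stricter threshold, so validation rates reported at different FAR values are not strictly comparable.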

dleonsanchez on 28 Feb 2017

Hi @dleonsanchez!
Nice results! Thanks for that!!
It basically confirms the results from the FaceNet paper regarding the required dimensionality of the embedding.
I have added a pretrained model with a 128-wide bottleneck layer. And as you say, the performance is very similar to the original 1792-d embedding.

davidsandberg on 2 Mar 2017
