Facenet: Some misconception in implementations

Created on 11 Jan 2017 · 13 Comments · Source: davidsandberg/facenet

Hi,
currently I am learning TensorFlow (I left Torch for a while because some things are hard to evaluate there) and I am taking a closer look at this project. I think I have found some misconceptions in the training code:

The DeConv layer should be attached to any layer other than the output of the last one (exactly like DropOut). In the code it is attached to 'logits', which is a 'bad way' of using it. It should be attached to 'prelogits' so that it decorrelates the features, not the class predictions.

There is a dropout in the Inception-ResNet. That is the wrong place for it when we use "CenterLoss". Why? Because with DropOut, the CenterLoss gets features with missing fields. As we learn one 'center' per class, this could harm the performance. The DropOut could instead be applied between prelogits and logits, named e.g. prelogits_dropout (with the plain prelogits still going to CenterLoss or DeConvLoss).
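
A minimal NumPy sketch of the dropout placement proposed above, not the repo's code. It assumes inverted dropout and made-up shapes; `prelogits_dropout` is the name suggested in the text, while `dropout`, `center_loss`, `weights` and all sizes are hypothetical.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero random fields and rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def center_loss(features, labels, centers):
    """Half the summed squared distance to each class center, per example."""
    diff = features - centers[labels]
    return 0.5 * np.sum(diff ** 2) / features.shape[0]

rng = np.random.default_rng(0)
prelogits = rng.normal(size=(8, 16))    # hypothetical batch of 8 embeddings
labels = rng.integers(0, 4, size=8)     # dense class ids for 4 classes
centers = rng.normal(size=(4, 16))      # one center per class
weights = rng.normal(size=(16, 4))      # hypothetical classifier weights

# Dropout sits only on the branch that feeds the classifier ...
prelogits_dropout = dropout(prelogits, keep_prob=0.8, rng=rng)
logits = prelogits_dropout @ weights

# ... while the center loss still sees the complete feature vectors.
loss_full = center_loss(prelogits, labels, centers)

# For comparison: the loss the centers would see with masked features.
loss_masked = center_loss(prelogits_dropout, labels, centers)
print(loss_full, loss_masked)
```

The two printed values differ because the masked version pulls the centers toward vectors that contain zeroed fields, which is the concern raised above.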

I plan to reimplement this and get roughly the same results as the Caffe implementation of Center-Loss (~98.6%). But first I need to make some changes in the training code. I must say that TensorFlow with TF-Slim is a really nice library, and implementing the loss function is really easy :)

melgor

👍1

Most helpful comment

I will continue with my findings in the code here:

The implementation of CenterLoss is not correct, and I know why :)

def center_loss(features, label, alfa, nrof_classes):
    """Center loss based on the paper "A Discriminative Feature Learning Approach for Deep Face Recognition"
       (http://ydwen.github.io/papers/WenECCV16.pdf)
    """
    with tf.variable_scope('center_loss'):
        batch_size    = tf.cast(tf.shape(features)[0], tf.float32)
        nrof_features = features.get_shape()[1]
        centers       = tf.get_variable('centers', [nrof_classes, nrof_features], dtype=tf.float32,
                            initializer = tf.contrib.layers.xavier_initializer())
        label         = tf.reshape(label, [-1])
        centers_batch = tf.gather(centers, label)
        loss          = tf.nn.l2_loss(features - centers_batch)
        loss          = tf.truediv(loss, batch_size)

    return loss, centers

I added two lines:

batch_size    = tf.cast(tf.shape(features)[0], tf.float32)
loss          = tf.truediv(loss, batch_size)

Why? It looks like tf.nn.l2_loss uses reduce_sum rather than reduce_mean, so it sums the L2 loss over all examples and does not divide by the batch size. I checked this implementation against my own NumPy version and it is correct. Additionally, I removed the code for updating the centers and treat them as a learnable variable (as in the caffe-face repo).
These changes make it possible to train the model with CenterLoss using the same parameters as in Caffe.
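
The effect of the two added lines can be checked by hand with a small NumPy sketch. It assumes that `l2_loss` below mirrors the documented definition of tf.nn.l2_loss (`sum(t ** 2) / 2`); the toy centers, features and labels are made up for illustration.

```python
import numpy as np

def l2_loss(t):
    """Same definition as tf.nn.l2_loss: sum(t ** 2) / 2, with no averaging."""
    return np.sum(t ** 2) / 2.0

def center_loss(features, labels, centers):
    """Center loss divided by the batch size, mirroring the two added lines."""
    batch_size = features.shape[0]
    centers_batch = centers[labels]        # NumPy analogue of tf.gather
    return l2_loss(features - centers_batch) / batch_size

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[1.0, 0.0], [1.0, 3.0]])
labels = np.array([0, 1])

# Without the division the value grows with the batch size.
unscaled = l2_loss(features - centers[labels])
unscaled_doubled = l2_loss(np.tile(features, (2, 1)) - centers[np.tile(labels, 2)])
print(unscaled, unscaled_doubled)          # 2.5 5.0

# With the division it stays the same when the batch is repeated.
scaled = center_loss(features, labels, centers)
scaled_doubled = center_loss(np.tile(features, (2, 1)), np.tile(labels, 2), centers)
print(scaled, scaled_doubled)              # 1.25 1.25
```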

@davidsandberg, if you agree with this, I can send a PR.

I was investigating the speed of TF, and in every scenario it was much slower than Torch. Based on some reading, NCHW is faster than the default NHWC. And it is (faster by ~30%). It does not take many changes to use it. Just add:

image_batch = tf.transpose(image_batch, [0, 3, 1, 2])

after getting the image_batch variable.

When creating the model, add data_format = "NCHW":

with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d],
                                stride=1, padding='SAME', data_format = "NCHW"):

And for the Inception layers, the concat must be done along dimension 1 (the channel axis):

mixed = tf.concat(1, [tower_conv, tower_conv1_1, tower_conv2_2])
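
The shape bookkeeping behind these three changes can be illustrated with NumPy alone. This is only a sketch with made-up tensor sizes; the tower names are borrowed from the line above, but their channel counts are hypothetical.

```python
import numpy as np

# A batch in the default NHWC layout: (batch, height, width, channels).
image_batch = np.zeros((4, 160, 160, 3), dtype=np.float32)

# Same permutation as tf.transpose(image_batch, [0, 3, 1, 2]) -> NCHW.
image_nchw = np.transpose(image_batch, (0, 3, 1, 2))
print(image_nchw.shape)                    # (4, 3, 160, 160)

# In NCHW the channels live on axis 1, so the Inception towers are joined there.
tower_conv = np.zeros((4, 32, 35, 35), dtype=np.float32)
tower_conv1_1 = np.zeros((4, 32, 35, 35), dtype=np.float32)
tower_conv2_2 = np.zeros((4, 64, 35, 35), dtype=np.float32)
mixed = np.concatenate([tower_conv, tower_conv1_1, tower_conv2_2], axis=1)
print(mixed.shape)                         # (4, 128, 35, 35)
```

With NHWC the same concat would use the last axis (3) instead.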

TensorFlow is very slow for non-optimized operations. I was trying to train the model using PReLU:
```python
import tensorflow as tf

def PReLu(x, idx, name="PReLu"):
    '''Parametric activation function'''
    # If incoming Tensor has a scope, this op is defined inside it
    i_scope = ""
    if hasattr(x, 'scope'):
        if x.scope: i_scope = x.scope

    with tf.variable_scope(i_scope + str(idx) + name) as scope:
        # One learnable slope per channel, initialized to 0 (plain ReLU)
        alphas = tf.get_variable('alpha', x.get_shape()[-1],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        # tf.mul is the pre-1.0 name of tf.multiply
        x = tf.nn.relu(x) + tf.mul(alphas, (x - tf.abs(x))) * 0.5
    return x
```
The network with PReLU is ~30% slower and takes 10-20% more memory, so testing my desired architecture is really painful. For now I have decided to use ReLU; maybe I will run a single test with ELU. Any suggestions?
I ran a single test using the Face-Res architecture with PReLU to validate the training against Torch. In Torch I get ~81% accuracy, but in TensorFlow 75%. I do not know why this happens, but it looks like the network overfits much more in TensorFlow. Maybe I am missing some regularization? Any suggestions?
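
The PReLU expression above, relu(x) + alpha * (x - |x|) * 0.5, can be checked against the usual piecewise definition with a short NumPy sketch. The input values and slopes are made up; the function names are hypothetical.

```python
import numpy as np

def prelu_closed_form(x, alpha):
    """The expression used in the snippet: relu(x) + alpha * (x - |x|) / 2."""
    return np.maximum(x, 0.0) + alpha * (x - np.abs(x)) * 0.5

def prelu_reference(x, alpha):
    """Piecewise definition: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

x = np.array([[-2.0, -0.5, 0.0, 1.5],
              [3.0, -1.0, 2.0, -4.0]])
alpha = np.array([0.1, 0.2, 0.3, 0.4])     # one slope per channel (last axis)

same = np.allclose(prelu_closed_form(x, alpha), prelu_reference(x, alpha))
print(same)                                # True
```

Since (x - |x|) / 2 equals x for negative inputs and 0 otherwise, the two forms agree; with alpha initialized to 0, as in the snippet, the layer starts out as a plain ReLU.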

melgor on 17 Jan 2017

👍4

All 13 comments

Hi @melgor,
Thanks for sharing your ideas!! Looking forward to hearing about your progress...
Regarding the Deconv loss, the implementation in the repo is not very good, and I don't think it's possible to use it on somewhat larger datasets due to its memory inefficiency. Although I like the concept of penalizing correlated features, I should probably remove the implementation from the repo for now, but I haven't gotten that far. ;-)
Do you mean that the estimate of the center gets noisy due to dropout? Could very well be like that, and it would be interesting to try out the solution that you propose. Currently my GPU is occupied with training on the MsCeleb-1M dataset (looks promising so far) but there is quite some parameter tuning to be done...
It will be interesting to hear about your findings!! Lately I have seen that it's possible to improve performance from 0.984 to 0.988 by training on a selected subset of the CASIA-WebFace dataset (and no facescrub). So even a relatively clean dataset like CASIA may still have enough label noise to drag down the performance a bit. That was a surprise to me at least. :-)
Yes, Tensorflow is a fantastic tool and seems to improve with every release...

davidsandberg on 11 Jan 2017

I like the idea of DeConv too, because when you are learning the features using CenterLoss or TripletLoss, you cannot use DropOut. DeConv resolves that problem, so it would be nice to check whether it works well.

> Do you mean that the estimate of the center gets noisy due to dropout?

Yes, this is exactly what I'm talking about. I think it may hurt the performance, as CenterLoss penalizes fields which were masked by DropOut earlier. You may try it later with DropOut after 'prelogits'.

> Lately I have seen that it's possible to improve performance from 0.984 to 0.988 by training on a selected subset of the CASIA-WebFace dataset (and no facescrub).

This is a scenario which I want to reimplement. The people behind the Caffe repo with CenterLoss claim that with a clean dataset (and merged features from the original and the flipped image) they get nearly 99%. Using Caffe, I get 98.6% (original + flipped images), so at least I have the right database. If everything works, I will then try to train the model with the cleaned version of CASIA.

Again, thanks for fantastic code. It is much easier to learn new stuff when you have such nice code to read!

melgor on 11 Jan 2017

I also found the DropOut and DeConv parts strange when I read them. Thanks @melgor for pointing them out and starting a good discussion here.

I have a similar question that might be related:
In the original FaceNet paper, the embedded features are L2-normalized. In facenet_train_classifier.py, the features are normalized in the evaluation phase, but not in the training phase. Is that also done with the dropout layer in mind?
I ran two relevant experiments:
1) I removed the L2-normalization op in the evaluation phase and the performance crashed.
2) I added an L2-normalization op in the training phase and the performance degraded slightly (Acc: 0.980=>0.978) after 100k iterations on VGGFace.

@davidsandberg is working on selecting a subset of CASIA-WebFace to get better performance. From #48 I understand that this was done by filtering out the images far from the centers (correct me if I'm wrong). But what is the intuition behind it? Is it to remove correctly labeled samples with rare appearance, or to remove the wrongly labeled images?

In either case, in this webpage I found a CASIA-WebFace "cleanlist" (haven't found how they produce it), which shows there are still many wrong-labeled images in the standard published version. Maybe you can run an experiment based on the list and compare the resulting performance.

ugtony on 12 Jan 2017

@ugtony About your experiments: this is the expected behavior according to theory:

At the evaluation stage we want all features to be normalized because we calculate the distance between them. If we normalize them using the L2 norm, the maximum squared distance is 4. When you remove the normalization, the maximum distance between features is unknown (it could be 100k). So the L2 norm during evaluation is useful because we can test a range of thresholds between 0 and 4.
When you apply the normalization during training, the input features to the 'Classifier' are normalized. I think that this should not hurt the performance much.
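
The 0-4 range can be illustrated with a small NumPy sketch. It assumes the distance in question is the squared Euclidean distance between L2-normalized embeddings; the embedding size, scale and helper name are made up.

```python
import numpy as np

def l2_normalize(x, eps=1e-10):
    """Scale every row to unit Euclidean length."""
    norm = np.sqrt(np.maximum(np.sum(x ** 2, axis=1, keepdims=True), eps))
    return x / norm

rng = np.random.default_rng(0)
emb = rng.normal(scale=100.0, size=(1000, 128))   # hypothetical raw embeddings
a_raw, b_raw = emb[:500], emb[500:]
a, b = l2_normalize(a_raw), l2_normalize(b_raw)

# Raw squared distances have no fixed upper bound.
raw_sq_dist = np.sum((a_raw - b_raw) ** 2, axis=1)

# For unit vectors ||a - b||^2 = 2 - 2 * cos(a, b), so it always lies in [0, 4].
sq_dist = np.sum((a - b) ** 2, axis=1)
print(raw_sq_dist.max(), sq_dist.max())

# Two exactly opposite unit vectors reach the upper bound (up to rounding).
worst_case = np.sum((a[0] - (-a[0])) ** 2)
print(worst_case)
```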

About CASIA: the clean version removes the wrongly labeled images. I think that correctly labeled images should stay even if they are rare. I will be using the clean list which you sent. But maybe the data filtering done by @davidsandberg works better; I will check this later.

melgor on 12 Jan 2017

Thanks for melgor's reply. I'm curious about why normalization during training didn't improve the performance.

The question about filtering and the suggestion to test on the cleaned CASIA dataset were actually meant for davidsandberg, sorry for the confusion.

ugtony on 12 Jan 2017

@ugtony The intention is to filter out the wrongly labeled examples, but unless the model is already perfect it will not only filter out wrongly labeled examples but also correctly labeled examples that can be considered more difficult. And if too many of the difficult examples are removed the training will suffer, so the question is whether this procedure can result in a better model. I think the answer is yes.
It should be noted that the final goal is not to clean the CASIA dataset (I'm even using the maxpy cleaned version of casia, so it should be quite clean already) but to apply it to the MsCeleb dataset. This dataset is "intentionally" very dirty and sifting through 8M+ images manually is not something that I would like to do. ;-)
I tried two ways of cleaning the dataset. The first one assumed that some identities are dirtier than others, so that removing the classes with the largest intra-class variance gives a "cleaned" dataset. I removed the 10% "worst" classes but got the same performance as with the full dataset.
The second method instead removed the 10% of the images that were furthest away from their respective class center. This approach seemed to work well on casia and I'm now trying it for cleaning the MsCeleb dataset.
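
A rough NumPy sketch of the second method, under assumptions that the thread does not confirm: the class center is taken as the mean embedding of the class, the distance is Euclidean, and the 10% is taken over the whole dataset rather than per class. The function name and toy data are hypothetical.

```python
import numpy as np

def filter_furthest(embeddings, labels, drop_fraction=0.1):
    """Boolean keep-mask dropping the samples furthest from their class center."""
    dists = np.empty(len(labels))
    for c in np.unique(labels):
        idx = labels == c
        center = embeddings[idx].mean(axis=0)      # assumed: mean embedding
        dists[idx] = np.linalg.norm(embeddings[idx] - center, axis=1)
    nrof_drop = int(drop_fraction * len(labels))
    keep = np.ones(len(labels), dtype=bool)
    if nrof_drop > 0:
        keep[np.argsort(dists)[-nrof_drop:]] = False
    return keep

# Toy data: two tight classes, each with one planted outlier.
rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.01, size=(20, 2))
embeddings[10:] += 5.0                  # class 1 sits around (5, 5)
labels = np.repeat([0, 1], 10)
embeddings[0] = [10.0, 10.0]            # outlier labeled as class 0
embeddings[10] = [-10.0, -10.0]         # outlier labeled as class 1

keep = filter_furthest(embeddings, labels, drop_fraction=0.1)
print(np.flatnonzero(~keep))            # indices of the removed samples
```

As noted above, a filter like this cannot tell a mislabeled image from a correctly labeled but difficult one; both end up far from the center.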

davidsandberg on 12 Jan 2017

Thanks for the explanation. Looking forward to seeing your results on the MsCeleb dataset.
Do you know where the maxpy casia comes from? I googled it and found some download links, but none of them mention how it was made or by whom.

ugtony on 13 Jan 2017

Hi @melgor
I agree with the scaling of the center loss. But I guess it shouldn't affect the results, only lead to a different "optimal" center loss coefficient. If you send a PR I will merge it.
The data format result is very interesting. I didn't think it would have that big an impact. I assume that it has to do with cuDNN's native data format. I will try the transpose when my GPU is free for something else. ;-) Thanks!

davidsandberg on 26 Jan 2017

Hi again @melgor,
I tried to change the data_format to 'NCHW', but I didn't see any speed-up.
I'm running on a Titan X (Pascal) using Cuda 8 and CuDNN 5.1.5, and training takes ~0.38 seconds for one batch (90 examples).
Which setup did you use when you saw the speed-up?

davidsandberg on 28 Jan 2017

I have a GTX 980 (Cuda 8 and CuDNN 5.1.5). Maybe you could check the speed with the benchmark here (I changed the batch size to 64).

These are the results from this benchmark:

NCHW:
Forward across 100 steps, 0.091 +/- 0.009 sec / batch
Forward-backward across 100 steps, 0.281 +/- 0.028 sec / batch

NHWC:
Forward across 100 steps, 0.103 +/- 0.010 sec / batch
Forward-backward across 100 steps, 0.341 +/- 0.034 sec / batch

So that is the difference. I see pretty much the same difference when using the "ResNet" architecture. But I did not try running "inception_resnet_v2", so maybe it does not hold for all architectures.

melgor on 2 Feb 2017

Hi @davidsandberg, just to confirm: the "label" in the CenterLoss function is a dense label, not a one-hot vector, right?
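
Judging only from the center_loss snippet quoted earlier in this thread, `tf.gather(centers, label)` picks rows of `centers` by index, which suggests integer class ids. The NumPy sketch below illustrates that reading with made-up values; it is not a confirmed answer from the repo author.

```python
import numpy as np

centers = np.arange(12, dtype=np.float32).reshape(4, 3)   # 4 classes, 3 features
dense_labels = np.array([2, 0, 3])                        # one class id per example

# Row lookup by class id, the NumPy analogue of tf.gather(centers, label).
centers_batch = centers[dense_labels]
print(centers_batch.shape)                                # (3, 3)

# One-hot labels would first have to be converted back to class ids.
one_hot_labels = np.eye(4, dtype=np.int64)[dense_labels]
recovered = np.argmax(one_hot_labels, axis=1)
print(recovered)                                          # [2 0 3]
```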