Hi,
I am currently learning TensorFlow (I left Torch for a while because some things are hard to evaluate there) and I'm taking a closer look at this project. I think I have found some misconceptions in the training code:
I plan to reimplement the Caffe implementation of Center-Loss and get roughly the same results (~98.6%). But first I need to make some changes in the training code. I must say that TensorFlow with TF-Slim is a really nice library, and implementing the loss function is really easy :)
Hi @melgor,
Thanks for sharing your ideas!! Looking forward to hearing about your progress...
Regarding the Deconv loss: the implementation in the repo is not very good, and I don't think it can be used on somewhat larger datasets due to its memory inefficiency. Although I like the concept of penalizing correlated features, I should probably remove the implementation from the repo for now, but I haven't gotten that far. ;-)
Do you mean that the estimate of the center gets noisy due to dropout? Could very well be like that, and it would be interesting to try out the solution that you propose. Currently my GPU is occupied with training on the MsCeleb-1M dataset (looks promising so far) but there is quite some parameter tuning to be done...
It will be interesting to hear about your findings!! Lately I have seen that it's possible to improve performance from 0.984 to 0.988 by training on a selected subset of the CASIA-WebFace dataset (and no facescrub). So even a relatively clean dataset such as CASIA may still have enough label noise to drag down the performance a bit. That was a surprise to me at least. :-)
Yes, TensorFlow is a fantastic tool and seems to improve with every release...
I like the idea of DeConv too, because when you are learning features with CenterLoss or TripletLoss you cannot use DropOut. DeConv resolves that problem, so it would be nice to check whether it works well.
> Do you mean that the estimate of the center gets noisy due to dropout?

Yes, that is exactly what I'm talking about. I think it may hurt performance, because CenterLoss penalizes fields that were masked by DropOut earlier. You could try it later with DropOut after 'prelogits'.
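To make the dropout concern concrete, here is a minimal sketch in plain Python (not the repo's code). It assumes dropout is applied directly to the features that feed the center loss, and it uses a hand-picked, hypothetical mask. A sample that sits exactly on its class center has zero loss, but the same sample after dropout is pushed away from the center:

```python
def apply_dropout(features, mask, keep_prob):
    """Inverted dropout: zero the masked units and rescale survivors by 1/keep_prob."""
    return [f * m / keep_prob for f, m in zip(features, mask)]


def half_squared_distance(a, b):
    """0.5 * ||a - b||^2, the per-example center-loss term."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))


center = [1.0, 2.0, 3.0, 4.0]
features = list(center)      # a sample lying exactly on its class center
mask = [1, 0, 1, 0]          # hypothetical dropout mask, chosen by hand

dropped = apply_dropout(features, mask, keep_prob=0.5)

loss_clean = half_squared_distance(features, center)
loss_dropped = half_squared_distance(dropped, center)
print(loss_clean, loss_dropped)
```

The non-zero loss for the dropped sample comes only from the mask, not from the sample itself, which is the noise being discussed.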
> Lately I have seen that it's possible to improve performance from 0.984 to 0.988 by training on a selected subset of the CASIA-WebFace dataset (and no facescrub).

This is the scenario I want to reimplement. The authors of the Caffe repo with CenterLoss claim that with a clean dataset (and merged features from the original and the flipped image) they get nearly 99%. Using Caffe, I get 98.6% (original + flipped images), so at least I have the right database. If everything works, I will then try to train the model with the cleaned version of CASIA.
Again, thanks for the fantastic code. It is much easier to learn new stuff when you have such nice code to read!
I also found the DropOut and DeConv part strange when I read it. Thanks, @melgor, for pointing these out and starting a good discussion here.
I have a similar question that might be related:
In the original FaceNet paper, the embedded features are L2-normalized. In `facenet_train_classifier.py`, the features are normalized in the evaluation phase but not in the training phase. Is this also related to the dropout layer?
I ran two relevant experiments:
1) I removed the L2-normalization op in the evaluation phase and the performance crashed.
2) I added an L2-normalization op in the training phase and the performance degraded slightly (Acc: 0.980 => 0.978) after 100k iterations on VGGFace.
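For readers unfamiliar with the op being toggled in these experiments, here is a small plain-Python sketch of L2 normalization. It follows the usual definition (divide each vector by its Euclidean norm, with a small epsilon guard); the `eps` value and the example vectors are assumptions for illustration. Two embeddings that point in the same direction but differ in magnitude are far apart before normalization and identical after it:

```python
import math


def l2_normalize(vec, eps=1e-10):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(max(sum(v * v for v in vec), eps))
    return [v / norm for v in vec]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical embeddings: same direction, different magnitude.
emb_a = [3.0, 4.0]
emb_b = [6.0, 8.0]

raw_dist = squared_distance(emb_a, emb_b)
norm_dist = squared_distance(l2_normalize(emb_a), l2_normalize(emb_b))
print(raw_dist, norm_dist)
```

After normalization only the direction of the embedding matters, which is why removing the op at evaluation time changes distance-based comparisons so drastically.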
@davidsandberg is working on selecting a subset of CASIA-WebFace to get better performance. From #48 I understood that this was done by filtering out the images far from their class centers (correct me if I'm wrong). But what's the intuition behind it? Is it to remove correctly labeled samples with a rare appearance, or to remove wrongly labeled images?
In either case, on this webpage I found a CASIA-WebFace "cleanlist" (I haven't found out how it was produced), which shows there are still many wrongly labeled images in the standard published version. Maybe you can run an experiment based on the list and compare the resulting performance.
@ugtony About your experiments: the behavior you saw is correct according to the theory:
About CASIA: the clean version removes the wrongly labeled images. I think correctly labeled images should stay even if they are rare. I will use the clean list that you sent. But maybe the data filtering done by @davidsandberg works better; I will look at this later.
Thanks for melgor's reply. I'm curious why normalization during training didn't improve the performance.
The question about filtering and the suggestion to test on the cleaned CASIA dataset were actually meant for davidsandberg, sorry for the confusion.
@ugtony The intention is to filter out the wrongly labeled examples, but unless the model is already perfect it will filter out not only wrongly labeled examples but also correctly labeled examples that can be considered more difficult. And if too many of the difficult examples are removed the training will suffer, so the question is whether this procedure can result in a better model. I think the answer is yes.
It should be noted that the final goal is not to clean the CASIA dataset (I'm even using the maxpy cleaned version of casia, so it should be quite clean already) but to apply it to the MsCeleb dataset. This dataset is "intentionally" very dirty and sifting through 8M+ images manually is not anything that I would like to do. ;-)
I tried two ways of cleaning the dataset. The first one assumed that some identities are dirtier than others, so that removing the classes with the largest intra-class variance gives a "cleaned" dataset. I removed the 10% "worst" classes but got the same performance as with the full dataset.
The second method instead removed the 10% of the images that were furthest away from their respective class centers. This approach seemed to work well on CASIA, and I'm now trying it for cleaning the MsCeleb dataset.
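The second method can be sketched roughly as follows. This is a plain-Python illustration, not the script that was actually used. It assumes the class center is the mean embedding of the class and that the 10% is taken over the whole dataset rather than per class. The function names and the toy data are hypothetical:

```python
def class_centers(embeddings, labels):
    """Mean embedding per class label."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def filter_far_samples(embeddings, labels, drop_fraction=0.1):
    """Return the indices kept after dropping the samples furthest from their class center."""
    centers = class_centers(embeddings, labels)
    dists = [sum((e - c) ** 2 for e, c in zip(emb, centers[lab]))
             for emb, lab in zip(embeddings, labels)]
    nrof_drop = int(len(embeddings) * drop_fraction)
    # Indices sorted from furthest to closest; the first nrof_drop are removed.
    order = sorted(range(len(embeddings)), key=lambda i: dists[i], reverse=True)
    dropped = set(order[:nrof_drop])
    return [i for i in range(len(embeddings)) if i not in dropped]


# Toy data: class 0 contains one obvious outlier (index 4), class 1 is tight.
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2], [10.0, 10.0],
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [0.9, 1.0], [1.0, 0.9]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = filter_far_samples(embeddings, labels, drop_fraction=0.1)
print(kept)
```

Note that the outlier also pulls its own class center towards itself, so in practice the centers would probably come from a trained model's embeddings and the filtering might be repeated.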
Thanks for the explanation. Looking forward to seeing your result on the MsCeleb dataset.
Do you know where the maxpy CASIA version comes from? I googled it and found some download links, but none of them mentioned how it was made or by whom.
I will continue with my findings in the code here:
```python
import tensorflow as tf

def center_loss(features, label, alfa, nrof_classes):
    """Center loss based on the paper "A Discriminative Feature Learning Approach for Deep Face Recognition"
       (http://ydwen.github.io/papers/WenECCV16.pdf)
    """
    with tf.variable_scope('center_loss'):
        # Batch size as a float, used below to average the loss
        batch_size = tf.cast(tf.shape(features)[0], tf.float32)
        nrof_features = features.get_shape()[1]
        # The centers are an ordinary learnable variable
        centers = tf.get_variable('centers', [nrof_classes, nrof_features], dtype=tf.float32,
            initializer=tf.contrib.layers.xavier_initializer())
        label = tf.reshape(label, [-1])
        centers_batch = tf.gather(centers, label)
        # tf.nn.l2_loss sums over all elements, so divide by the batch size
        loss = tf.nn.l2_loss(features - centers_batch)
        loss = tf.truediv(loss, batch_size)
    return loss, centers
```
I added two lines:
```python
batch_size = tf.cast(tf.shape(features)[0], tf.float32)
loss = tf.truediv(loss, batch_size)
```
Why? It looks like `tf.nn.l2_loss` uses `reduce_sum`, not `reduce_mean`. As a result it sums the L2 loss over all examples and does not divide by the batch size. I checked this implementation against my own NumPy version and it is correct. In addition, I removed the code for updating the centers and treat them as a learnable variable (like in the caffe-face repo).
These changes make it possible to train the model with CenterLoss using the same parameters as in Caffe.
@davidsandberg, if you agree with this, I can send a PR.
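The scaling issue can be checked without TensorFlow. The sketch below mimics `tf.nn.l2_loss` (sum of squared elements divided by 2) in plain Python. The helper names and the toy batches are assumptions for illustration. Without the division by the batch size, the loss grows linearly with the batch size even though every example has the same distance to its center:

```python
def l2_loss(rows):
    """Mimics tf.nn.l2_loss: sum of all squared elements, divided by 2."""
    return sum(v * v for row in rows for v in row) / 2.0


def center_loss_value(features, centers_batch, normalize):
    """Center loss of a batch, optionally averaged over the batch size."""
    diff = [[f - c for f, c in zip(feat, cen)]
            for feat, cen in zip(features, centers_batch)]
    loss = l2_loss(diff)
    return loss / len(features) if normalize else loss


def make_batch(size):
    """Hypothetical batch in which every example is offset by (1, 2) from its center."""
    return [[1.0, 2.0]] * size, [[0.0, 0.0]] * size


for size in (2, 4):
    feats, cents = make_batch(size)
    print(size,
          center_loss_value(feats, cents, normalize=False),
          center_loss_value(feats, cents, normalize=True))
```

With the normalization, the loss value (and so the effective weight of the center-loss term) no longer depends on the batch size.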
Add this line after getting the `image_batch` variable, to convert the batch from NHWC to NCHW:
```python
image_batch = tf.transpose(image_batch, [0, 3, 1, 2])
```
When creating the model, add `data_format = "NCHW"`:
```python
with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d],
                    stride=1, padding='SAME', data_format="NCHW"):
    ...  # the rest of the model definition is unchanged
```
And for the Inception layers, `concat` must be done along dimension 1 (the channel axis):
```python
mixed = tf.concat(1, [tower_conv, tower_conv1_1, tower_conv2_2])
```
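To see why the permutation is `[0, 3, 1, 2]` and why the concat axis becomes 1, here is a small plain-Python sketch that only tracks shapes. The 160x160x3 image size is an assumption for illustration. With a transpose, output dimension `i` takes input dimension `perm[i]`:

```python
def transposed_shape(shape, perm):
    """Shape after a transpose: output dimension i takes input dimension perm[i]."""
    return tuple(shape[axis] for axis in perm)


# Hypothetical NHWC batch: 90 images of 160x160 pixels with 3 channels.
nhwc_shape = (90, 160, 160, 3)
perm = [0, 3, 1, 2]

nchw_shape = transposed_shape(nhwc_shape, perm)

# The channel axis is 3 in NHWC; after the transpose it sits at position 1,
# which is why the Inception towers must be concatenated along axis 1.
channel_axis_nhwc = 3
channel_axis_nchw = perm.index(channel_axis_nhwc)

print(nchw_shape, channel_axis_nchw)
```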
TensorFlow is very slow for non-optimized operations. I was trying to train the model using `PReLU`:
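```python
def PReLu(x, idx, name="PReLu"):
    '''Parametric activation function'''
    # If incoming Tensor has a scope, this op is defined inside it
    i_scope = ""
    if hasattr(x, 'scope'):
        if x.scope: i_scope = x.scope
    with tf.variable_scope(i_scope + str(idx) + name) as scope:
        # One learnable slope per channel, initialized to 0 (plain ReLU at the start)
        alphas = tf.get_variable('alpha', x.get_shape()[-1],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        # relu(x) handles x > 0; (x - |x|) * 0.5 equals x for x < 0 and 0 otherwise
        x = tf.nn.relu(x) + tf.mul(alphas, (x - tf.abs(x))) * 0.5
    return x
```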
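The expression used above can be checked against the textbook PReLU definition with scalars. This is a plain-Python sketch with hypothetical function names and sample values. For negative inputs, `(x - |x|) * 0.5` is just `x`, so the second term contributes `alpha * x`; for non-negative inputs it is zero:

```python
def prelu_reference(x, alpha):
    """Textbook PReLU: identity for positive inputs, slope alpha for the rest."""
    return x if x > 0 else alpha * x


def prelu_via_relu(x, alpha):
    """Same formula as the TensorFlow snippet, written for a single float."""
    return max(x, 0.0) + alpha * (x - abs(x)) * 0.5


samples = [-2.0, -0.5, 0.0, 1.5]
alpha = 0.25

for x in samples:
    print(x, prelu_reference(x, alpha), prelu_via_relu(x, alpha))
```

Since `alpha` is initialized to 0.0 in the snippet, the layer behaves like a plain ReLU at the start of training. The extra cost comes from the additional elementwise ops (`abs`, subtract, two multiplies, add) and their intermediate tensors.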
The network with `PReLU` is ~30% slower and takes 10-20% more memory, so testing my desired architecture is really painful. For now I have decided to use `ReLU`; maybe I will run a single test with `ELU`. Any suggestions?
I ran a single test using the Face-Res architecture with `PReLU` to validate the training against Torch. In Torch I get ~81% accuracy, but in TensorFlow 75%. I do not know why this happens, but it looks like the network overfits much more in TensorFlow. Maybe I am missing some regularization? Any suggestions?
Hi @melgor
I agree with the scaling of the center loss. But I guess it shouldn't affect the results but just result in another "optimal" center loss coefficient. If you send a PR I will merge it.
Regarding the data format, this is very interesting. I didn't think it would have that big an impact. I assume that it has to do with cuDNN's native data format. I will try the transpose when my GPU is free for something else. ;-) Thanks!
Hi again @melgor,
I tried to change the data_format to 'NCHW', but I didn't see any speed-up.
I'm running on a Titan X (Pascal) using Cuda 8 and CuDNN 5.1.5, and training takes ~0.38 seconds for one batch (90 examples).
Which setup did you use when you saw the speed-up?
I have a GTX 980 (Cuda 8 and CuDNN 5.1.5). Maybe you could check the speed with the benchmark here: I changed the batch size to 64.
These are the results from this benchmark:
NCHW:
Forward across 100 steps, 0.091 +/- 0.009 sec / batch
Forward-backward across 100 steps, 0.281 +/- 0.028 sec / batch
NHWC:
Forward across 100 steps, 0.103 +/- 0.010 sec / batch
Forward-backward across 100 steps, 0.341 +/- 0.034 sec / batch
So there is a difference. I see pretty much the same difference when using a "ResNet" architecture. But I did not try running "inception_resnet_v2", so maybe it does not hold for all architectures.
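To put the timings into perspective, the short calculation below turns them into relative slowdowns and throughput. It assumes the second pair of timings is the NHWC run and uses the batch size of 64 mentioned above; the variable names are only for illustration:

```python
# Mean seconds per batch, taken from the benchmark output above.
nchw = {"forward": 0.091, "forward_backward": 0.281}
nhwc = {"forward": 0.103, "forward_backward": 0.341}  # assumed to be the NHWC run
batch_size = 64

# How many times longer NHWC takes than NCHW for each phase.
slowdown = {phase: nhwc[phase] / nchw[phase] for phase in nchw}

# Training throughput in images per second.
images_per_sec = {
    "NCHW": batch_size / nchw["forward_backward"],
    "NHWC": batch_size / nhwc["forward_backward"],
}

for phase, ratio in slowdown.items():
    print("%s: NHWC takes %.2fx the NCHW time" % (phase, ratio))
for layout, rate in images_per_sec.items():
    print("%s: %.0f images/sec" % (layout, rate))
```

Under that assumption, NHWC is roughly 13% slower for the forward pass and roughly 21% slower for forward-backward on this GPU.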
Hi @davidsandberg, just to confirm: the "label" in the center loss function is a dense label, not a one-hot vector, right?
@heimanba89 Yes, it's a dense label. The `tf.gather` function requires indices.
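As a small illustration of the difference, here is a plain-Python sketch with hypothetical helper names and toy data. A gather simply picks rows by integer index, so one-hot labels have to be converted to dense class indices first:

```python
def one_hot_to_dense(one_hot_rows):
    """Convert one-hot rows to dense class indices (position of the largest entry)."""
    return [row.index(max(row)) for row in one_hot_rows]


def gather(params, indices):
    """Pick rows of params by integer index, like a gather along the first axis."""
    return [params[i] for i in indices]


# Toy centers for 3 classes with 2 features each.
centers = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

one_hot_labels = [[0, 0, 1], [1, 0, 0]]
dense_labels = one_hot_to_dense(one_hot_labels)

centers_batch = gather(centers, dense_labels)
print(dense_labels, centers_batch)
```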