Insightface: Pytorch implementation. rescale parameter in sgd

Created on 20 Oct 2020 · 29 Comments · Source: deepinsight/insightface

Could you please explain the meaning of the rescale parameter in your sgd implementation?
https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/sgd.py#L42

You set it to the world_size in the training here https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_fc.py#L82
and it seems like it affects gradients in both the backbone and the head.
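For readers who have not seen such a parameter before, a rescale factor in an SGD step is typically a constant that every raw gradient is divided by before the update. The sketch below illustrates that idea in plain Python. It is an assumption about what the parameter does, not a copy of the linked sgd.py, and `sgd_step` with its signature is a hypothetical helper.

```python
def sgd_step(params, grads, lr=0.1, rescale=1.0):
    """Return updated params after one plain SGD step.

    Assumption: `rescale` divides every gradient before the update,
    so rescale=world_size shrinks all gradients by world_size.
    """
    return [p - lr * (g / rescale) for p, g in zip(params, grads)]


params = [1.0, -2.0]
grads = [0.5, 0.25]

# With rescale=1 the raw gradients are applied as-is.
plain = sgd_step(params, grads, lr=0.1, rescale=1.0)

# With rescale=2 (e.g. world_size=2) every update is half as large,
# for backbone and head parameters alike.
scaled = sgd_step(params, grads, lr=0.1, rescale=2.0)
```

Under this assumption, a single optimizer with rescale=world_size scales backbone and head gradients by the same factor, which is why the value matters for both.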

bug partial_fc

All 29 comments

@xiaoyang-coder

@xiaoyang-coder I believe that here https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L135 the gradient is already divided by world_size * batch_size, because the "logits" tensor has the same size along dimension 0 as the "total_features" tensor https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L101, which is world_size * batch_size.
Moreover, I compared the gradients with a naive implementation, and it seems that for the backbone parameters we even need to multiply the gradients by world_size, most likely because of the gradient averaging that torch.nn.parallel.DistributedDataParallel does under the hood https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L63
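The arithmetic behind that observation can be checked with a toy example. The sketch below uses plain Python and made-up per-sample gradients. It assumes the head normalizes by the global batch size (world_size * batch_size) and that the backbone gradients are then averaged across ranks, as DistributedDataParallel does by default. All names and numbers are illustrative.

```python
WORLD_SIZE = 2
BATCH_SIZE = 3  # per-rank batch size
N = WORLD_SIZE * BATCH_SIZE  # global batch size

# Toy per-sample gradients of the loss w.r.t. one backbone weight.
per_sample_grads = [
    [1.0, 2.0, 3.0],  # samples on rank 0
    [4.0, 5.0, 6.0],  # samples on rank 1
]

# Reference: a single process with the loss averaged over the global batch.
reference = sum(sum(rank) for rank in per_sample_grads) / N

# Distributed: the head already divides by the global batch size N,
# so each rank back-propagates sum(local grads) / N into the backbone.
local = [sum(rank) / N for rank in per_sample_grads]

# DDP-style all-reduce that averages the local gradients across ranks.
ddp_grad = sum(local) / WORLD_SIZE

# The averaged gradient is world_size times smaller than the reference.
ratio = reference / ddp_grad
```

Here `ratio` equals `WORLD_SIZE`: the global normalization and the cross-rank average together divide by world_size one time too many.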

Thank you, you are right, let's test again

@golunovas

@xiaoyang-coder your fix solves the issue with the gradients in the head, but the backbone gradients are still world_size times smaller than they should be.
I see several options for fixing it:

  1. Use separate optimizers with different rescale parameters for the backbone (rescale=1) and the head (rescale=2).
  2. Keep one optimizer with rescale=2 and multiply x_grad by world_size after the all-reduce gradient computations https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_fc.py#L143
  3. Use one optimizer with rescale=1, divide grad by world_size*batch_size (as it used to be: "grad.div_(grad.size()[0])"), and multiply x_grad by world_size after the all-reduce gradient computations.
    All of these options seem to give exactly the same results. I prefer the third option because it lets us use PyTorch's standard SGD and has minimal computational overhead. What do you think?
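A minimal sketch of the compensation used in the third option, in plain Python with toy scalars. It assumes x_grad (the gradient of the loss with respect to the features) is already divided by world_size * batch_size in the head, and that the backbone gradients are averaged across ranks afterwards. `backbone_grad` and all values are hypothetical and only illustrate the arithmetic.

```python
def backbone_grad(x_grads, jacobians, world_size, compensate):
    """Backbone gradient after a DDP-style average across ranks.

    x_grads[r][i]   : dLoss/dfeature for sample i on rank r, already
                      divided by world_size * batch_size in the head.
    jacobians[r][i] : dfeature/dweight for the same sample (toy scalar).
    compensate      : if True, multiply x_grad by world_size before the
                      backbone backward pass.
    """
    scale = world_size if compensate else 1
    local = [
        sum(scale * xg * j for xg, j in zip(x_grads[r], jacobians[r]))
        for r in range(world_size)
    ]
    return sum(local) / world_size  # averaging across ranks


world_size = 2
x_grads = [[1.0, 2.0], [3.0, 4.0]]      # already normalized by N = 4
jacobians = [[1.0, 0.5], [2.0, 0.25]]

# Single-process reference: plain sum over all samples.
reference = sum(
    xg * j
    for rank_x, rank_j in zip(x_grads, jacobians)
    for xg, j in zip(rank_x, rank_j)
)

without_fix = backbone_grad(x_grads, jacobians, world_size, compensate=False)
with_fix = backbone_grad(x_grads, jacobians, world_size, compensate=True)
```

With these numbers `without_fix` is half of `reference`, while `with_fix` matches it, so no custom rescale is needed in the optimizer.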

Okay, we will test these methods. @golunovas

@xiaoyang-coder did you have a chance to take a look at that ^?

@golunovas We have tested the third method you mentioned, but it still does not reach the accuracy of the MXNet version, so we still need to look for other bugs.

Did you notice any improvements in accuracy at least?

I didn't notice an improvement; maybe it is too small. @golunovas

Can I have your QQ account number? @golunovas

@xiaoyang-coder unfortunately, I don't have an account there. Does email work? Or maybe insightface's Slack?

We have fixed it

@xiaoyang-coder great, did you manage to reach the same accuracy with sample_rate=0.1?

Yes, it has been reached. @golunovas

@xiaoyang-coder great, thank you for the effort.

Hi! It looks like there are still some bugs in the PyTorch code.

In my experiments, training with sampling_ratio = 0.1 converges successfully, but it fails when I try to use all negative classes for computing softmax (sampling_ratio = 1.0).
The loss decreases slightly at the very beginning, but very soon it gets stuck and keeps fluctuating at relatively high values.

Can you show the settings of your training?

@anxiangsir yes, sure.
Convergence problems appear when I use default settings:
config.dataset = "emore"
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = False
config.momentum = 0.9
config.weight_decay = 5e-4
config.batch_size = 112
config.lr = 0.1

The loss decreases to values of 37.0-38.0 and gets stuck at that level.

@mnikitin
I have the same problem.
Did you fix it?

Thanks
meixitu

@meixitu17
No, I didn't. I still have convergence issues with sample_rate = 1.0.

Have you pulled the latest version of partialfc pytorch?

I have already released the PyTorch training log.

https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#pretrain-models

@mnikitin @meixitu17 Same problems here!! Have you guys already solved this problem?

I reduced the learning rate to 0.5, and now the training process works fine, but I don't know whether this will affect convergence.

Training breaks again...

@MccreeZhao @meixitu17
If you still have problems, they may be caused by a custom neural network architecture.

Note that in the original implementation:

  • the data loader provides images normalized to [-1, 1]
  • layers are initialized inside the __init__ method of the IResNet class, not in the training code
  • the IResNet network ends with batch normalization

So, if you preprocess images inside your net's forward method, use the default initialization, or do not use BN as the output layer of the feature extractor, you may face convergence issues.
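As a quick sanity check for the first point, here is a minimal plain-Python sketch of mapping 8-bit pixel values to [-1, 1]. It assumes the common mean=0.5, std=0.5 scheme applied to values scaled to [0, 1]; the exact constants used by the original data loader are not stated in this thread, and both function names are hypothetical.

```python
def normalize_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1].

    Assumption: scale to [0, 1], then apply mean=0.5 and std=0.5.
    """
    return (value / 255.0 - 0.5) / 0.5


def normalize_image(pixels):
    """Normalize a 2-D list of 8-bit pixel values element by element."""
    return [[normalize_pixel(v) for v in row] for row in pixels]


image = [[0, 255], [255, 0]]
normalized = normalize_image(image)

# Every value should fall inside [-1, 1] before it reaches the network.
in_range = all(-1.0 <= v <= 1.0 for row in normalized for v in row)
```

If this step runs in the data loader, the network's forward method should not normalize the input a second time.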
