Insightface: Pytorch implementation. rescale parameter in sgd

Created on 20 Oct 2020 · 29 Comments · Source: deepinsight/insightface

Could you please explain the meaning of the rescale parameter in your sgd implementation?
https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/sgd.py#L42

You set it to the world_size in the training here https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_fc.py#L82
and it seems like it affects gradients in both the backbone and the head.

bug partial_fc

golunovas

All 29 comments

@xiaoyang-coder

anxiangsir on 21 Oct 2020

The sum of all gradients should be divided by world_size*batch_size; we do this with two divisions:
https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L135
https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_fc.py#L82

xiaoyang-coder on 21 Oct 2020

@xiaoyang-coder I believe that here https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L135 the gradient is already divided by world_size * batch_size, because the "logits" tensor has the same size along dimension 0 as the "total_features" tensor https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L101, and that size is world_size * batch_size.
Moreover, I compared the gradients with a naive implementation, and it seems that for the backbone parameters we even need to multiply the gradients by world_size, most likely because of the gradient averaging that torch.nn.parallel.DistributedDataParallel does under the hood https://github.com/deepinsight/insightface/blob/18d67be1eb376738964c2cabfa87590ae496c15e/recognition/partial_fc/pytorch/partial_fc.py#L63

golunovas on 21 Oct 2020
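
The scaling argument in the comment above can be checked with a small single-process simulation. This is only a sketch under stated assumptions: a toy one-parameter "backbone" (feature = w * x), a squared-error loss averaged over the global batch, and hypothetical helper names (`true_backbone_grad`, `ddp_backbone_grad`). It does not use the repository's code. It only mimics the fact that DistributedDataParallel averages parameter gradients over ranks.

```python
import math

def true_backbone_grad(w, xs, ys):
    """Gradient of the global-mean loss 0.5 * (w*x - y)**2 with respect to w."""
    n = len(xs)  # n plays the role of world_size * batch_size
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

def ddp_backbone_grad(w, xs, ys, world_size):
    """Backbone gradient after DDP-style averaging across simulated ranks."""
    n = len(xs)
    batch_size = n // world_size
    # Gradient w.r.t. each feature, already divided by world_size * batch_size.
    x_grad = [(w * x - y) / n for x, y in zip(xs, ys)]
    local_grads = []
    for rank in range(world_size):
        shard = slice(rank * batch_size, (rank + 1) * batch_size)
        # Each rank backpropagates only through its own slice of the batch.
        local_grads.append(sum(g * x for g, x in zip(x_grad[shard], xs[shard])))
    # DDP averages (not sums) the per-rank parameter gradients.
    return sum(local_grads) / world_size

world_size = 4
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, 0.0, -1.0, 2.0, 0.5, 1.0, -1.5, 0.0]
w = 0.3

exact = true_backbone_grad(w, xs, ys)
averaged = ddp_backbone_grad(w, xs, ys, world_size)
ratio = exact / averaged  # close to world_size, up to floating-point rounding
print(ratio)
```

In this toy setup the averaged gradient is smaller than the exact one by a factor of world_size, which is consistent with the observation that the backbone gradients need to be multiplied by world_size.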

Thank you, you are right, let's test again

xiaoyang-coder on 21 Oct 2020

@golunovas

xiaoyang-coder on 21 Oct 2020

@xiaoyang-coder your fix solves the issue with the gradients in the head, but the backbone gradients are still a factor of world_size smaller than they should be.
I see several options for fixing it:

1. Use separate optimizers with different rescale parameters for the backbone (rescale=1) and the head (rescale=2).
2. Keep one optimizer with rescale=2 and multiply x_grad by world_size after the all-reduce gradient computation https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_fc.py#L143
3. Use one optimizer with rescale=1, divide grad by world_size*batch_size (as it used to be: "grad.div_(grad.size()[0])"), and multiply x_grad by world_size after the all-reduce gradient computation.

All of these options seem to give exactly the same results. I prefer the third option because it lets us use PyTorch's standard SGD and has minimal computational overhead. What do you think?

golunovas on 21 Oct 2020
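
The third option in the comment above can be sketched on a toy problem. Everything here is an assumption made for illustration: a scalar backbone weight `w`, a scalar head weight `v`, a squared-error loss in place of softmax, and the hypothetical helpers `option3_grads` and `numeric_grad`. The real head is sharded across ranks by class, which this sketch does not model. It only checks that dividing by world_size*batch_size and multiplying x_grad by world_size before DDP-style averaging reproduces the gradients of the global-mean loss.

```python
import math

def global_loss(w, v, xs, ys):
    """Mean over the global batch of 0.5 * (v * (w*x) - y)**2."""
    n = len(xs)
    return sum(0.5 * (v * w * x - y) ** 2 for x, y in zip(xs, ys)) / n

def option3_grads(w, v, xs, ys, world_size):
    n = len(xs)                      # world_size * batch_size
    batch_size = n // world_size
    feats = [w * x for x in xs]      # stands in for the all-gathered features
    errs = [v * f - y for f, y in zip(feats, ys)]
    # Head gradient: divided by the global batch size, no extra rescale.
    head_grad = sum(e * f for e, f in zip(errs, feats)) / n
    # Feature gradient: divided by n, then multiplied by world_size.
    x_grad = [world_size * e * v / n for e in errs]
    local = []
    for rank in range(world_size):
        shard = slice(rank * batch_size, (rank + 1) * batch_size)
        local.append(sum(g * x for g, x in zip(x_grad[shard], xs[shard])))
    # DDP-style averaging over ranks cancels the world_size factor above.
    backbone_grad = sum(local) / world_size
    return backbone_grad, head_grad

def numeric_grad(f, value, eps=1e-6):
    """Central finite difference, used as an independent reference."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)

xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, 0.0, -1.0, 2.0, 0.5, 1.0, -1.5, 0.0]
w, v, world_size = 0.3, 0.7, 4

backbone_grad, head_grad = option3_grads(w, v, xs, ys, world_size)
ref_backbone = numeric_grad(lambda t: global_loss(t, v, xs, ys), w)
ref_head = numeric_grad(lambda t: global_loss(w, t, xs, ys), v)
```

Because both gradients already match the global-mean loss in this sketch, a plain SGD step (parameter minus learning rate times gradient) would need no rescale factor.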

Okay, we will test these methods. @golunovas

xiaoyang-coder on 21 Oct 2020

@xiaoyang-coder did you have a chance to take a look at that ^?

golunovas on 5 Nov 2020

@golunovas We have tested the third method you mentioned, but it still does not reach the accuracy of the mxnet version, and we still need to find other bugs

xiaoyang-coder on 6 Nov 2020

Did you notice any improvements in accuracy at least?

golunovas on 6 Nov 2020

I didn't notice any improvement; maybe it is too small. @golunovas

xiaoyang-coder on 6 Nov 2020

Can I have your QQ account number? @golunovas

xiaoyang-coder on 6 Nov 2020

@xiaoyang-coder unfortunately, I don't have an account there. Does email work? or maybe insightface's slack?

golunovas on 6 Nov 2020

[email protected]

xiaoyang-coder on 6 Nov 2020

We have fixed it

xiaoyang-coder on 11 Nov 2020

@xiaoyang-coder great, did you manage to reach the same accuracy with the sample_rate=0.1 ?

golunovas on 11 Nov 2020

Yes it has been reached @golunovas

xiaoyang-coder on 11 Nov 2020

@xiaoyang-coder great. thank you for the effort.

golunovas on 11 Nov 2020

Hi! It looks like there are still some bugs in pytorch code.

In my experiments, training with sampling_ratio = 0.1 converges successfully, but it fails when I use all negative classes for computing softmax (sampling_ratio = 1.0).
The loss decreases slightly at the very beginning, but very soon it gets stuck and keeps fluctuating at relatively high values.

mnikitin on 14 Nov 2020

Can you show the settings of your training?

anxiangsir on 15 Nov 2020

@anxiangsir yes, sure.
Convergence problems appear when I use default settings:
config.dataset = "emore"
config.embedding_size = 512
config.sample_rate = 1.0
config.fp16 = False
config.momentum = 0.9
config.weight_decay = 5e-4
config.batch_size = 112
config.lr = 0.1

The loss decreases to values of 37.0-38.0 and gets stuck at that level.

mnikitin on 15 Nov 2020

@mnikitin
I have the same problem.
Did you fix it?

Thanks
meixitu

meixitu17 on 31 Dec 2020

@meixitu17
No, I didn't. Still have convergence issues with sample_rate = 1.0

mnikitin on 31 Dec 2020

Have you pulled the latest version of partialfc pytorch?

anxiangsir on 31 Dec 2020

I have already released the PyTorch training log.

https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#pretrain-models

anxiangsir on 31 Dec 2020

@mnikitin @meixitu17 Same problems here!! Have you guys already solved this problem?

MccreeZhao on 8 Jan 2021

I reduced the learning rate to 0.5 and now the training process works fine. But I don't know whether it will affect convergence.

MccreeZhao on 8 Jan 2021

Training breaks again...

MccreeZhao on 9 Jan 2021

@MccreeZhao @meixitu17
If you still have problems, they may be caused by a custom neural net architecture.

Note that in the original implementation:

- the data loader provides images normalized to [-1, 1];
- layers are initialized inside the __init__ method of the IResNet class, not in the training code;
- the IResNet network ends with batch normalization.

So, if you preprocess images inside your net's forward method, use the default initialization, or do not use BN as the output layer of the feature extractor, you may face convergence issues.

mnikitin on 15 Feb 2021
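
The first item in the checklist above (inputs normalized to [-1, 1]) is easy to verify before training. The following is a minimal sketch, assuming 8-bit pixel values and the common (p / 255 - 0.5) / 0.5 mapping. The exact formula used by the original data loader is not stated in this thread, and the helper names `normalize_pixels` and `looks_normalized` are hypothetical.

```python
def normalize_pixels(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1] (assumed formula)."""
    return [(p / 255.0 - 0.5) / 0.5 for p in pixels]

def looks_normalized(values, low=-1.0, high=1.0):
    """Heuristic check for a non-empty list: all values lie in [low, high]
    and at least one is negative, which also rules out data scaled to [0, 1]."""
    return all(low <= v <= high for v in values) and min(values) < 0.0

raw = [0, 64, 128, 255]            # example 8-bit pixel values
normalized = normalize_pixels(raw)

raw_ok = looks_normalized(raw)             # raw pixels fall outside [-1, 1]
normalized_ok = looks_normalized(normalized)
```

Running a check like this on one batch from a custom data loader is a cheap way to rule out the preprocessing mismatch before looking at initialization or the final batch normalization layer.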
