Hi there.
Currently, SGD (Stochastic Gradient Descent) with momentum in MXNet is implemented as:
rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state
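As a minimal sketch of the current update rule above (plain NumPy, with hypothetical function and variable names, not the actual MXNet implementation):

```python
import numpy as np

def sgd_momentum_current(weight, grad, state, lr, momentum,
                         wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """Current MXNet-style momentum SGD: lr is folded into the state."""
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    # The learning rate multiplies the gradient *before* it enters the state,
    # so the state buffer is at lr-scaled (usually tiny) magnitude.
    rescaled_grad = lr * (rescale_grad * grad + wd * weight)
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state

w, s = sgd_momentum_current(np.array([1.0]), np.array([0.5]),
                            np.zeros(1), lr=0.1, momentum=0.9)
```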
I found two problems with this SGD implementation:

1. `state` stores the gradients multiplied by the learning rate. However, the learning rate is usually a small value such as 1e-3, so the state becomes much smaller than the gradient, which may lose accuracy.
2. `state` stores gradients multiplied by the old learning rate, so it is stale whenever the learning rate changes. That is wrong.

Solution:
We should update the implementation of SGD with momentum, while keeping compatibility with old optimizer states in mind:
rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - lr * state
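The proposed variant keeps the state at gradient scale and applies the learning rate only at the weight update. Again a NumPy sketch with hypothetical names, not a drop-in MXNet patch:

```python
import numpy as np

def sgd_momentum_proposed(weight, grad, state, lr, momentum,
                          wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """Proposed momentum SGD: state accumulates unscaled gradients."""
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    # No lr here: the state stays at gradient magnitude regardless of lr,
    # and a new lr applies to the whole accumulated momentum.
    rescaled_grad = rescale_grad * grad + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - lr * state
    return weight, state

w, s = sgd_momentum_proposed(np.array([1.0]), np.array([0.5]),
                             np.zeros(1), lr=0.1, momentum=0.9)
```

Note that with a constant learning rate the first step is identical to the current scheme; the two variants diverge only once the learning rate changes.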
An additional point: both TensorFlow and PyTorch use the proposed SGD logic, unlike MXNet.
I am looking into making this change; I'm not sure whether we need to be concerned about backward compatibility for 2.0.
FYI @szhengac since you did the legwork for refactoring optimizers for 2.0; @zhreshold and @eric-haibin-lin for GluonCV and GluonNLP visibility.
I also agree that compatibility for such a change does not need to be enforced in the upcoming 2.0; it's extremely rare to save/load optimizer states across different MXNet versions.
Putting lr inside the momentum buffer is how momentum SGD and NAG were originally formulated, but putting it outside may be more robust for large-batch training during the warmup stage.
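To illustrate the difference when the learning rate changes mid-training (a toy two-step simulation with made-up numbers, not a claim about either behavior being correct):

```python
# Constant gradient g, momentum mu; lr jumps 10x on step 2 (warmup-like).
g, mu = 1.0, 0.9
lrs = [0.001, 0.01]

# lr inside the state (current MXNet): the old lr stays baked into
# the momentum buffer, so the new lr only affects fresh gradients.
s_in = 0.0
for lr in lrs:
    s_in = mu * s_in + lr * g
    step_in = s_in          # weight change on this step

# lr outside the state (proposed): the new lr rescales the entire
# accumulated momentum immediately.
s_out = 0.0
for lr in lrs:
    s_out = mu * s_out + g
    step_out = lr * s_out   # weight change on this step

print(step_in)   # 0.9 * 0.001 + 0.01  = 0.0109
print(step_out)  # 0.01 * (0.9 + 1.0) = 0.019
```

In this sketch the proposed scheme takes a proportionally larger step right after the lr increase, since the whole momentum history is rescaled by the new learning rate.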