Hi there.
Currently, SGD (Stochastic Gradient Descent) with momentum in MXNet is implemented as:
rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state
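As a minimal sketch of the current update rule above (plain NumPy, with hypothetical function and variable names, not the actual MXNet implementation):

```python
import numpy as np

def sgd_momentum_current(weight, grad, state, lr, momentum,
                         wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """Current MXNet-style momentum SGD: lr is folded into the state."""
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    # The learning rate multiplies the gradient *before* it enters the state,
    # so the state buffer is at lr-scaled (usually tiny) magnitude.
    rescaled_grad = lr * (rescale_grad * grad + wd * weight)
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state

w, s = sgd_momentum_current(np.array([1.0]), np.array([0.5]),
                            np.zeros(1), lr=0.1, momentum=0.9)
```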
I found two problems with this SGD implementation:

1. `state` stores the gradients multiplied by the learning rate. However, the learning rate is usually a small value such as 1e-3, so the state becomes much smaller than the gradient, which may lose accuracy.
2. `state` stores gradients multiplied by the old learning rate, so it is stale whenever the learning rate changes. That is wrong.

Solution:
We should update the implementation of SGD with momentum, while keeping compatibility with old optimizer states in mind:
rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - lr * state
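The proposed variant keeps the state at gradient scale and applies the learning rate only at the weight update. Again a NumPy sketch with hypothetical names, not a drop-in MXNet patch:

```python
import numpy as np

def sgd_momentum_proposed(weight, grad, state, lr, momentum,
                          wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """Proposed momentum SGD: state accumulates unscaled gradients."""
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    # No lr here: the state stays at gradient magnitude regardless of lr,
    # and a new lr applies to the whole accumulated momentum.
    rescaled_grad = rescale_grad * grad + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - lr * state
    return weight, state

w, s = sgd_momentum_proposed(np.array([1.0]), np.array([0.5]),
                             np.zeros(1), lr=0.1, momentum=0.9)
```

Note that with a constant learning rate the first step is identical to the current scheme; the two variants diverge only once the learning rate changes.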
An additional point: both TensorFlow and PyTorch use the proposed SGD logic, unlike MXNet.
I am looking into making this change; I'm not sure whether we need to be concerned about backward compatibility for 2.0.
FYI @szhengac since you did the legwork for refactoring optimizers for 2.0; @zhreshold and @eric-haibin-lin for GluonCV and GluonNLP visibility.
I also agree that compatibility for such a change does not need to be enforced in the upcoming 2.0; it's extremely rare to save/load optimizer states across different MXNet versions.
Putting lr inside the momentum buffer is how momentum SGD and NAG were originally formulated, but putting it outside may be more robust for large-batch training during the warmup stage.
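To illustrate the difference when the learning rate changes mid-training (a toy two-step simulation with made-up numbers, not a claim about either behavior being correct):

```python
# Constant gradient g, momentum mu; lr jumps 10x on step 2 (warmup-like).
g, mu = 1.0, 0.9
lrs = [0.001, 0.01]

# lr inside the state (current MXNet): the old lr stays baked into
# the momentum buffer, so the new lr only affects fresh gradients.
s_in = 0.0
for lr in lrs:
    s_in = mu * s_in + lr * g
    step_in = s_in          # weight change on this step

# lr outside the state (proposed): the new lr rescales the entire
# accumulated momentum immediately.
s_out = 0.0
for lr in lrs:
    s_out = mu * s_out + g
    step_out = lr * s_out   # weight change on this step

print(step_in)   # 0.9 * 0.001 + 0.01  = 0.0109
print(step_out)  # 0.01 * (0.9 + 1.0) = 0.019
```

In this sketch the proposed scheme takes a proportionally larger step right after the lr increase, since the whole momentum history is rescaled by the new learning rate.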