Neither custom nor pre-built learning rate schedulers have any effect on the training process in the R package: training proceeds identically regardless of the scheduling scheme used.
Some debugging shows that the lr_scheduler function is called and the LR is correctly recalculated; however, the updated LR is never actually applied.
I have a strong suspicion that the issue is a side-effect of commit be478700e01944ecfee0c30a7cc5dc07d1b2789a (#11374)
in R-package/R/optimizer.R.
Before that change, the LR was used directly in the update calculations. Now the LR is baked into the executor and is never actually updated afterwards.
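A minimal R-only sketch of the difference, purely to illustrate the mechanism (my own illustration, not the actual optimizer.R code):

# Pre-PR behaviour: lr is an ordinary variable read on every update, so a
# scheduler that changes it takes effect immediately.
update_direct <- function(weight, grad, lr) weight - lr * grad

# Post-PR behaviour (conceptually): lr is captured once when the updater /
# executor is built, so later changes to the outer variable are ignored.
make_baked_in_updater <- function(lr) {
  force(lr)                          # value frozen here, like in the bound executor
  function(weight, grad) weight - lr * grad
}

lr <- 2e-6
upd <- make_baked_in_updater(lr)
lr <- 2e-7                           # the scheduler "changes" lr ...
upd(1, 1)                            # ... but the updater still applies 2e-6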
I hope the reproducible example below is of some help; it is adapted from the regression tutorial.
library(mxnet)

# Boston Housing regression example, adapted from the tutorial
data(BostonHousing, package = "mlbench")
train.ind <- seq(1, 506, 3)
train.x <- data.matrix(BostonHousing[train.ind, -14])
train.y <- BostonHousing[train.ind, 14]

# Single fully-connected layer performing linear regression
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden = 1)
lro <- mx.symbol.LinearRegressionOutput(fc1)

cat("Without scheduler\n")
mx.set.seed(0)
model1 <- mx.model.FeedForward.create(lro, X = train.x, y = train.y,
    ctx = mx.cpu(), num.round = 20, array.batch.size = 20,
    learning.rate = 2e-6, momentum = 0.9, eval.metric = mx.metric.rmse)

cat("\nWith scheduler\n")
# Multiply the LR by 0.1 every 45 updates (5 epochs of 9 mini-batches)
lr_scheduler <- mx.lr_scheduler.FactorScheduler(
    step = 5 * ceiling(length(train.y) / 20), factor = 0.1,
    stop_factor_lr = 1e-10)
mx.set.seed(0)
model2 <- mx.model.FeedForward.create(lro, X = train.x, y = train.y,
    ctx = mx.cpu(), num.round = 20, array.batch.size = 20,
    learning.rate = 2e-6, momentum = 0.9, eval.metric = mx.metric.rmse,
    lr_scheduler = lr_scheduler)
Without scheduler
Start training with 1 devices
[1] Train-rmse=18.0516391330295
[2] Train-rmse=13.9097522099813
[3] Train-rmse=10.666042221917
[4] Train-rmse=10.0117386711968
[5] Train-rmse=9.59162097507053
[6] Train-rmse=9.80227173699273
[7] Train-rmse=9.56405830383301
[8] Train-rmse=9.39033126831055
[9] Train-rmse=9.33245415157742
[10] Train-rmse=9.31073543760512
[11] Train-rmse=9.27919896443685
[12] Train-rmse=9.24656009674072
[13] Train-rmse=9.206680615743
[14] Train-rmse=9.17186906602648
[15] Train-rmse=9.14681609471639
[16] Train-rmse=9.12289328045315
[17] Train-rmse=9.09742567274306
[18] Train-rmse=9.0733421113756
[19] Train-rmse=9.05105861028036
[20] Train-rmse=9.02993933359782
With scheduler
Start training with 1 devices
[1] Train-rmse=18.0516391330295
[2] Train-rmse=13.9097522099813
[3] Train-rmse=10.666042221917
[4] Train-rmse=10.0117386711968
[5] Train-rmse=9.59162097507053
Update[46]: learning rate is changed to 2e-07
[6] Train-rmse=9.80227173699273
[7] Train-rmse=9.56405830383301
[8] Train-rmse=9.39033126831055
[9] Train-rmse=9.33245415157742
[10] Train-rmse=9.31073543760512
Update[91]: learning rate is changed to 2e-08
[11] Train-rmse=9.27919896443685
[12] Train-rmse=9.24656009674072
[13] Train-rmse=9.206680615743
[14] Train-rmse=9.17186906602648
[15] Train-rmse=9.14681609471639
Update[136]: learning rate is changed to 2e-09
[16] Train-rmse=9.12289328045315
[17] Train-rmse=9.09742567274306
[18] Train-rmse=9.0733421113756
[19] Train-rmse=9.05105861028036
[20] Train-rmse=9.02993933359782
Warning messages:
1: In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
2: In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
Train-rmse is the same in every iteration regardless of the scheduler used.
I stumbled over the same issue and tested the example provided by @onomatet. I can also confirm that using the lr scheduler does not result in the learning rate actually changing. @jeremiedb @hetong007 Any explanation/workaround for this?
Well, it looks like the optimizer implementation doesn't take the learning rate from the scheduler into account: https://github.com/apache/incubator-mxnet/blob/master/R-package/R/optimizer.R#L40
A quick fix would be to check for the existence of the scheduler before lr <- learning.rate; that is how it worked before the PR.
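Roughly something like the snippet below (an untested sketch, not an actual patch; sgd here stands for the optimizer state environment discussed further down in this thread):

learning.rate <- 2e-6
lr_scheduler  <- NULL                      # or a scheduler function
sgd <- new.env(); sgd$lr <- learning.rate  # the scheduler keeps sgd$lr current

# Only fall back to the fixed learning.rate when no scheduler is supplied
lr <- if (is.null(lr_scheduler)) learning.rate else sgd$lr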
@hetong007 Thank you for your reply!
Honestly, I don't see how that would help. The lr schedulers are called properly on every update here
https://github.com/apache/incubator-mxnet/blob/25505e9da4ea24ce37f1e60916d1afc3fcd15300/R-package/R/optimizer.R#L86
and they do modify the lr parameter in the sgd namespace.
From my understanding, however, that doesn't matter, since lr is a property of the mx.symbol.sgd_mom_update symbol which cannot be changed after the corresponding executor exec is created in
https://github.com/apache/incubator-mxnet/blob/25505e9da4ea24ce37f1e60916d1afc3fcd15300/R-package/R/optimizer.R#L79
(or can it?)
@onomatet I think you're correct. I'm not yet sure whether the learning rate property can be mutated on the symbol; otherwise, I guess a reinitialization of the optimizer graphs would be needed. I'll take a closer look by tomorrow.
At the moment, I can't think of a quick workaround for fixing this LR update issue.
Adding a condition to rebuild the weight executors seems like a viable route, at least from an initial PoC on SGD. One concern is whether rebuilding the execs might cause some memory overhead on large models. That remains to be validated, but if it's safe to assume the learning rate would typically be updated only a limited number of times during training, it should be fine.
I expect to be able to submit a proposal by next week.
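To make the idea concrete, a runnable toy version of the conditional rebuild could look like this (my own illustration, not the upcoming PR):

# The updater below stands in for the bound weight executor; it is rebuilt
# only when the scheduled lr actually differs from the one it was built with.
make_updater <- function(lr) {
  force(lr)
  list(lr = lr, update = function(weight, grad) weight - lr * grad)
}

sgd <- new.env()
sgd$lr <- 2e-6
upd <- make_updater(sgd$lr)

apply_update <- function(weight, grad) {
  if (sgd$lr != upd$lr) upd <<- make_updater(sgd$lr)  # rebuild only on change
  upd$update(weight, grad)
}

sgd$lr <- 2e-7        # the scheduler lowers the LR ...
apply_update(1, 1)    # ... so the updater is rebuilt once and applies 2e-7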
@jeremiedb Thank you for your answers! I will give your idea a try as a temporary solution.
My concern, however, is that rebuilding the executors might be fine for simple lr schedulers like mx.lr_scheduler.FactorScheduler, but it could become a problem for something like a cosine annealing scheduler, which requires quite frequent LR updates; see the sketch below.
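For concreteness, a custom cosine-annealing scheduler would look roughly like this. It follows the scheduler interface described in this thread (a function that receives the optimizer environment and rewrites its $lr); the $num_update field name is my assumption from reading optimizer.R, and the function name is made up:

cosine_annealing_scheduler <- function(base_lr, min_lr, max_update) {
  function(optimizerEnv) {
    t <- min(optimizerEnv$num_update, max_update)
    optimizerEnv$lr <- min_lr +
      0.5 * (base_lr - min_lr) * (1 + cos(pi * t / max_update))
  }
}

# Unlike FactorScheduler, this changes the LR on essentially every mini-batch,
# so an executor rebuild per LR change would happen on every single update.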
Found a temporary solution which works for me and doesn't require reinitialization of the executors: leave the executor's baked-in LR alone, let the scheduler keep updating lr in the sgd namespace as before, and rescale the gradient by lr / lr_orig (the current scheduled LR divided by the original one) prior to feeding it to the executor. It might not be exactly the same as setting the LR in the executors themselves, but it essentially works as expected.
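For reference, here is that idea as a runnable toy (my own reconstruction, not the exact patch; the real change would live inside the update function in optimizer.R):

# The "executor" below stands in for the bound sgd update executor, which keeps
# applying the lr_orig it was built with; the scheduler-updated lr only enters
# through a rescaling of the gradient, since lr_orig * (lr / lr_orig) == lr.
lr_orig <- 2e-6
baked_in_update <- function(weight, grad) weight - lr_orig * grad

sgd <- new.env()
sgd$lr <- lr_orig                      # kept current by the lr scheduler

update_with_workaround <- function(weight, grad) {
  grad <- grad * (sgd$lr / lr_orig)    # pre-scale the gradient
  baked_in_update(weight, grad)        # executor still applies lr_orig
}

sgd$lr <- 2e-7                         # scheduler lowers the LR ...
update_with_workaround(1, 1)           # ... effective step is now 2e-7 * grad

# Note: in the real sgd_mom_update the weight-decay term would still see
# lr_orig, which is presumably why this is "not exactly the same".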
@onomatet Glad that you found a viable workaround.
I ran a test with a modified SGD and updater where the execs get rebuilt. On a CNN model on MNIST where the weight executors were rebuilt 22 times during an epoch, training time went from 3.1 sec to 3.6 sec, so roughly 0.025 sec per learning rate update.
For sparse learning rate updates that appears reasonable, but in scenarios where the LR is continuously adapted, I doubt it's desirable.
I'd be curious to get some feedback on whether it would be worth reworking the optimizers to integrate such a mechanism.
@jeremiedb Well, it is not an optimal solution, of course, but I believe it is better than nothing at all. In the worst case, one can reduce the frequency of LR updates to keep training speed reasonable.
It is probably worth applying a quick fix for this bug now and leaving a TODO ticket for a deeper rework of the mx.symbols, should somebody decide to clean it up later.