Neither custom nor pre-built learning rate schedulers have any effect on the training process in the R package: training proceeds identically regardless of the scheduling scheme used.
Some debugging shows that the lr_scheduler function is called and the LR is correctly recalculated; however, the updated LR is never actually applied.
I have a strong suspicion that the issue is a side-effect of commit be478700e01944ecfee0c30a7cc5dc07d1b2789a (#11374)
in R-package/R/optimizer.R.
Before that change, the LR was used directly in the update calculations. Now the LR is baked into the executor and is never actually updated afterwards.
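A minimal R-only sketch of the difference, purely to illustrate the mechanism (my own illustration, not the actual optimizer.R code):

# Pre-PR behaviour: lr is an ordinary variable read on every update, so a
# scheduler that changes it takes effect immediately.
update_direct <- function(weight, grad, lr) weight - lr * grad

# Post-PR behaviour (conceptually): lr is captured once when the updater /
# executor is built, so later changes to the outer variable are ignored.
make_baked_in_updater <- function(lr) {
  force(lr)                          # value frozen here, like in the bound executor
  function(weight, grad) weight - lr * grad
}

lr <- 2e-6
upd <- make_baked_in_updater(lr)
lr <- 2e-7                           # the scheduler "changes" lr ...
upd(1, 1)                            # ... but the updater still applies 2e-6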
I hope the reproducible example below is of some help; it is adapted from the regression tutorial.
library(mxnet)

# Boston Housing regression example, adapted from the tutorial
data(BostonHousing, package = "mlbench")
train.ind <- seq(1, 506, 3)
train.x <- data.matrix(BostonHousing[train.ind, -14])
train.y <- BostonHousing[train.ind, 14]

# Single fully-connected layer performing linear regression
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden = 1)
lro <- mx.symbol.LinearRegressionOutput(fc1)

cat("Without scheduler\n")
mx.set.seed(0)
model1 <- mx.model.FeedForward.create(lro, X = train.x, y = train.y,
    ctx = mx.cpu(), num.round = 20, array.batch.size = 20,
    learning.rate = 2e-6, momentum = 0.9, eval.metric = mx.metric.rmse)

cat("\nWith scheduler\n")
# Multiply the LR by 0.1 every 45 updates (5 epochs of 9 mini-batches)
lr_scheduler <- mx.lr_scheduler.FactorScheduler(
    step = 5 * ceiling(length(train.y) / 20), factor = 0.1,
    stop_factor_lr = 1e-10)
mx.set.seed(0)
model2 <- mx.model.FeedForward.create(lro, X = train.x, y = train.y,
    ctx = mx.cpu(), num.round = 20, array.batch.size = 20,
    learning.rate = 2e-6, momentum = 0.9, eval.metric = mx.metric.rmse,
    lr_scheduler = lr_scheduler)
Without scheduler
Start training with 1 devices
[1] Train-rmse=18.0516391330295
[2] Train-rmse=13.9097522099813
[3] Train-rmse=10.666042221917
[4] Train-rmse=10.0117386711968
[5] Train-rmse=9.59162097507053
[6] Train-rmse=9.80227173699273
[7] Train-rmse=9.56405830383301
[8] Train-rmse=9.39033126831055
[9] Train-rmse=9.33245415157742
[10] Train-rmse=9.31073543760512
[11] Train-rmse=9.27919896443685
[12] Train-rmse=9.24656009674072
[13] Train-rmse=9.206680615743
[14] Train-rmse=9.17186906602648
[15] Train-rmse=9.14681609471639
[16] Train-rmse=9.12289328045315
[17] Train-rmse=9.09742567274306
[18] Train-rmse=9.0733421113756
[19] Train-rmse=9.05105861028036
[20] Train-rmse=9.02993933359782
With scheduler
Start training with 1 devices
[1] Train-rmse=18.0516391330295
[2] Train-rmse=13.9097522099813
[3] Train-rmse=10.666042221917
[4] Train-rmse=10.0117386711968
[5] Train-rmse=9.59162097507053
Update[46]: learning rate is changed to 2e-07
[6] Train-rmse=9.80227173699273
[7] Train-rmse=9.56405830383301
[8] Train-rmse=9.39033126831055
[9] Train-rmse=9.33245415157742
[10] Train-rmse=9.31073543760512
Update[91]: learning rate is changed to 2e-08
[11] Train-rmse=9.27919896443685
[12] Train-rmse=9.24656009674072
[13] Train-rmse=9.206680615743
[14] Train-rmse=9.17186906602648
[15] Train-rmse=9.14681609471639
Update[136]: learning rate is changed to 2e-09
[16] Train-rmse=9.12289328045315
[17] Train-rmse=9.09742567274306
[18] Train-rmse=9.0733421113756
[19] Train-rmse=9.05105861028036
[20] Train-rmse=9.02993933359782
Warning messages:
1: In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
2: In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
Train-rmse is the same in every iteration regardless of the scheduler used.
I stumbled over the same issue and tested the example provided by @onomatet. I can also confirm that using the lr scheduler does not result in the learning rate actually changing. @jeremiedb @hetong007 Any explanation/workaround for this?
Well, it looks like the optimizer implementation doesn't take the learning rate from the scheduler into account: https://github.com/apache/incubator-mxnet/blob/master/R-package/R/optimizer.R#L40
A quick fix would be to check for the existence of the scheduler before lr <- learning.rate; that is how it worked before the PR.
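Roughly something like the snippet below (an untested sketch, not an actual patch; sgd here stands for the optimizer state environment discussed further down in this thread):

learning.rate <- 2e-6
lr_scheduler  <- NULL                      # or a scheduler function
sgd <- new.env(); sgd$lr <- learning.rate  # the scheduler keeps sgd$lr current

# Only fall back to the fixed learning.rate when no scheduler is supplied
lr <- if (is.null(lr_scheduler)) learning.rate else sgd$lr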
@hetong007 Thank you for your reply!
Honestly, I don't see how that would help. The lr schedulers are called properly on every update here
https://github.com/apache/incubator-mxnet/blob/25505e9da4ea24ce37f1e60916d1afc3fcd15300/R-package/R/optimizer.R#L86
and they do modify the lr parameter in the sgd namespace.
From my understanding, however, that doesn't matter, since lr is a property of the mx.symbol.sgd_mom_update symbol which cannot be changed after the corresponding executor exec is created in
https://github.com/apache/incubator-mxnet/blob/25505e9da4ea24ce37f1e60916d1afc3fcd15300/R-package/R/optimizer.R#L79
(or can it?)
@onomatet I think you're correct. I'm not yet sure whether the learning rate property can be mutated on the symbol; otherwise, I guess a reinitialization of the optimizer graphs would be needed. I'll take a closer look by tomorrow.
At the moment, I can't think of a quick workaround for fixing this LR update issue.
Adding a condition to rebuild the weight executors seems like a viable route, at least from an initial PoC on SGD. One concern is whether rebuilding the execs might cause some memory overhead on large models. That remains to be validated, but if it's safe to assume the learning rate would typically be updated only a limited number of times during training, it should be fine.
I expect to be able to submit a proposal by next week.
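To make the idea concrete, a runnable toy version of the conditional rebuild could look like this (my own illustration, not the upcoming PR):

# The updater below stands in for the bound weight executor; it is rebuilt
# only when the scheduled lr actually differs from the one it was built with.
make_updater <- function(lr) {
  force(lr)
  list(lr = lr, update = function(weight, grad) weight - lr * grad)
}

sgd <- new.env()
sgd$lr <- 2e-6
upd <- make_updater(sgd$lr)

apply_update <- function(weight, grad) {
  if (sgd$lr != upd$lr) upd <<- make_updater(sgd$lr)  # rebuild only on change
  upd$update(weight, grad)
}

sgd$lr <- 2e-7        # the scheduler lowers the LR ...
apply_update(1, 1)    # ... so the updater is rebuilt once and applies 2e-7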
@jeremiedb Thank you for your answers! I will give your idea a try as a temporary solution.
My concern, however, is that rebuilding the executors might be fine for simple lr schedulers like mx.lr_scheduler.FactorScheduler, but it could become a problem for something like a cosine annealing scheduler, which requires quite frequent LR updates; see the sketch below.
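For concreteness, a custom cosine-annealing scheduler would look roughly like this. It follows the scheduler interface described in this thread (a function that receives the optimizer environment and rewrites its $lr); the $num_update field name is my assumption from reading optimizer.R, and the function name is made up:

cosine_annealing_scheduler <- function(base_lr, min_lr, max_update) {
  function(optimizerEnv) {
    t <- min(optimizerEnv$num_update, max_update)
    optimizerEnv$lr <- min_lr +
      0.5 * (base_lr - min_lr) * (1 + cos(pi * t / max_update))
  }
}

# Unlike FactorScheduler, this changes the LR on essentially every mini-batch,
# so an executor rebuild per LR change would happen on every single update.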
Found a temporary solution which works for me and doesn't require reinitialization of the executors: leave the executor's baked-in LR alone, let the scheduler keep updating lr in the sgd namespace as before, and rescale the gradient by lr / lr_orig (the current scheduled LR divided by the original one) prior to feeding it to the executor. It might not be exactly the same as setting the LR in the executors themselves, but it essentially works as expected.
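For reference, here is that idea as a runnable toy (my own reconstruction, not the exact patch; the real change would live inside the update function in optimizer.R):

# The "executor" below stands in for the bound sgd update executor, which keeps
# applying the lr_orig it was built with; the scheduler-updated lr only enters
# through a rescaling of the gradient, since lr_orig * (lr / lr_orig) == lr.
lr_orig <- 2e-6
baked_in_update <- function(weight, grad) weight - lr_orig * grad

sgd <- new.env()
sgd$lr <- lr_orig                      # kept current by the lr scheduler

update_with_workaround <- function(weight, grad) {
  grad <- grad * (sgd$lr / lr_orig)    # pre-scale the gradient
  baked_in_update(weight, grad)        # executor still applies lr_orig
}

sgd$lr <- 2e-7                         # scheduler lowers the LR ...
update_with_workaround(1, 1)           # ... effective step is now 2e-7 * grad

# Note: in the real sgd_mom_update the weight-decay term would still see
# lr_orig, which is presumably why this is "not exactly the same".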
@onomatet Glad that you found a viable workaround.
I ran a test with a modified SGD and updater where the execs get rebuilt. On a CNN model on MNIST where the weight executors were rebuilt 22 times during an epoch, training time went from 3.1 sec to 3.6 sec, so roughly 0.025 sec per learning rate update.
For sparse learning rate updates that appears reasonable, but in scenarios where the LR is continuously adapted, I doubt it's desirable.
I'd be curious to get some feedback on whether it would be worth reworking the optimizers to integrate such a mechanism.
@jeremiedb Well, it is not an optimal solution, of course, but I believe it is better than nothing at all. In the worst case, one can reduce the frequency of LR updates to keep training speed reasonable.
It is probably worth applying a quick fix for this bug now and leaving a TODO ticket for a deeper rework of the mx.symbols, should somebody decide to clean it up later.