I was using the Python version of MXNet on 64-bit Ubuntu 16.04.
I changed the fully connected layers of the VGG-16 network and wanted to fine-tune it.
At first I used the 'fixed_param_names' argument of the mx.mod.Module class, and it seemed to work very well: the accuracy kept improving and reached 75% at the 10th epoch.
Then I saved a checkpoint at the 10th epoch, changed the module by removing the 'fixed_param_names' argument, set 'lr_mult' to 0 by calling 'opt.set_lr_mult()', and finally loaded the checkpoint to continue training. However, the accuracy rapidly dropped to about 50% (the training set contains 2 classes).
Here's the code segment for using 'fixed_param_names':
net, arg_params, aux_params = mx.model.load_checkpoint('../model/vgg16', 0)
name_list = [k for k in arg_params if 'fc' not in k]   # freeze every parameter except the fc layers
mod = mx.module.Module(net, context=ctx, work_load_list=wl, fixed_param_names=name_list)
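For completeness, the module was then trained with the loaded parameters roughly as follows ('train_iter' stands for my data iterator and is only a placeholder here):
mod.fit(train_iter,
        arg_params=arg_params, aux_params=aux_params, allow_missing=True,   # new fc layers get initialized, the rest comes from the checkpoint
        optimizer='adam', optimizer_params={'learning_rate': 0.001},
        num_epoch=10)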
Here's the code for the 'set_lr_mult' method:
opt = mx.optimizer.Adam(learning_rate=0.001)
mult_dict = {k: 0.0 for k in arg_params if 'fc' not in k}   # intended to zero the learning rate of the frozen layers
opt.set_lr_mult(mult_dict)
mod.init_optimizer(optimizer=opt)
My understanding is that 'fixed_param_names' is related to 'grad_req' when the executor is bound: no memory is allocated for the gradients of the fixed parameters. In the 'lr_mult' case, the gradients are computed, but they are not applied to the weights because the effective learning rate is 0.
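Roughly, my mental model of the 'grad_req' side is the sketch below (the 'data'/'softmax_label' names and the single-GPU context are just assumptions for the example):
grad_req = {}
for name in net.list_arguments():
    if name in name_list or name in ('data', 'softmax_label'):
        grad_req[name] = 'null'   # no gradient buffer is allocated at all
    else:
        grad_req[name] = 'write'  # gradients are computed for the new fc layers
exe = net.simple_bind(ctx=mx.gpu(0), grad_req=grad_req, data=(32, 3, 224, 224))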
But I would expect this to make a difference only in memory consumption and computation speed. The results should have been the same, since the weights are identical in both cases. Why were they different?
Maybe my understanding is wrong somewhere. Could someone help me with this?
By the way, I don't understand the difference between 'fixed_param_names' and the 'BlockGrad' operator. I guess both of them save memory, except that 'BlockGrad' cuts off backpropagation completely. I suppose the same effect could be achieved with 'fixed_param_names' by specifying all the layers before the blocking node, right? Furthermore, if I wanted to freeze the middle part of a network while keeping the rest trainable, i.e. {trainable parts} <-- {frozen parts} <-- {trainable parts} <-- {deviation}, I guess 'BlockGrad' would not help, but 'fixed_param_names' would still work.
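For example, here is a toy sketch of what I mean by 'BlockGrad cutting off backpropagation' (not the real VGG symbol, just made-up layers):
import mxnet as mx
data = mx.sym.Variable('data')
conv1 = mx.sym.Convolution(data, num_filter=8, kernel=(3, 3), name='conv1')
feat = mx.sym.BlockGrad(conv1)   # gradients stop here, so conv1 receives zero gradient
fc = mx.sym.FullyConnected(feat, num_hidden=2, name='fc')
out = mx.sym.SoftmaxOutput(fc, name='softmax')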
Since my training results were not correct, there must be something wrong with my understanding. Would someone please help me? Thanks.
BlockGrad sets the gradient to 0, but the weights are still updated by weight decay (the optimizer adds a wd * w term even when the incoming gradient is zero).
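For example, something along these lines zeroes the decay on the frozen layers (a sketch reusing 'opt' and 'arg_params' from the code above; the same name-lookup caveat discussed in the follow-up below applies here too):
opt.set_wd_mult({k: 0.0 for k in arg_params if 'fc' not in k})   # zero the wd * w term for the frozen weights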
I suggest adding a "HowTo" label to this issue and changing the subject to "How to freeze certain layers in finetune".
@precedenceguo Thanks for the reply.
I found out that I had misunderstood the mechanism of the opt.set_lr_mult() method.
When 'set_lr_mult()' is called, the optimizer looks up the 'lr_mult' attributes of the variables across the whole network.
If the variables do not carry such attributes, nothing happens and the weights are still updated with the original learning rate. In other words, my call to set_lr_mult() took no effect.
I added the 'lr_mult' attributes to the variables in advance and then installed the optimizer. After that everything worked fine.
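A minimal sketch of what I mean, using a toy two-layer net instead of VGG-16 (all layer names and shapes below are made up for illustration):
import mxnet as mx

# conv1 is frozen via lr_mult/wd_mult attributes on its Variables; fc stays trainable.
data = mx.sym.Variable('data')
w1 = mx.sym.Variable('conv1_weight', lr_mult=0.0, wd_mult=0.0)
b1 = mx.sym.Variable('conv1_bias', lr_mult=0.0, wd_mult=0.0)
conv1 = mx.sym.Convolution(data, weight=w1, bias=b1, num_filter=8,
                           kernel=(3, 3), name='conv1')
relu1 = mx.sym.Activation(conv1, act_type='relu')
fc = mx.sym.FullyConnected(relu1, num_hidden=2, name='fc')
net = mx.sym.SoftmaxOutput(fc, name='softmax')

mod = mx.mod.Module(net, context=mx.cpu())
mod.bind(data_shapes=[('data', (32, 3, 32, 32))],
         label_shapes=[('softmax_label', (32,))])
mod.init_params()
# Pass the optimizer by name so the module creates it together with the symbol
# information and picks up the lr_mult/wd_mult attributes.
mod.init_optimizer(optimizer='adam', optimizer_params={'learning_rate': 0.001})
As far as I can tell, the key point is to let the module create the optimizer itself; after this, updates leave conv1_weight and conv1_bias untouched while fc still trains.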
@back2yes I met the same issue. How can I add the 'lr_mult' attributes to the variables in advance? How can I implement this with the symbol API? Please give me more detailed information, thanks!
@bruinxiong Check out this comment: https://github.com/apache/incubator-mxnet/issues/8584#issuecomment-343150185.