I am wondering if there is a way to use different learning rates for different layers, like in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up training for the newly added layers and keep the pre-trained layers at a low learning rate to prevent them from being distorted. For example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
Use Optimizer.set_lr_scale(). Related to #1150
Sorry, could you be more specific? I saw there is a parameter called "args_lrscale", but I am not quite sure how to set it. Could you make it clearer in terms of my case? I would really appreciate it. I am a newbie, just trying to learn it from scratch. Thank you.
In fact, mxnet is so flexible that you can update your parameters one by one with different Optimizers and different learning rates.
Use the following code:
https://github.com/dmlc/mxnet/blob/master/python/mxnet/optimizer.py#L376
And of course you have to maintain the binding between symbols and NDArrays yourself.
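A minimal sketch of what that can look like, assuming the mxnet 0.x Python API discussed in this thread: bind the symbol yourself, then update each parameter with its own optimizer. The variables net, arg_arrays, and grad_arrays are assumed to come from your own executor binding, and the layer name conv6 is hypothetical.

import mxnet as mx

# One optimizer per learning-rate group.
slow_opt = mx.optimizer.SGD(learning_rate=0.00001)  # pre-trained layers
fast_opt = mx.optimizer.SGD(learning_rate=0.001)    # newly added layer
slow_update = mx.optimizer.get_updater(slow_opt)
fast_update = mx.optimizer.get_updater(fast_opt)

# arg_arrays / grad_arrays come from your own executor binding (assumption).
for i, name in enumerate(net.list_arguments()):
    if grad_arrays[i] is None:            # e.g. the data/label arguments
        continue
    if name.startswith('conv6'):          # hypothetical name of the new layer
        fast_update(i, grad_arrays[i], arg_arrays[i])
    else:
        slow_update(i, grad_arrays[i], arg_arrays[i])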
It says in the code that "args_lrscale : dict of index to float" in set_lr_scale(self, args_lrscale) and also "index is an unique integer key used to index the parameters". What exactly is the index? If I want to set lr_scale for the first two layers, what are their indices?
According to my understanding, the index is consistent with the symbol's StaticGraph DFS visit order. Reference: https://github.com/dmlc/mxnet/blob/4616e8b0a3e8f9258d2403f533166718f82903bc/src/symbol/symbol.cc#L203-L215
You can check with symbol.list_arguments() and set lr_scale using the index of each parameter in the returned list.
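For the scenario in the original question, a minimal sketch under the assumption that the old set_lr_scale (dict of index to float) API linked above is used; net is the composed symbol and conv6 is the hypothetical name of the new layer.

import mxnet as mx

args = net.list_arguments()
print(args)  # inspect the DFS order to see which index belongs to which layer

# Scale factors relative to the base learning rate of 0.001:
# 0.00001 / 0.001 = 0.01 for the pre-trained layers, 1.0 for the new layer.
lr_scale = {}
for i, name in enumerate(args):
    if name == 'data' or name.endswith('label'):
        continue                          # inputs have no learning rate
    lr_scale[i] = 1.0 if name.startswith('conv6') else 0.01

opt = mx.optimizer.SGD(learning_rate=0.001)
opt.set_lr_scale(lr_scale)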
Hi, I have implemented this by setting lr_mult and wd_mult in the operator prop's parameters, so we can get lr_mult and wd_mult from the operator prop in the same way as InferShape.
struct ConvolutionParam : public dmlc::Parameter&lt;ConvolutionParam&gt; {
  TShape kernel;
  TShape stride;
  TShape dilate;
  TShape pad;
  uint32_t num_filter;
  uint32_t num_group;
  uint64_t workspace;
  std::vector&lt;float&gt; lr_mult;
  std::vector&lt;float&gt; wd_mult;
  bool no_bias;
  DMLC_DECLARE_PARAMETER(ConvolutionParam) {
    int shape[] = {1, 1};
    DMLC_DECLARE_FIELD(kernel).describe("convolution kernel size: (y, x)");
    DMLC_DECLARE_FIELD(stride).set_default(TShape(shape, shape + 2))
        .describe("convolution stride: (y, x)");
    DMLC_DECLARE_FIELD(dilate).set_default(TShape(shape, shape + 2))
        .describe("convolution dilate: (y, x)");
    shape[0] = shape[1] = 0;
    DMLC_DECLARE_FIELD(pad).set_default(TShape(shape, shape + 2))
        .describe("pad for convolution: (y, x)");
    DMLC_DECLARE_FIELD(num_filter).set_range(1, 100000)
        .describe("convolution filter(channel) number");
    DMLC_DECLARE_FIELD(num_group).set_default(1)
        .describe("Number of groups partition. "
                  "This option is not supported by CuDNN, you can use SliceChannel to num_group,"
                  "apply convolution and concat instead to achieve the same need.");
    DMLC_DECLARE_FIELD(workspace).set_default(512).set_range(128, 4096)
        .describe("Tmp workspace for convolution (MB).");
    DMLC_DECLARE_FIELD(lr_mult).set_default(std::vector&lt;float&gt;({1.0, 1.0}))
        .describe("convolution learning rate scale of weight and bias");
    DMLC_DECLARE_FIELD(wd_mult).set_default(std::vector&lt;float&gt;({1.0, 1.0}))
        .describe("convolution weight decay scale of weight and bias");
    DMLC_DECLARE_FIELD(no_bias).set_default(false)
        .describe("Whether to disable bias parameter.");
  }
};
So you must first add support for a vector parameter type in dmlc-core/include/dmlc/parameter.h, and then add a function such as GetMult to retrieve the lr_mult and wd_mult set when the symbol is created, in the same way as InferShape.
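A hypothetical usage sketch of this modification from the Python side (not part of stock mxnet); here lr_mult takes one value for the weight and one for the bias:

import mxnet as mx

data = mx.symbol.Variable('data')
# Pre-trained layer: scale its learning rate down by 100x.
conv1 = mx.symbol.Convolution(data=data, kernel=(3, 3), num_filter=64,
                              name='conv1', lr_mult=(0.01, 0.01))
# Newly added layer: keep the full base learning rate.
conv6 = mx.symbol.Convolution(data=conv1, kernel=(3, 3), num_filter=64,
                              name='conv6', lr_mult=(1.0, 1.0))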
@antinucleon I think we should make lr_mult and wd_mult rely on names, not indexes. It's more natural.
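For comparison, a name-keyed interface might look like the sketch below (hypothetical at the time of this thread; later mxnet versions added Optimizer.set_lr_mult along these lines):

opt = mx.optimizer.SGD(learning_rate=0.001)
# Keyed by argument name instead of DFS index, so it survives graph changes.
opt.set_lr_mult({'conv1_weight': 0.01, 'conv1_bias': 0.01,
                 'conv2_weight': 0.01, 'conv2_bias': 0.01})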