Hello,
In other frameworks I have applied weight decay to the cost function rather than layer-wise. How does per-layer weight decay work?
And if I wanted to apply weight decay to the cost function, how would I do that in Keras? Thank you.
Cheers,
EM
Weight decay is added through Regularizers (docs). It only applies to the layers you pass this argument to.
Can somebody explain the original question? I also know weight decay only in the cost function. So how should I implement this in Keras?
@michaelschleiss, you can take a look at the docs that @joelthchao referred to.
The question is whether it is possible to add the weight decay term manually to the global cost function in a Keras model. While it is possible to create your own cost function in addition to Keras' preimplemented cost functions (see the docs here), there doesn't seem to be an easy way to get hold of the weight matrices, which you would need in order to add the weight decay term to the cost. Only the true labels and the predicted labels seem to be accessible from within a custom cost function.
If you are interested in adding regularization to your network, your best bet would be to consider layer-specific regularizers that you would have to add one by one.
I also have this problem. I think global weight decay is actually not equal to layer-wise weight decay.
@perryshao global weight decay is just summation of all layer-wise weight decays. Therefore, if you want to have a "global weight decay" in your loss function, just put Regularizer on all the layers with trainable weights.
@joelthchao If I use a Regularizer on all the layers, what should the regularization parameter be? Suppose I have a global weight decay rate of 0.005 in Caffe; is that the same as using 0.005 for each layer in Keras?
@zekun-li Yes.
With deeper and deeper networks, it would be nice to have a cost-function weight decay so you don't have to add it to each layer.
This could be of help to anyone who is looking for how to code this, ty.
for layer in my_model.layers:
    if hasattr(layer, 'kernel_regularizer'):
        layer.kernel_regularizer = regularizers.l2(weight_decay)
@tutysara thanks for the example. Is this the same as your example?
trainoptmodel = Sequential()
trainoptmodel.add(Dense(25, input_dim=10, activation='tanh', kernel_regularizer=regularizers.l2(0.01)))
trainoptmodel.add(Dense(3, activation='linear', kernel_regularizer=regularizers.l2(0.01)))
And does this code mean that the global weight decay is set to 0.01?
@tutysara this does not work.
I just lost an hour investigating it. Adding regularizers this way will not affect layer.losses. Layer losses are constructed when the layer is added, so adding regularizers after the layer was constructed will not affect the model loss.
@mjmikulski Hey, sorry, I didn't know it doesn't take effect after layer construction. I based my implementation on @joelthchao's comment. How did you end up solving it?
@tutysara
I just added the regularizer manually on each layer:
from keras.regularizers import L1L2

dense_regularizer = L1L2(l2=0.0001)
# ...
model.add(Dense(128, kernel_regularizer=dense_regularizer))
# ...
@EnriqueSMarquez
And to answer the original question: weight decay is exactly the same as the L2 norm penalty (see this, page 227).
In Keras you cannot add a global weight decay; you can instead add regularizers for each layer (and, to be precise, separately for the layer weights (aka kernel) and the biases).
@joelthchao Looking at the mathematical definition of the L2 norm, it is obvious that global weight decay using the L2 norm is not exactly the same as layer-wise weight decay using the L2 norm. For example, compare sqrt(x1^2+x2^2)+sqrt(y1^2+y2^2) with sqrt(x1^2+x2^2+y1^2+y2^2), where the vector x holds the weights of layer 1 and y holds the weights of layer 2.
@perryshao From mathematical point of view you are perfectly right - there is a sqrt in L2.
But in keras (or maybe more generally in ML), L2 means rather "sum of squares", not really a proper L2 norm, see keras implementation:
class L1L2(Regularizer):
    # ...
    def __call__(self, x):
        regularization = 0.
        if self.l1:
            regularization += K.sum(self.l1 * K.abs(x))
        if self.l2:
            regularization += K.sum(self.l2 * K.square(x))
        return regularization
@mjmikulski Yes, you are right. I checked the code, and it is not really a proper L2 norm, just "a sum of squares". Thank you for your reply.
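To make the point of the last few comments concrete, here is a small pure-Python sketch. The weights and the decay coefficient are made-up illustrative values, and the helper names are hypothetical. It only shows that a sum-of-squares penalty is additive across layers, while a true Euclidean norm (with the square root) is not.

```python
import math

# Hypothetical weights for two layers (illustrative values only).
layer1 = [3.0, 4.0]
layer2 = [6.0, 8.0]
decay = 0.005  # same coefficient used globally and per layer


def sum_of_squares(weights):
    """Keras-style 'L2' penalty without the coefficient: sum of w**2."""
    return sum(w * w for w in weights)


def l2_norm(weights):
    """True Euclidean norm, including the square root."""
    return math.sqrt(sum_of_squares(weights))


# Sum-of-squares penalty: one global term vs. one term per layer.
global_penalty = decay * sum_of_squares(layer1 + layer2)
layerwise_penalty = decay * sum_of_squares(layer1) + decay * sum_of_squares(layer2)

# True norm: one global norm vs. the sum of per-layer norms.
global_norm = l2_norm(layer1 + layer2)
layerwise_norm = l2_norm(layer1) + l2_norm(layer2)

print("sum of squares:", global_penalty, layerwise_penalty)
print("true norm:     ", global_norm, layerwise_norm)
```

The two sum-of-squares values agree, which is why putting the same coefficient on every layer gives the "global" penalty; the two true-norm values do not.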
@EnriqueSMarquez
weight decay is exactly the same what L2 norm
Apparently it's not. Here is an explanation why:
https://bbabenko.github.io/weight-decay/
TL;DR: Weight decay is subtracted directly from the weights at each step, whereas the L2 regularization term is added to the loss and therefore affects the weights through its derivative (which carries a factor of 2). To be consistent with weight decay, the L2 coefficient in the loss function should be divided by 2.
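A minimal sketch of that factor of 2 for a single weight and one plain SGD step. All numbers are made up, the function names are hypothetical, and the weight decay update is assumed to be w - lr * grad - lr * wd * w (conventions differ between frameworks).

```python
lr = 0.1             # learning rate (illustrative)
weight_decay = 0.01  # decay coefficient as it might be quoted in a paper
w = 2.0              # a single weight
grad_data = 0.5      # gradient of the data loss L w.r.t. w (made-up value)


def step_weight_decay(w, grad, lr, wd):
    """Plain SGD step with the decay applied directly to the weight."""
    return w - lr * grad - lr * wd * w


def step_l2_in_loss(w, grad, lr, l2):
    """Plain SGD step on L + l2 * w**2; the penalty's gradient is 2 * l2 * w."""
    return w - lr * (grad + 2.0 * l2 * w)


decayed = step_weight_decay(w, grad_data, lr, weight_decay)
same_coeff = step_l2_in_loss(w, grad_data, lr, weight_decay)
halved_coeff = step_l2_in_loss(w, grad_data, lr, weight_decay / 2.0)

print(decayed, same_coeff, halved_coeff)
```

Using the same coefficient in the loss shrinks the weight twice as much as the direct decay; halving it reproduces the weight decay step.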
When using sophisticated optimizers there are some other differences:
http://www.fast.ai/2018/07/02/adam-weight-decay/
The problem of adding kernel_regularizer to an existing layer is still unsolved for me, even though there is a Stack Overflow question that claims it should work:
https://stackoverflow.com/questions/48330137/adding-regularizer-to-an-existing-layer-of-a-trained-model-without-resetting-wei
@apatsekin good point! I have not thought about it.
Let me paste here part of bbabenko article you mention:
In frameworks that implement L2 regularization, the gradients the solver is using are for “total loss”, which has L2 regularization baked into it (this fact is typically abstracted away from the solver — it doesn’t “know” anything about whether a regularization term was included in the loss or not). In frameworks that implement weight decay, the solver is considering only the gradient for L.
What I said:
weight decay is exactly the same what L2 norm
is true for plain gradient descent (up to a factor of 2, which is irrelevant when you do a hyperparameter search but is something to remember when you try to reproduce an experiment). The situation gets more complicated when you accumulate the gradient in some way. Thanks @apatsekin for pointing it out.
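A rough illustration of how things change once the optimizer rescales the gradient. This is a hand-written scalar Adam, not any framework's implementation; the constant data gradient, the hyperparameters, and the simplified decoupled decay step (w minus lr * wd * w) are all assumptions made for the sketch.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad * grad
    m_hat = m / (1.0 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - b2 ** t)  # bias-corrected second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def run(decoupled, steps=100, wd=0.01, lr=0.001):
    """Train one weight with either an L2 term in the gradient or decoupled decay."""
    w, m, v = 1.0, 0.0, 0.0
    grad_data = 10.0  # constant, made-up gradient of the data loss
    for t in range(1, steps + 1):
        if decoupled:
            # Adam sees only the data gradient; the decay is applied separately.
            w_new, m, v = adam_step(w, grad_data, m, v, t, lr=lr)
            w = w_new - lr * wd * w
        else:
            # The L2 gradient (wd * w) is fed through Adam's normalization.
            w, m, v = adam_step(w, grad_data + wd * w, m, v, t, lr=lr)
    return w


w_l2 = run(decoupled=False)
w_decoupled = run(decoupled=True)
print(w_l2, w_decoupled)
```

In this toy setup the L2 term is almost entirely absorbed by Adam's normalization, so the two runs end at different weights even though they use the same coefficient.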
Anyway, what is wrong with this stackoverflow question? The answer makes a point and explains OP's observation.
@mjmikulski thanks for reply!
Anyway, what is wrong with this stackoverflow question? The answer makes a point and explains OP's observation.
As you pointed earlier in this thread:
I just lost an hour to investigate it. Adding regularizers this way will not affect layer.losses
And I confirm that _recompile_ doesn't pick it up. However, the stackoverflow thread claims the opposite.
# Add weight decay to the whole model
regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
decay_attributes = ['kernel_regularizer', 'bias_regularizer',
                    'beta_regularizer', 'gamma_regularizer']
for layer in model.layers:
    for attr in decay_attributes:
        if hasattr(layer, attr):
            setattr(layer, attr, regularizer)
As pointed out by @lars76 below, setting an attribute like kernel_regularizer on an existing layer will only affect the model's config. Even when compiling the model afterwards, the regularization loss will be ignored. To counteract this, we need to reload the model using
model = model_from_json(model.to_json())
This gets especially ugly on a model with already trained weights, as we need to save and reload the weights as well. The function below does all this for you and also works with tf.keras instead of keras.
Example usage:
model = add_l1l2_regularizer(model, l2=0.01)
model.compile(...)
import os
import tempfile

import keras  # or: from tensorflow import keras


def add_l1l2_regularizer(model, l1=0.0, l2=0.0, reg_attributes=None):
    # Add L1L2 regularization to the whole model.
    # NOTE: This will save and reload the model. Do not call this function inplace but with
    # model = add_l1l2_regularizer(model, ...)
    if not reg_attributes:
        reg_attributes = ['kernel_regularizer', 'bias_regularizer',
                          'beta_regularizer', 'gamma_regularizer']
    if isinstance(reg_attributes, str):
        reg_attributes = [reg_attributes]

    regularizer = keras.regularizers.l1_l2(l1=l1, l2=l2)

    for layer in model.layers:
        for attr in reg_attributes:
            if hasattr(layer, attr):
                setattr(layer, attr, regularizer)

    # So far, the regularizers only exist in the model config. We need to
    # reload the model so that Keras adds them to each layer's losses.
    model_json = model.to_json()

    # Save the weights before reloading the model.
    tmp_weights_path = os.path.join(tempfile.gettempdir(), 'tmp_weights.h5')
    model.save_weights(tmp_weights_path)

    # Reload the model
    model = keras.models.model_from_json(model_json)
    model.load_weights(tmp_weights_path, by_name=True)

    return model
Also note that depending on the optimizer you use, L2 regularization does not necessarily correspond to weight decay (https://arxiv.org/abs/1711.05101).
Thanks @batzner, very well written. It did change the loaded model's regularizers.
For example, the kernel regularizer of the first conv layer of Keras mobilenet_v2 with weight decay = 0.00004 looks as follows:
Conv1: {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 3.9999998989515007e-05}}
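The slightly odd l2 value in that config is most likely just 0.00004 after being stored as a 32-bit float. That is an assumption about the backend's default float type, not something stated in the thread. A quick standard-library check of the rounding:

```python
import struct

weight_decay = 0.00004  # value passed to the regularizer in the example above

# Round-trip the value through a 4-byte IEEE 754 single-precision float.
as_float32 = struct.unpack("f", struct.pack("f", weight_decay))[0]

print(as_float32)
print(abs(as_float32 - weight_decay))
```

The round-tripped value differs from 0.00004 only around the 12th decimal place, so the regularizer strength is effectively what was requested.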
Cool, it solved my problem, thank you.
Nice solution, thank you
@batzner's code is not working, because it only changes the model config but does not create the additional loss tensors itself. One can check this easily by running
for l in model.layers:
    print(l.losses)
A workaround is the following code
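# Imports assumed for this snippet (or the tensorflow.keras equivalents).
from keras.models import model_from_json
from keras.regularizers import l2


def create_model():
    model = your_model()
    model.save_weights("tmp.h5")

    # optionally do some other modifications (freezing layers, adding convolutions etc.)
    # ....

    regularizer = l2(WEIGHT_DECAY / 2)
    for layer in model.layers:
        for attr in ['kernel_regularizer', 'bias_regularizer']:
            if hasattr(layer, attr) and layer.trainable:
                setattr(layer, attr, regularizer)

    # Rebuild the model from its config so the regularization losses are created,
    # then restore the saved weights.
    out = model_from_json(model.to_json())
    out.load_weights("tmp.h5", by_name=True)

    return out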
Weight decay applies only to the bias and kernel weights (not BN, see https://arxiv.org/pdf/1810.12281.pdf). This means that for SGD you use 'kernel_regularizer' + 'bias_regularizer' with l2(WEIGHT_DECAY / 2). For Adam, one has to create a custom optimizer based on tf.contrib.opt.AdamWOptimizer (see https://arxiv.org/pdf/1711.05101.pdf). As somebody else already noted, when a paper states weight decay = 0.0005, we have to divide by 2 (due to the derivative of ||weight||_2^2).
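The division by 2 can be spelled out for plain SGD. In the sketch below the symbols are my own notation, not the thread's: \eta is the learning rate, \lambda the weight decay quoted in a paper, and \lambda' the coefficient passed to l2(...). The weight decay update is assumed to be w_i \leftarrow w_i - \eta\,\partial L/\partial w_i - \eta\lambda w_i.

```latex
\begin{aligned}
\tilde{L}(w) &= L(w) + \lambda' \lVert w \rVert_2^2
             = L(w) + \lambda' \sum_i w_i^2, \\
\frac{\partial \tilde{L}}{\partial w_i}
             &= \frac{\partial L}{\partial w_i} + 2\lambda' w_i, \\
w_i &\leftarrow w_i - \eta\,\frac{\partial L}{\partial w_i} - 2\eta\lambda' w_i .
\end{aligned}
```

Matching the last line against the weight decay update gives 2\lambda' = \lambda, i.e. \lambda' = \lambda / 2. Under these assumptions, a paper's weight decay of 0.0005 corresponds to l2(0.00025).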
@lars76 Thanks for pointing it out! I didn't know that only the config got changed, and issue 12053 was opened just 6 days ago. Also, the way Keras handles the regularization loss differs from TensorFlow, see issue 21587. You're right about layer.losses, since model.compile() in training.py merely adds the network losses to the total loss, and a regularization loss tensor has to have been added by that point. Your workaround works for me; at least I can see that during the compile step it adds a loss tensor for each layer regularizer.
So if you just need to load a pre-defined model without pre-trained weights, you don't need to save/load weights.
import tensorflow as tf
from tensorflow.keras.models import model_from_json

model = tf.keras.applications.MobileNetV2(weights=None)

# Add regularizers to your layers
# ......

model = model_from_json(model.to_json())
model.compile(...)
To test whether it works:
for layer in model.layers:
    print('%s: %s ' % (layer.get_config().get('name'), layer.losses))
and you should see the result like this:
Conv1: [<tf.Tensor 'Conv1_1/kernel/Regularizer/add:0' shape=() dtype=float32>]
Excellent workaround, thanks!
@lars76 thank you for the hint! I updated my comment to make clear that the naïve solution only changes the model's config. I also added a note to highlight that L2 regularization doesn't directly correspond to weight decay.
As for the paper you referenced (https://arxiv.org/pdf/1810.12281.pdf), it states:
For clarity, we ignore the parameters γ and β, which do not impact the performance in practice.
As far as I understand it, they disable the learnable affine parameters of Batch Normalization completely and therefore don't regularize them. But if I were to enable them, is it harmful to also regularize them?
No, you can also regularize them. I just disabled gamma/beta because the authors of the paper wrote that it made no difference performance-wise, and older papers include only regular weights/biases (since BN didn't exist). Maybe you're even right: it might be more logical to regularize both of them as well, since beta corresponds to the bias weights and gamma to the regular weights. I think the correct definition for SGD weight decay is then "Loss function + 1/2 sum w_i^2 where w_i are all trainable weights".
How about this:
import tensorflow as tf


# A utility function to add weight decay after the model is defined.
def add_weight_decay(model, weight_decay):
    if (weight_decay is None) or (weight_decay == 0.0):
        return

    # Recurse into nested models.
    def add_decay_loss(m, factor):
        if isinstance(m, tf.keras.Model):
            for layer in m.layers:
                add_decay_loss(layer, factor)
        else:
            for param in m.trainable_weights:
                with tf.keras.backend.name_scope('weight_regularizer'):
                    # Bind param as a default argument; a plain closure would make
                    # every callable refer to the last weight of the loop.
                    regularizer = lambda param=param: tf.keras.regularizers.l2(factor)(param)
                    m.add_loss(regularizer)

    # Weight decay and L2 regularization differ by a factor of 2.
    add_decay_loss(model, weight_decay / 2.0)
    return
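If your model contains nested Model objects, you can use this function:
def add_l2_weight_decay(net: Model, weights_decay=5e-4):
    # NOTE: as discussed earlier in this thread, setting these attributes only
    # changes the config; the model still has to be reloaded afterwards
    # (e.g. via model_from_json) for the regularization losses to be created.
    reg = l2(weights_decay)
    for layer in net.layers:
        if isinstance(layer, Model):
            add_l2_weight_decay(layer, weights_decay)
        for attr in ['kernel_regularizer', 'bias_regularizer']:
            if hasattr(layer, attr) and layer.trainable:
                setattr(layer, attr, reg)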