Keras: Is adding weight decay to all the layers (including the input and output layers) the same as adding the weight decay term to the cost function?

Created on 13 May 2016 · 31 Comments · Source: keras-team/keras

Hello,

In other frameworks I have applied weight decay in the cost function rather than layer-wise. How does weight decay per layer work?

And if I wanted to add weight decay to the cost function, how would I do it in Keras? Thank you.

Cheers,
EM

stale

EnriqueSMarquez

Most helpful comment

@batzner's code does not work, because it only changes the model config and does not create the additional loss tensors themselves. One can check this easily by running

for l in model.layers:
    print(l.losses)

A workaround is the following code

from keras.models import model_from_json
from keras.regularizers import l2

def create_model():
    model = your_model()
    model.save_weights("tmp.h5")

    # optionally do some other modifications (freezing layers, adding convolutions etc.)
    # ...

    regularizer = l2(WEIGHT_DECAY / 2)
    for layer in model.layers:
        for attr in ['kernel_regularizer', 'bias_regularizer']:
            if hasattr(layer, attr) and layer.trainable:
                setattr(layer, attr, regularizer)

    # rebuild the model from its config so the regularization losses are created
    out = model_from_json(model.to_json())
    out.load_weights("tmp.h5", by_name=True)

    return out

Weight decay refers only to the bias and kernel weights (not BN, see https://arxiv.org/pdf/1810.12281.pdf). This means that for SGD you use 'kernel_regularizer' + 'bias_regularizer' with l2(WEIGHT_DECAY / 2). For Adam, one has to create a custom optimizer based on tf.contrib.opt.AdamWOptimizer (see https://arxiv.org/pdf/1711.05101.pdf). As somebody else already noted, when a paper states weight decay = 0.0005, we have to divide by 2 (due to the derivative of ||weight||_2^2).
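To make the division by 2 concrete, here is a minimal sketch in plain Python for a single scalar weight. It assumes the convention used in this thread, where weight decay adds weight_decay * w to the gradient and the L2 regularizer adds coeff * w**2 to the loss. The function names and numbers are made up for illustration and are not part of Keras.

```python
def sgd_step_l2(w, grad, lr, l2_coeff):
    """One SGD step on loss + l2_coeff * w**2 (the penalty lives in the loss)."""
    # d/dw (l2_coeff * w**2) = 2 * l2_coeff * w
    return w - lr * (grad + 2.0 * l2_coeff * w)


def sgd_step_weight_decay(w, grad, lr, weight_decay):
    """One SGD step on the plain loss, with weight_decay * w added to the gradient."""
    return w - lr * (grad + weight_decay * w)


# Illustrative values only.
w, grad, lr, weight_decay = 0.8, 0.3, 0.1, 0.0005

decayed = sgd_step_weight_decay(w, grad, lr, weight_decay)
halved = sgd_step_l2(w, grad, lr, weight_decay / 2)   # matches weight decay
unhalved = sgd_step_l2(w, grad, lr, weight_decay)     # decays twice as strongly

print(decayed, halved, unhalved)
```

With plain SGD the halved coefficient reproduces the weight decay step, while passing the paper's value straight into the L2 penalty shrinks the weight twice as much.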

lars76 on 22 Jan 2019

👍9 ❤3 🎉2

All 31 comments

Weight decay is added by Regularizers (docs). It only applies to the layers to which you add this argument.

joelthchao on 13 May 2016

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.

stale[bot] on 23 May 2017

Can somebody explain the original question? I also know weight decay only in the cost function. So how should I implement this in Keras?

michaelschleiss on 13 Jun 2017

@michaelschleiss, you can take a look at the docs that @joelthchao referred to.

The question is about whether or not it is possible to add the weight decay term manually to the global cost function in a Keras model. While it is possible to create your own cost function in addition to Keras' preimplemented cost functions (see the docs here), it doesn't seem as if there is an easy way to get hold of the weight matrix, which you would need in order to add the weight decay term to the cost. Only the true labels and the predicted labels seem to be accessible from within a custom cost method.

If you are interested in adding regularization to your network, your best bet would be to consider layer-specific regularizers that you would have to add one by one.

sebastianbk on 22 Jun 2017

👍2

I also have this problem. I think global weight decay is actually not equal to layer-wise weight decay.

perryshao on 11 Jul 2017

@perryshao global weight decay is just the summation of all layer-wise weight decays. Therefore, if you want to have a "global weight decay" in your loss function, just put a Regularizer on all the layers with trainable weights.

joelthchao on 11 Jul 2017

👎7 👍5

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

stale[bot] on 9 Oct 2017

@joelthchao If I use a Regularizer on all the layers, what should the regularization parameter be? Suppose I have a global weight decay rate of 0.005 in Caffe, is that the same as using 0.005 for each layer in Keras?

zekun-li on 12 Jan 2018

@zekun-li Yes.

joelthchao on 22 Jan 2018

With deeper and deeper networks, it would be nice to have a cost function weight decay so you don't have to add it to each layer.

emoen on 21 Feb 2018

This could be of help to anyone who is looking for how to code this, ty.

for layer in my_model.layers:
    if hasattr(layer, 'kernel_regularizer'):
        layer.kernel_regularizer = regularizers.l2(weight_decay)

tutysara on 7 Mar 2018

👍12 👎7

@tutysara thanks for the example. Is this the same as your example?

from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers
trainoptmodel = Sequential()
trainoptmodel.add(Dense(25, input_dim=10, activation='tanh', kernel_regularizer=regularizers.l2(0.01)))
trainoptmodel.add(Dense(3, activation="linear", kernel_regularizer=regularizers.l2(0.01)))

And does this code mean that the global weight decay is set to 0.01?

HerocypheR on 15 May 2018

@tutysara this does not work.

I just lost an hour investigating it. Adding regularizers this way will not affect layer.losses. A layer's losses are constructed when the layer is added, so adding regularizers after the layer has been constructed will not affect the model loss.
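The mechanism can be illustrated with a toy class in plain Python. This is only a sketch of the behaviour described here: ToyLayer and l2_penalty are hypothetical stand-ins and not the Keras implementation. The loss term is registered once in the constructor, so assigning the attribute later changes the config but not the list of losses, while rebuilding from the config picks it up.

```python
class ToyLayer:
    """Hypothetical stand-in for a layer; NOT the Keras implementation."""

    def __init__(self, weights, kernel_regularizer=None):
        self.weights = list(weights)
        self.kernel_regularizer = kernel_regularizer
        # Loss terms are registered once, at construction time.
        self.losses = []
        if kernel_regularizer is not None:
            self.losses.append(lambda: kernel_regularizer(self.weights))

    def get_config(self):
        return {"weights": list(self.weights),
                "kernel_regularizer": self.kernel_regularizer}

    @classmethod
    def from_config(cls, config):
        # Rebuilding runs the constructor again, so the loss is registered.
        return cls(**config)


def l2_penalty(coeff):
    """Sum-of-squares penalty, mirroring what an l2 regularizer computes."""
    return lambda ws: coeff * sum(w * w for w in ws)


layer = ToyLayer([1.0, -2.0])
layer.kernel_regularizer = l2_penalty(0.01)   # only the attribute/config changes
losses_after_setattr = len(layer.losses)

rebuilt = ToyLayer.from_config(layer.get_config())
losses_after_rebuild = len(rebuilt.losses)
penalty = rebuilt.losses[0]()

print(losses_after_setattr, losses_after_rebuild, penalty)
```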

mjmikulski on 6 Jul 2018

❤1

@mjmikulski Hey, sorry, I didn't know it doesn't take effect after layer construction. I based my implementation on @joelthchao's comment. How did you end up solving it?

tutysara on 10 Jul 2018

@tutysara
I just added the regularizer manually on each layer:

from keras.regularizers import L1L2
dense_regularizer = L1L2(l2=0.0001)
# ...
model.add(Dense(128, kernel_regularizer=dense_regularizer))
# ...

@EnriqueSMarquez
And to answer the original question: weight decay is exactly the same as L2 norm (see this, page 227).

In keras you cannot add a global weight decay, you can instead add regularizers for each layer (and to be precise, separately to layer weights (aka kernel) and biases).

mjmikulski on 10 Jul 2018

@joelthchao Looking at the mathematical definition of the L2 norm, it is obvious that global weight decay using the L2 norm is not exactly the same as layer-wise weight decay using the L2 norm. For example, compare sqrt(x1^2+x2^2)+sqrt(y1^2+y2^2) with sqrt(x1^2+x2^2+y1^2+y2^2), where the x vector holds the weights of layer 1 and y the weights of layer 2.

perryshao on 12 Jul 2018

@perryshao From a mathematical point of view you are perfectly right: there is a sqrt in L2.

But in Keras (or maybe more generally in ML), L2 rather means "sum of squares", not really a proper L2 norm; see the Keras implementation:

class L1L2(Regularizer):
    # ...
    def __call__(self, x):
        regularization = 0.
        if self.l1:
            regularization += K.sum(self.l1 * K.abs(x))
        if self.l2:
            regularization += K.sum(self.l2 * K.square(x))
        return regularization
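A quick numeric check of both statements, as a sketch in plain Python with made-up weights: the sum-of-squares penalty computed per layer adds up to the same value as one penalty over all weights, while the true L2 norm (with the sqrt) does not.

```python
import math


def sum_of_squares(weights):
    return sum(w * w for w in weights)


# Made-up weights for two layers.
layer1 = [0.5, -1.0]
layer2 = [2.0, 0.25]
coeff = 0.01

# Sum-of-squares penalty: layer-wise total equals the global penalty.
per_layer_penalty = coeff * sum_of_squares(layer1) + coeff * sum_of_squares(layer2)
global_penalty = coeff * sum_of_squares(layer1 + layer2)

# Proper L2 norm (with sqrt): layer-wise total differs from the global norm.
per_layer_norm = math.sqrt(sum_of_squares(layer1)) + math.sqrt(sum_of_squares(layer2))
global_norm = math.sqrt(sum_of_squares(layer1 + layer2))

print(per_layer_penalty, global_penalty)
print(per_layer_norm, global_norm)
```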

mjmikulski on 13 Jul 2018

👍3

@mjmikulski Yes, you are right. I checked the code, and it is not really a proper L2 norm, just "a sum of squares". Thank you for your reply.

perryshao on 16 Jul 2018

@EnriqueSMarquez

weight decay is exactly the same as L2 norm

Apparently it's not. Here is an explanation why:
https://bbabenko.github.io/weight-decay/
TL;DR: Weight decay is subtracted directly from the weights on each step, whereas the L2 term is added to the loss and therefore affects the weights through its derivative (which brings in a factor of 2). To be consistent with weight decay, the L2 coefficient in the loss function should be divided by 2.
When using sophisticated optimizers there are some other differences:
http://www.fast.ai/2018/07/02/adam-weight-decay/
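A rough sketch of why adaptive optimizers behave differently, in plain Python for a single scalar weight and the very first Adam step. The function names and values are made up, and scaling the decoupled decay by the learning rate is a convention chosen here, not something taken from the linked posts. When the L2 gradient goes through Adam's rescaling, the decay is almost entirely washed out on this step. When the decay is applied separately, it is not.

```python
import math


def adam_first_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam update direction at t=1, starting from m = v = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(w, grad, lr, weight_decay):
    # L2 in the loss: the penalty gradient is rescaled together with the data gradient.
    return w - lr * adam_first_direction(grad + weight_decay * w)


def adam_decoupled_step(w, grad, lr, weight_decay):
    # Decoupled decay: Adam sees only the data gradient, the decay is applied separately.
    return w - lr * adam_first_direction(grad) - lr * weight_decay * w


# Illustrative values only.
w, grad, lr, weight_decay = 0.8, 0.3, 0.1, 0.0005

no_decay = adam_l2_step(w, grad, lr, 0.0)
with_l2 = adam_l2_step(w, grad, lr, weight_decay)
decoupled = adam_decoupled_step(w, grad, lr, weight_decay)

print(no_decay, with_l2, decoupled)
```

Here with_l2 is practically identical to no_decay, whereas decoupled is smaller by lr * weight_decay * w.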

The problem of adding kernel_regularizer to an existing layer is still unsolved for me, even though there is a stackoverflow question that claims it should work:
https://stackoverflow.com/questions/48330137/adding-regularizer-to-an-existing-layer-of-a-trained-model-without-resetting-wei

apatsekin on 22 Oct 2018

@apatsekin good point! I have not thought about it.

Let me paste here part of bbabenko article you mention:

In frameworks that implement L2 regularization, the gradients the solver is using are for “total loss”, which has L2 regularization baked into it (this fact is typically abstracted away from the solver — it doesn’t “know” anything about whether a regularization term was included in the loss or not). In frameworks that implement weight decay, the solver is considering only the gradient for L.

What I said:

weight decay is exactly the same as L2 norm

is true for plain gradient descent (up to a factor of 2, which is irrelevant when you do a hyperparameter search but is something to remember when you try to reproduce an experiment). The situation gets more complicated when you accumulate the gradient in some way. Thanks @apatsekin for pointing it out.

Anyway, what is wrong with this stackoverflow question? The answer makes a point and explains OP's observation.

mjmikulski on 23 Oct 2018

@mjmikulski thanks for reply!

Anyway, what is wrong with this stackoverflow question? The answer makes a point and explains OP's observation.

As you pointed out earlier in this thread:

I just lost an hour to investigate it. Adding regularizers this way will not affect layer.losses

And I confirm that recompiling doesn't pick it up. However, the stackoverflow thread claims the opposite.

apatsekin on 24 Oct 2018

~~# Add weight decay to the whole model~~
~~regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)~~
~~decay_attributes = ['kernel_regularizer', 'bias_regularizer',~~
~~'beta_regularizer', 'gamma_regularizer']~~

~~for layer in model.layers:~~
~~for attr in decay_attributes:~~
~~if hasattr(layer, attr):~~
~~setattr(layer, attr, regularizer)~~

As pointed out by @lars76 below, setting an attribute like kernel_regularizer on an existing layer will only affect the model's config. Even when compiling the model afterwards, the regularization loss will be ignored. To counteract this, we need to reload the model using

model = model_from_json(model.to_json())

This gets especially ugly on a model with already trained weights, as we need to save and reload the weights as well. The function below does all this for you and also works with tf.keras instead of keras.

Example usage:

model = add_l1l2_regularizer(model, l2=0.01)
model.compile(...)

import os
import tempfile
import keras  # or: from tensorflow import keras

def add_l1l2_regularizer(model, l1=0.0, l2=0.0, reg_attributes=None):
    # Add L1L2 regularization to the whole model.
    # NOTE: This will save and reload the model. Do not call this function inplace but with
    # model = add_l1l2_regularizer(model, ...)

    if not reg_attributes:
        reg_attributes = ['kernel_regularizer', 'bias_regularizer',
                          'beta_regularizer', 'gamma_regularizer']
    if isinstance(reg_attributes, str):
        reg_attributes = [reg_attributes]

    regularizer = keras.regularizers.l1_l2(l1=l1, l2=l2)

    for layer in model.layers:
        for attr in reg_attributes:
            if hasattr(layer, attr):
                setattr(layer, attr, regularizer)

    # So far, the regularizers only exist in the model config. We need to
    # reload the model so that Keras adds them to each layer's losses.
    model_json = model.to_json()

    # Save the weights before reloading the model.
    tmp_weights_path = os.path.join(tempfile.gettempdir(), 'tmp_weights.h5')
    model.save_weights(tmp_weights_path)

    # Reload the model
    model = keras.models.model_from_json(model_json)
    model.load_weights(tmp_weights_path, by_name=True)

    return model

Also note that depending on the optimizer you use, L2 regularization does not necessarily correspond to weight decay (https://arxiv.org/abs/1711.05101).

batzner on 15 Dec 2018

👍5

Thanks @batzner, very well written. It did change the loaded model's regularizers.

# Add weight decay to the whole model
regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
decay_attributes = ['kernel_regularizer', 'bias_regularizer',
                    'beta_regularizer', 'gamma_regularizer']

for layer in model.layers:
    for attr in decay_attributes:
        if hasattr(layer, attr):
            setattr(layer, attr, regularizer)

For example, the kernel regularizer of the first conv layer of Keras mobilenet_v2 with the weight decay = 0.00004 looks as follows:

Conv1: {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 3.9999998989515007e-05}}
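The value is not exactly 0.00004, presumably because the coefficient is stored as a 32-bit float (the default float type in Keras). That is an assumption about the cause, but the rounding itself can be reproduced with the standard library alone:

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


weight_decay = 0.00004
stored = to_float32(weight_decay)

# The stored value is the closest float32 to 0.00004, not 0.00004 itself.
print(repr(stored))
print(stored == weight_decay)
```

The difference is far below anything that matters for training, so the config entry can be read as "l2 = 0.00004".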

nicolefinnie on 27 Dec 2018

👍1

# Add weight decay to the whole model
regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
decay_attributes = ['kernel_regularizer', 'bias_regularizer',
                    'beta_regularizer', 'gamma_regularizer']

for layer in model.layers:
    for attr in decay_attributes:
        if hasattr(layer, attr):
            setattr(layer, attr, regularizer)

Cool, it solved my problem, thank you.

Tiaspetto on 3 Jan 2019

Thanks @batzner, very well written. It did change the loaded model's regularizers.
# Add weight decay to the whole model
regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
decay_attributes = ['kernel_regularizer', 'bias_regularizer',
                    'beta_regularizer', 'gamma_regularizer']

for layer in model.layers:
    for attr in decay_attributes:
        if hasattr(layer, attr):
            setattr(layer, attr, regularizer)
For example, the kernel regularizer of the first conv layer of Keras mobilenet_v2 with the weight decay = 0.00004 looks as follows:
Conv1: {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 3.9999998989515007e-05}} 

Nice solution, thank you

mrluin on 6 Jan 2019

@batzner's code does not work, because it only changes the model config and does not create the additional loss tensors themselves. One can check this easily by running

for l in model.layers:
    print(l.losses)

A workaround is the following code

from keras.models import model_from_json
from keras.regularizers import l2

def create_model():
    model = your_model()
    model.save_weights("tmp.h5")

    # optionally do some other modifications (freezing layers, adding convolutions etc.)
    # ...

    regularizer = l2(WEIGHT_DECAY / 2)
    for layer in model.layers:
        for attr in ['kernel_regularizer', 'bias_regularizer']:
            if hasattr(layer, attr) and layer.trainable:
                setattr(layer, attr, regularizer)

    # rebuild the model from its config so the regularization losses are created
    out = model_from_json(model.to_json())
    out.load_weights("tmp.h5", by_name=True)

    return out

lars76 on 22 Jan 2019

👍9 ❤3 🎉2

@lars76 Thanks for pointing it out! I didn't know only the config got changed, and issue 12053 was opened just 6 days ago. Also, the way Keras handles the regularization loss differs from TensorFlow, see issue 21587. You're right about layer.losses, since model.compile() in training.py merely adds the network losses to the total loss, and a regularization loss tensor has to be added by that point. Your workaround works for me; at least I can see that during the compile step it adds a loss tensor for each layer regularizer.

So if you just need to load a pre-defined model without pre-trained weights, you don't need to save/load weights.

import tensorflow as tf
from tensorflow.keras.models import model_from_json

model = tf.keras.applications.MobileNetV2(weights=None)
# Add regularizer to your layers
# ...
model = model_from_json(model.to_json())
model.compile(...)

To test if it works

for layer in model.layers:
    print('%s: %s ' % (layer.get_config().get('name'), layer.losses) )

and you should see a result like this:

Conv1: [<tf.Tensor 'Conv1_1/kernel/Regularizer/add:0' shape=() dtype=float32>]

Excellent workaround, thanks!

nicolefinnie on 22 Jan 2019

👍1

@lars76 thank you for the hint! I updated my comment to make clear that the naïve solution only changes the model's config. I also added a note to highlight that L2 regularization doesn't directly correspond to weight decay.

As for the paper you referenced (https://arxiv.org/pdf/1810.12281.pdf), it states:

For clarity, we ignore the parameters γ and β, which do not impact the performance in practice.

As far as I understand it, they disable the learnable affine parameters of Batch Normalization completely and therefore don't regularize them. But if I were to enable them, is it harmful to also regularize them?

batzner on 22 Jan 2019

👍1

No, you can also regularize them. I just disabled gamma/beta because the authors of the paper wrote that it made no difference performance-wise, and older papers include only regular weights/biases (since BN didn't exist). Maybe you're even right: it might be more logical to regularize both of them too, since beta corresponds to bias weights and gamma to regular weights. I think then the correct definition for SGD weight decay is "Loss function + 1/2 sum w_i^2 where w_i are all trainable weights".

lars76 on 22 Jan 2019

How about this:

import tensorflow as tf

# a utility function to add weight decay after the model is defined.
def add_weight_decay(model, weight_decay):
    if (weight_decay is None) or (weight_decay == 0.0):
        return

    # recursion inside the model
    def add_decay_loss(m, factor):
        if isinstance(m, tf.keras.Model):
            for layer in m.layers:
                add_decay_loss(layer, factor)
        else:
            for param in m.trainable_weights:
                with tf.keras.backend.name_scope('weight_regularizer'):
                    # bind param as a default argument; a plain closure would
                    # only see the last param of the loop when called later
                    regularizer = lambda param=param: tf.keras.regularizers.l2(factor)(param)
                    m.add_loss(regularizer)

    # weight decay and l2 regularization differ by a factor of 2
    add_decay_loss(model, weight_decay/2.0)
    return
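Whenever callables are created inside a loop, as in the utility above, Python's late binding matters: a closure looks up the loop variable when it is called, not when it is defined. A minimal sketch in plain Python, with strings standing in for the weight tensors (the names are made up):

```python
params = ["kernel", "bias"]

late_bound = []
early_bound = []
for param in params:
    # Looks up `param` when called, i.e. after the loop has finished.
    late_bound.append(lambda: param)
    # The default argument pins the current value of `param`.
    early_bound.append(lambda param=param: param)

late_results = [f() for f in late_bound]
early_results = [f() for f in early_bound]

print(late_results)   # every callable sees the last loop value
print(early_results)  # each callable keeps its own value
```

For a deferred regularization loss this decides whether every callable penalizes only the last weight of a layer or each one its own weight.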

mathmanu on 16 Aug 2019

If your model contains nested Model objects, you can use this function:

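from keras.models import Model
from keras.regularizers import l2

def add_l2_weight_decay(net: Model, weights_decay=5e-4):
    # NOTE: as discussed above, setting these attributes only updates the
    # config; reload the model (model_from_json) so the losses are created.
    reg = l2(weights_decay)
    for layer in net.layers:
        if isinstance(layer, Model):
            add_l2_weight_decay(layer, weights_decay)
        for attr in ['kernel_regularizer', 'bias_regularizer']:
            if hasattr(layer, attr) and layer.trainable:
                setattr(layer, attr, reg)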