Hello!
I was looking at my network weights when I realized that all the bias values in a Dense layer were 0s. In the Dense layer initialization I saw: self.b = shared_zeros((self.output_dim)). Shouldn't it be self.b = shared_ones((self.output_dim))?
I don't know whether I'm misunderstanding something, or whether there is code elsewhere that changes it. I'm kind of new to Theano.
Starting with zeros for the bias is the norm in fully-connected layers.
It has recently been shown that biases initialized to one work better in LSTMs, though. But LSTMs are very different from feedforward dense nets.
Thanks for the quick answer @fchollet!
Do you know which paper states that?
Another question, not really related to the bias, but to the shared_zeros in the SGD optimizer's get_updates:
m = shared_zeros(p.get_value().shape) # momentum
v = self.momentum * m - lr * g # velocity
Since m == 0, doesn't that mean that self.momentum * m == 0, and therefore v = -lr * g?
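For reference, here is a plain-numpy sketch of what I think that update scheme does, assuming (m, v) gets appended to the updates list so that m carries the previous velocity into the next step (in that case m would only be zero on the very first iteration):

import numpy as np

# Toy illustration (not the Theano code itself): classic momentum on a
# single parameter. m starts at zero, so the first step is plain SGD,
# but once m is updated to v it carries the running velocity.
lr, momentum = 0.1, 0.9
p = np.array([1.0])          # parameter
m = np.zeros_like(p)         # momentum buffer, starts at 0

for step in range(3):
    g = 2 * p                # gradient of p**2, just as an example loss
    v = momentum * m - lr * g
    m = v                    # what the (m, v) shared-variable update would do
    p = p + v
    print(step, m, p)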
Re: which paper, whilst it was introduced far earlier (Gers et al., 2000), the improvements were again shown in a variety of tasks in An Empirical Exploration of Recurrent Network Architectures (Jozefowicz et al., 2015):
"We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU."
Thank you, @Smerity!
Do you have an answer to my previous question about the velocity calculation?
@fchollet, I wouldn't be surprised that initializing biases to zero is standard, but it seems to be unlikely to be optimal for many normal problems one might want to solve with FC NNs. For instance, consider regression with relu-style activations. If the biases are initialized to zero, the half-planes of positive activation specified by each neuron are starting in a non-generic configuration (the bounding hyperplanes all pass through the origin)! This seems like a terrible way to start finding a solution to the regression problem since you would want the positive domains to have irregular overlaps.
It would be nice to have options for bias initialization.
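In the meantime, one possible workaround (just a sketch, assuming a Keras version where a built Dense layer's weights come back as [W, b]) is to overwrite the bias after the model is built via get_weights/set_weights:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense

model = Sequential()
model.add(Dense(64, input_dim=100, activation='relu'))

# Replace the zero bias with small random values so the ReLU hinge
# hyperplanes no longer all pass through the origin.
W, b = model.layers[-1].get_weights()
model.layers[-1].set_weights([W, np.random.uniform(-0.5, 0.5, size=b.shape)])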
Or add a Bias core layer
In this famous bioinformatics paper:
http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf
the bias of the last Dense layer is initialized to -4.0 because the dataset has a background bias:
deepbind_model = [
    motif_scan(num_motifs = 16,
               motif_len = 24,
               weight_decay = loguniform(1e-10, 1e-3),
               init_scale = loguniform(1e-7, 1e-3)),
    bias(),
    rectify(),
    maxpool(),
    full(num_units = 32,
         weight_decay = loguniform(1e-10, 1e-3),
         init_scale = loguniform(1e-5, 1e-2)),
    rectify(),
    dropout(expected_value = choice([0.5, 0.75, 1.0])),
    full(num_units = 1,
         weight_decay = loguniform(1e-10, 1e-3),
         init_scale = loguniform(1e-5, 1e-2)),
    bias(init_bias = -4.0),
]
@fchollet or add a Bias layer to Core.
In this famous bioinformatics paper published in Nature
http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf
the bias of the final Dense layer is initialized to -4.0 because of the dataset.
Nice find, SunYu!