Hello!
I was looking at my network weights when I realized that all the bias values in a Dense layer were 0s. In the Dense layer initialization I saw: self.b = shared_zeros((self.output_dim)). Shouldn't it be self.b = shared_ones((self.output_dim))?
I don't know whether I'm misunderstanding something, or whether there is code elsewhere that changes it. I'm kind of new to Theano.
Starting with zeros for the bias is the norm in fully-connected layers.
It has recently been shown that biases initialized to one work better in LSTMs, though. But LSTMs are very different from feedforward dense nets.
Thanks for the quick answer @fchollet!
Do you know which paper states that?
Another question, not really related to the bias, but to the shared_zeros in the SGD optimizer's get_updates:
m = shared_zeros(p.get_value().shape) # momentum
v = self.momentum * m - lr * g # velocity
Since m == 0, doesn't that mean that self.momentum * m == 0, and therefore v = -lr * g?
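For reference, here is a plain-numpy sketch of what I think that update scheme does, assuming (m, v) gets appended to the updates list so that m carries the previous velocity into the next step (in that case m would only be zero on the very first iteration):

import numpy as np

# Toy illustration (not the Theano code itself): classic momentum on a
# single parameter. m starts at zero, so the first step is plain SGD,
# but once m is updated to v it carries the running velocity.
lr, momentum = 0.1, 0.9
p = np.array([1.0])          # parameter
m = np.zeros_like(p)         # momentum buffer, starts at 0

for step in range(3):
    g = 2 * p                # gradient of p**2, just as an example loss
    v = momentum * m - lr * g
    m = v                    # what the (m, v) shared-variable update would do
    p = p + v
    print(step, m, p)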
Re: which paper, whilst it was introduced far earlier (Gers et al., 2000), the improvements were again shown in a variety of tasks in An Empirical Exploration of Recurrent Network Architectures (Jozefowicz et al., 2015):
"We found that adding a bias of 1 to the LSTM’s forget gate closes the gap between the LSTM and the GRU."
Thank you, @Smerity!
Do you have an answer to my previous question about the velocity calculation?
@fchollet, I wouldn't be surprised that initializing biases to zero is standard, but it seems to be unlikely to be optimal for many normal problems one might want to solve with FC NNs. For instance, consider regression with relu-style activations. If the biases are initialized to zero, the half-planes of positive activation specified by each neuron are starting in a non-generic configuration (the bounding hyperplanes all pass through the origin)! This seems like a terrible way to start finding a solution to the regression problem since you would want the positive domains to have irregular overlaps.
It would be nice to have options for bias initialization.
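In the meantime, one possible workaround (just a sketch, assuming a Keras version where a built Dense layer's weights come back as [W, b]) is to overwrite the bias after the model is built via get_weights/set_weights:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense

model = Sequential()
model.add(Dense(64, input_dim=100, activation='relu'))

# Replace the zero bias with small random values so the ReLU hinge
# hyperplanes no longer all pass through the origin.
W, b = model.layers[-1].get_weights()
model.layers[-1].set_weights([W, np.random.uniform(-0.5, 0.5, size=b.shape)])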
Or add a Bias core layer
In this famous bioinformatics paper:
http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf
the bias of the last Dense layer is initialized to -4.0 because the dataset has a background bias:
deepbind_model = [
    motif_scan(num_motifs = 16,
               motif_len = 24,
               weight_decay = loguniform(1e-10, 1e-3),
               init_scale = loguniform(1e-7, 1e-3)),
    bias(),
    rectify(),
    maxpool(),
    full(num_units = 32,
         weight_decay = loguniform(1e-10, 1e-3),
         init_scale = loguniform(1e-5, 1e-2)),
    rectify(),
    dropout(expected_value = choice([0.5, 0.75, 1.0])),
    full(num_units = 1,
         weight_decay = loguniform(1e-10, 1e-3),
         init_scale = loguniform(1e-5, 1e-2)),
    bias(init_bias = -4.0),
]
@fchollet or add a Bias layer to Core.
In this famous bioinformatics paper published in Nature
http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf
the bias of the final Dense layer is initialized to -4.0 because of the dataset.
Nice find, SunYu!