I want to stack two bidirectional RNNs and I was wondering what the correct way to do it is. My main question is what data each of the deeper RNNs reads. To be more specific:
Let's say that we have two RNNs, each one with 64 cells; let's call them F1 and B1 (F = forward, B = backward). Let's say that for our problem we just want to create a dense representation of our input. In the case of bidirectional RNNs, we take the output from the last timestep of each one and concatenate the two to get a vector of size 128.
output = Bidirectional(GRU(64))(input)
Now, in order to get a higher-level representation, I want to stack another bidirectional layer on top, again with each RNN having 64 cells: F2 and B2. Is this correct?
output1 = Bidirectional(GRU(64, return_sequences=True))(input)
output2 = Bidirectional(GRU(64))(output1)
This example runs just fine. But is it the _correct way_ to do it?
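For clarity, here is a minimal sketch of the shapes this stack produces (the input sizes are just toy numbers I made up for illustration):

```python
from keras.layers import Input, GRU, Bidirectional
from keras.models import Model

inp = Input(shape=(10, 8))  # toy example: 10 timesteps, 8 features per step

# first bidirectional layer: 64 forward + 64 backward units, concatenated per timestep
seq = Bidirectional(GRU(64, return_sequences=True))(inp)  # (None, 10, 128)

# second bidirectional layer: only the last forward and last backward states, concatenated
vec = Bidirectional(GRU(64))(seq)  # (None, 128)

Model(inp, vec).summary()
```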
Question 1 - How exactly does the previous example work? I want to understand what exactly the input of the second forward RNN, F2, is.
A) Is it the output(s) of just F1?
B) Is it the concatenation of the outputs of both F1 and B1?
Which one is it?
Question 2 - Is there even a correct way (A or B)? Are there scenarios where both make sense?
Q1: It's B), since the Bidirectional wrapper (by default) returns the concatenation of F1 and B1.
Q2: I don't think it makes sense to speak of a "correct way" when talking about neural network architectures, but FWIW, I would have interpreted "a stack of bi-RNNs" to refer to option B).
_Careful, personal intuition follows:_ I believe that constructing it like option A) would lead to a less powerful model, since you would only combine the information from the forward and backward passes at the very end, for the output. With option B), in contrast, the second RNN of the stack can already make use of both forward and backward information when calculating its own representations.
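To make the default explicit: the `Bidirectional` wrapper takes a `merge_mode` argument, and `'concat'` is simply its default value. A small sketch with toy shapes:

```python
from keras.layers import Input, GRU, Bidirectional

inp = Input(shape=(10, 8))  # toy shape: 10 timesteps, 8 features

# default: per-timestep concatenation of F1 and B1 -> (None, 10, 128)
concat = Bidirectional(GRU(64, return_sequences=True), merge_mode='concat')(inp)

# element-wise sum instead of concatenation -> (None, 10, 64)
summed = Bidirectional(GRU(64, return_sequences=True), merge_mode='sum')(inp)

# no merging: a list [forward_outputs, backward_outputs], each (None, 10, 64)
fwd, bwd = Bidirectional(GRU(64, return_sequences=True), merge_mode=None)(inp)
```

So if you really wanted something like option A), `merge_mode=None` gives you the two directions separately, and you could stack a forward RNN on `fwd` and a backward RNN on `bwd` yourself.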
@mbollmann Thanks! I needed to know how Keras works in order to think more clearly about the design of my architecture. I also tried this:
forw = GRU(units, return_sequences=True)(_input)
forw = GRU(units)(forw)
back = GRU(units, return_sequences=True, go_backwards=True)(_input)
back = GRU(units, go_backwards=True)(back)
output = merge([forw, back], mode='concat')
and it seems that your intuition is correct, as I got inferior performance compared to simply stacking two Bidirectional layers.
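(For newer Keras versions, where the old `merge(..., mode='concat')` helper no longer exists, the same option-A stack would look roughly like this; sizes are toy numbers:)

```python
from keras.layers import Input, GRU, concatenate
from keras.models import Model

_input = Input(shape=(10, 8))  # toy shape
units = 64

forw = GRU(units, return_sequences=True)(_input)
forw = GRU(units)(forw)

back = GRU(units, return_sequences=True, go_backwards=True)(_input)
back = GRU(units, go_backwards=True)(back)

# keras.layers.concatenate replaces merge([...], mode='concat')
output = concatenate([forw, back])  # (None, 128)
model = Model(_input, output)
```

Note that this keeps the same `go_backwards` stacking, so the ordering issue raised further down in this thread applies here as well.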
And why not this?
rnn = GRU(units, return_sequences=True)(_input)
rnn = GRU(units, return_sequences=True, go_backwards=True)(rnn)
rnn = GRU(units, return_sequences=True)(rnn)  # return_sequences=True is needed here so the next GRU receives a 3D input
rnn = GRU(units, go_backwards=True)(rnn)
It is something I see quite often.
Hi, in my case your code doesn't seem to work: the second backward layer returns all zeros. In my results, the first backward layer returns its output sequence in reversed time order. So I think we have to reverse that output before passing it to the second backward layer (and reverse its output again only if it returns sequences), like the code below:
back = GRU(units, return_sequences=True, go_backwards=True)(_input)
back = layers.Lambda(lambda x: K.reverse(x, axes=-2))(back)  # reverse the backward GRU's outputs back to the original time order
back = GRU(units, go_backwards=True)(back)
# back = layers.Lambda(lambda x: K.reverse(x, axes=-2))(back)  # only needed if the layer above uses return_sequences=True
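Putting it all together, here is how I understand the corrected version as a self-contained sketch (toy sizes; since the last backward layer only returns its final state, no reverse is needed after it):

```python
from keras import backend as K
from keras.layers import Input, GRU, Lambda, concatenate
from keras.models import Model

_input = Input(shape=(10, 8))  # toy shape: 10 timesteps, 8 features
units = 64

# forward stack
forw = GRU(units, return_sequences=True)(_input)
forw = GRU(units)(forw)

# backward stack
back = GRU(units, return_sequences=True, go_backwards=True)(_input)  # outputs come back in reversed time order
back = Lambda(lambda x: K.reverse(x, axes=-2))(back)                 # restore the original time order
back = GRU(units, go_backwards=True)(back)                           # returns only its final state

output = concatenate([forw, back])  # (None, 128)
model = Model(_input, output)
```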