Keras: Attention Model Available!

Created on 24 Mar 2016 · 14 Comments · Source: keras-team/keras

Hi,

I implemented an attention model for textual entailment problems. Here is the code. It performs a bit worse than the paper, but works decently well. I hope this comes in handy for beginners in Keras like me.

Comments are welcome!

Shout-outs to @farizrahman4u, @fchollet, and @pasky for their help and patience in answering queries on GitHub.

All 14 comments

Awesome job.
A minor comment on L139:
https://github.com/shyamupa/snli-entailment/blob/master/amodel.py#L139
The TimeDistributedDense layer produces a 3D tensor of shape (batch_size, L, 1), so applying the softmax activation there gives an incorrect result: the last dimension has only a single unit, so the softmax outputs a constant 1 and the attention weights lose their meaning.
I think you can use TimeDistributedDense with a linear activation, then Flatten it (to get a 2D tensor), and apply the softmax afterwards.
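A minimal sketch of that suggested fix (hypothetical dimensions; Keras 1.x-era Sequential API, not the actual amodel.py code):

```python
from keras.models import Sequential
from keras.layers.core import TimeDistributedDense, Flatten, Activation

L, hidden_dim = 20, 150  # hypothetical sequence length and encoder output size

model = Sequential()
# one linear score per timestep -> (batch_size, L, 1)
model.add(TimeDistributedDense(1, activation='linear', input_shape=(L, hidden_dim)))
# flatten to (batch_size, L) so the softmax normalizes across timesteps,
# not across the single unit in the last dimension
model.add(Flatten())
model.add(Activation('softmax'))
```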

@ymcui Good catch! I tried your modification and am noticing some improvements. Thanks!

That's nice work!

What I got stuck on, however, when I was thinking about this is that in the paper they use two different RNNs in series, whereas you use only a single common RNN for both premise and hypothesis. I think that would probably require some small Keras modifications to allow an "initialize from node" mechanism.

True. I implemented what they call shared encoding; the difference between the two models is about 2 points in their experiments. Lasagne has a feature to initialize the hidden state, but writing this model there would lead to code bloat. Maybe something can be done for Keras RNNs too? :) @fchollet
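For anyone unfamiliar with the term, here is a minimal sketch of shared encoding (hypothetical dimensions; Keras functional API with layer sharing, which may differ from the actual amodel.py code): the same embedding and GRU are applied to both premise and hypothesis.

```python
from keras.layers import Input, Embedding, GRU

vocab_size, embed_dim, hidden_dim, max_len = 10000, 100, 150, 20  # hypothetical

premise = Input(shape=(max_len,), dtype='int32')
hypothesis = Input(shape=(max_len,), dtype='int32')

embed = Embedding(vocab_size, embed_dim)          # shared embedding, trained with the model
encoder = GRU(hidden_dim, return_sequences=True)  # single shared encoder for both inputs

premise_enc = encoder(embed(premise))        # (batch_size, max_len, hidden_dim)
hypothesis_enc = encoder(embed(hypothesis))  # (batch_size, max_len, hidden_dim)
# attention over premise_enc conditioned on the hypothesis, followed by a classifier,
# would complete the model
```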

Oh, I somehow missed that experiment. So this isn't that important. Nice!

Depends on what you mean by important (2% on that dataset is about 200 questions). Also note that I train the embeddings along with the model, while they fix them to word2vec/GloVe vectors.

@shyamupa The implementation of the Bi-GRU seems problematic ( https://github.com/shyamupa/snli-entailment/blob/master/amodel.py#L128). This is an old issue: #2074 #1725 #1703 #1674 #1432 #1282. Any plan to fix it officially? @fchollet

I see, I was not aware of this. I was using LSTMs earlier, but switched to GRUs because they are supposed to train faster. I hope LSTMs don't have the same issue...

The LSTMs have the same issue.

@DingKe I don't think there are plans to fix the go_backwards behavior, because it is consistent with the go_backwards behavior of Theano's scan; at least it won't be solved at the backend level. I think something needs to be done in the Recurrent class, however. I originally added go_backwards to the Recurrent class by simply wrapping the Theano scan keyword, but we need to fix this issue at least in the example.

My understanding is that this is mainly a matter of someone finding the time to submit a patch that flips go_backwards sequences in the Recurrent class frontends (after applying K.rnn), checks that the same is done for masks, and writes some test cases + docs. Just a little grunt work. I hoped to do it, but I can't seem to get around to attending properly to even a simpler PR that's already open... :(
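To make the ordering issue concrete, here is a purely conceptual NumPy illustration (not Keras internals; shapes are hypothetical) of the flip such a patch would apply after the backward scan:

```python
import numpy as np

# With go_backwards=True the scan visits timesteps in reverse, so the per-timestep
# outputs (and any mask) come back in reverse time order; the patch described above
# would flip them back so they line up with the input sequence.
outputs = np.random.rand(2, 5, 3)      # (batch, time, features), hypothetical
mask = np.ones((2, 5), dtype=bool)     # hypothetical sequence mask

outputs_aligned = outputs[:, ::-1, :]  # flip the time axis back to input order
mask_aligned = mask[:, ::-1]           # the mask must be flipped the same way
```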

Hi,

I'm trying to implement a similar attention model in Keras. Does the go_backwards bug still exist? If so, can someone give a small example of how to fix it?

Thanks.

Hello,
Is there any idea how to implement an attention mechanism with masking?

I've just started a project to collect all the possible information about attention with Keras:

https://github.com/philipperemy/keras-attention-mechanism

Check this out! It's still at an early stage. I'm currently working on it!
