Can we see your implementation?
@fchollet declared a hold on all PRs that do not address the backend abstraction. We will be working solely on getting Keras with TensorFlow + Theano working and up to date. So if we accept PRs to master right now, we will have to change them later anyway. I recommend you wait a bit, because there may be changes (although very small, maybe a single character) in the way we write layers in Keras. For now, I recommend developing your model in parallel in your own GitHub repo. Also, if you can get help from @farizrahman4u, I'm sure the result will be awesome.
@farizrahman4u Here is the attention layer. It only works for Graph though.
It requires a small hack in the core layer, and the naming convention is a bit off, for which I need help from the Keras community. It will be much better in a couple of commits.
An encoder-decoder with attention would be like this:
model = Graph()
model.add_input(name='input', input_shape=(None, len(chars)))
model.add_node(RNN(128), name='encoder_rnn', input='input')
model.add_node(RepeatVector(MAXLEN), name='recurrent_context', input='encoder_rnn')
model.add_node(RNN(256, return_sequences=True), name='encoder_context', input='input')
model.add_node(TimeDistributedAttention(prev_dim=128, att_dim=64, return_sequences=True), name='attention', inputs=['encoder_context', 'recurrent_context'], merge_mode='join_att')
model.add_node(TimeDistributedDense(len(chars)), name='tdd', input='attention')
model.add_node(Activation('softmax'), name='softmax', input='tdd')
model.add_output(name='output', input='softmax')
Or a visual-attention model:
image_model = Graph()
image_model.add_input(name='input', input_shape=Ximages[0].shape)
image_model.add_node(Convolution2D(12, 3, 3, border_mode='full'), name='c1', input='input')
image_model.add_node(Activation('relu'), name='a1', input='c1')
image_model.add_node(Convolution2D(12, 3, 3), name='c2', input='a1')
image_model.add_node(Activation('relu'), name='a2', input='c2')
image_model.add_node(MaxPooling2D(pool_size=(2, 2)), name='p1', input='a2')
image_model.add_node(Convolution2D(10, 3, 3, border_mode='full'), name='c3', input='p1')
image_model.add_node(Activation('relu'), name='a3', input='c3')
image_model.add_node(Convolution2D(10, 3, 3), name='c4', input='a3')
image_model.add_node(Activation('relu'), name='a4', input='c4')
image_model.add_node(PreAttention(), name='pre_attention', input='c4')
image_model.add_node(DenseAttention(att_dim=128), name='dense_attention', input='pre_attention')
image_model.add_node(Dense(answer_size), name='d', input='dense_attention')
image_model.add_node(Activation('softmax'), name='softmax', input='d')
@wolet You should be using the LambdaMerge layer instead of the 'hack'. That way, this could be merged easily into Keras without changing the core layers (after we are done with TensorFlow, of course), and it would work seamlessly in both Sequential and Graph models.
@farizrahman4u I did not know about LambdaMerge, thanks for pointing it out!
You could also see #1051
@farizrahman4u I could not find a way to use LambdaMerge in Graph models. Would you mind giving me a simple example?
(Not tested)
def func(X):
    # your merge function here. X is a list of input tensors
    # this function should output the merged tensor
    pass

def output_shape(shapes):
    # shapes = list of output shapes of the input tensors
    # this function should output the shape of the merged tensor
    pass
input1 = Dense(....)
input2 = Dense(....)
lambda_merge = LambdaMerge([input1, input2], func, output_shape)
graph = Graph()
graph.add_input(input1, name='input1')
graph.add_input(input2, name='input2')
graph.add_node(lambda_merge, name='lambda_merge')
graph.add_node(Dense(....), name='dense1', input='lambda_merge')
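For concreteness, func and output_shape might be filled in along these lines, e.g. for a simple elementwise-sum merge (my own untested sketch, assuming both inputs have the same shape):
def func(X):
    # X is a list of input tensors; merge them by elementwise addition
    return X[0] + X[1]

def output_shape(shapes):
    # shapes is a list of the input shapes; an elementwise sum keeps the shape of either input
    return shapes[0]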
@wolet do you plan to convert this code to the generic Keras backend (which supports both TensorFlow and Theano)? I am going to need an attention layer in the near future and you have already put so much work into this, so I don't see a reason to implement it from scratch :)
@jfsantos I haven't used the new API since the TensorFlow changes. I will check the new API and see how I can contribute.
In my environment,
graph.add_input(input1, name='input1')
causes
TypeError: add_input() got multiple values for keyword argument 'name'
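That error is presumably because Graph.add_input expects a name and an input_shape rather than a layer object. An untested sketch of how the example above might be adapted (the layer sizes and shapes here are placeholders of my own):
graph = Graph()
graph.add_input(name='input1', input_shape=(16,))
graph.add_input(name='input2', input_shape=(16,))
graph.add_node(Dense(32), name='dense_a', input='input1')
graph.add_node(Dense(32), name='dense_b', input='input2')
# LambdaMerge takes the already-added nodes as its inputs, so the merge node
# itself is added without an explicit input argument
lambda_merge = LambdaMerge([graph.nodes['dense_a'], graph.nodes['dense_b']], func, output_shape)
graph.add_node(lambda_merge, name='lambda_merge')
graph.add_node(Dense(1), name='dense1', input='lambda_merge')
graph.add_output(name='output', input='dense1')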
I figured that someone might find an example of an attention layer like in 1506.03340 or 1511.04108 useful, so here is mine.
The setup: transforming a sequence of embeddings e1s into e1sm by multiplying it with per-token attention weights. The attention is determined by similarity with another embedding e0a, and it is focused on a single point or a few points in the sequence by a softmax, as in the papers above. The per-token attention scalar can be generated in a couple of ways; the original papers use w*tanh(e0a + W*e1s).
model.add_node(name='e1sa', input='e1s',  # consider another nonlinearity here
               layer=TimeDistributedDense(input_dim=int(N*sdim), output_dim=int(N*adim), W_regularizer=l2(l2reg)))
model.add_node(name='e0sa', input='e0a',
               layer=RepeatVector(s1pad))
model.add_node(name='esa[0]', inputs=['e0sa', 'e1sa'], merge_mode='sum',
               layer=Activation(T.tanh))
model.add_node(name='esa[1]', input='esa[0]',
               layer=TimeDistributedDense(input_dim=int(N*adim), output_dim=1, W_regularizer=l2(l2reg)))
model.add_node(name='esa[2]', input='esa[1]',
               layer=Flatten(input_shape=(s1pad, 1)))
model.add_node(name='esa[3]', input='esa[2]',
               layer=Activation('softmax'))
# and now just multiply timewise
model.add_node(name='esa[4]', input='esa[3]',
               layer=RepeatVector(int(N*sdim)))
model.add_node(name='esa', input='esa[4]',
               layer=Permute((2, 1)))
model.add_node(name='e1sm', inputs=['e1s', 'esa'], merge_mode='mul',
               layer=Activation('linear'))
Posting it here as it was a bit difficult for me to figure out as a Keras/Theano newbie. I don't know if it's worth making a dedicated layer in the Keras API for this, though. For one thing, in my experiments I've found dot-product similarity to work a lot better than the weighted sum (but I'm still researching this):
def batched_batched_dot(s):
    """ from (x,y,z)-shaped pair, produce (x,y)-shaped pair that replaces the z-vector pairs by their dot-products """
    import theano
    import theano.tensor as T
    return theano.scan(fn=lambda xm, ym: T.batched_dot(xm, ym),
                       outputs_info=None, sequences=s, non_sequences=None)[0]

model.add_node(name='esa[0]',  # nested batched_dot
               layer=LambdaMerge([model.nodes['e0sa'], model.nodes['e1sa']],
                                 batched_batched_dot,
                                 lambda s: (s[1][0], s[1][1])))
model.add_node(name='esa[3]', input='esa[0]',
               layer=Activation('softmax'))
I hope to soon submit a PR adding an example that does some serious stuff with NLP embedding sequences and includes the attention mechanism, though!
I just wanted to add a link to a standalone example of that, which may be also easier to read:
https://github.com/brmson/dataset-sts/blob/master/examples/anssel_attn.py
(after a little more work, I intend to contribute a simplified version as an example in Keras itself)
@pasky Thanks a lot!
@pasky / @wolet did you ever port this to the generic Keras backend?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.