Keras: How to add Attention on top of a Recurrent Layer (Text Classification)

Created on 7 Jan 2017  ·  114 Comments  ·  Source: keras-team/keras

I am doing text classification. I am using my pre-trained word embeddings, and I have an LSTM layer on top with a softmax at the end.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import activity_l2

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Pretty simple. Now I want to add attention to the model, but I don't know how to do it.

My understanding is that I have to set return_sequences=True so that the attention layer can weigh each timestep accordingly. That way the LSTM will return a 3D tensor, right?
After that, what do I have to do?
Is there a way to easily implement a model with attention using the available Keras layers, or do I have to write my own custom layer?

If this can be done with the available Keras layers, I would really appreciate an example.

Most helpful comment

@patyork, I'm sorry, but I don't see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM's hidden state at a given timestep. The output of the softmax is then used to modify the LSTM's internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I've shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won't claim 100% correctness of my implementation (though I'd appreciate any hints if something seems terribly wrong!), but I'd be surprised if it was as simple as adding a softmax activation.

All 114 comments

It's been a while since I've used attention, so take this with a grain of salt.

return_sequences does not necessarily need to be True for attention to work; the underlying computation is the same, and this flag should be used only based on whether you need 1 output or an output for each timestep.

As for implementing attention in Keras, there are two possible methods: a) add a hidden Activation layer for the softmax or b) change the recurrent unit to have a softmax.

On option a): this would apply attention to the output of the recurrent unit but not to the output/input passed to the next timestep. I don't think this is what is desired. In this case, the LSTM should have a squashing function applied, as LSTMs don't do too well with linear/relu-style activations.

On option b): this would apply attention to the output of the recurrency, and also to the output/input passed to the next timestep. I think that this is what is desired, but I could be wrong. In this case, the linear output of the neurons would be squashed directly by the softmax; if you wish to apply a pre-squashing such as sigmoid or tanh before the softmax calculation, you would need a custom activation that does both in one step.

I could draw a diagram if necessary, and I should probably read the activation papers again..

@patyork Thanks for the reply.
Do you have a good paper (or papers) in mind (for attention)? I am reading a lot about attention and I want to try it out, because I really like the idea. But even though I think I understand the concept, I don't have a clear understanding of how it works and how to implement it.

If it is possible, I would like someone to offer an example in Keras.

PS: Is this the correct place to ask such a question, or should I do it at https://groups.google.com/d/forum/keras-users?

@baziotis This area is supposed to be more for bugs as opposed to "how to implement" questions. I admit I don't often look at the google group, but that is a valid place to ask these questions, as well as on the Slack channel.

Bengio et al. have a pretty good paper on attention (soft attention is the softmax attention).

An example of method a) I described:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Activation('softmax')) #this guy here
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b), with simple activation:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation='softmax'))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b) with a squashing activation (tanh) and then softmax (non-working as written, but it shows the idea):

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

from keras import backend as K

def myAct(out):
    return K.softmax(K.tanh(out))

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation=myAct))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

In addition, I should say that my notes about whether a) or b) above is what you probably need are based on your example, where you want one output (making option b probably the correct way). Attention is often used in spaces like caption generation where there is more than one output, such as when setting return_sequences=True. For those cases, I think that option a) is the described usage, such that the recurrency keeps all the information passing forward, and it's just the higher layers that utilize the attention.

@patyork Thanks for the examples and for the paper. I knew that posting here would get more attention :P

I will try them and post back.

@patyork, I'm sorry, but I don't see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM's hidden state at a given timestep. The output of the softmax is then used to modify the LSTM's internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I've shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won't claim 100% correctness of my implementation (though I'd appreciate any hints if something seems terribly wrong!), but I'd be surprised if it was as simple as adding a softmax activation.

@mbollmann You are correct that none of the solutions @patyork posted is what I want. I want to get a weight distribution (importance) over the outputs from each timestep of the RNN, like in the paper "Hierarchical Attention Networks for Document Classification", but in my case I just want the representation of a sentence. I am trying to implement this using the available Keras layers.

Similar idea in this paper.

@baziotis That indeed looks conceptually much simpler. I could just take a very short glance right now, but is there a specific point where you got stuck?

@mbollmann Please do if you can.
I am trying to implement it right now and trying to understand the Keras API.

I don't have a working solution, but I think I should set return_sequences=True in the RNN in order to get the intermediate outputs, and masking=False.
On top of that, I am thinking I should put a TimeDistributed(Dense(1)) with a softmax activation. But I haven't figured out how to put everything together.

Also, I think that setting masking=False won't affect the performance, as the attention layer will assign the correct weights to the padded words. Am I right?

Edit: To clarify, I want to implement an attention mechanism like the one in [1].
[image: the attention mechanism from the paper]

  1. Zhou, Peng, et al. "Attention-based bidirectional long short-term memory networks for relation classification." The 54th Annual Meeting of the Association for Computational Linguistics. 2016.

I tried this:

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
                            trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(64, return_sequences=True)(embedded)

# attention
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)

activations = Merge([activations, attention], mode='mul')

probabilities = Dense(3, activation='softmax')(activations)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

and I get the following error:

  File "...\keras\engine\topology.py", line 1170, in __init__
    node_indices, tensor_indices)
  File "...\keras\engine\topology.py", line 1193, in _arguments_validation
    layer_output_shape = layer.get_output_shape_at(node_indices[i])
AttributeError: 'TensorVariable' object has no attribute 'get_output_shape_at'

@baziotis The cause of the error probably is that you need to use the merge function (lowercase), not the Merge layer (uppercase).

Apart from that, as far as I understood it:

The part with the tanh activation (Equation 5 in Yang et al., Equation 9 in Zhou et al.) comes before the multiplication with a trained context vector/parameter vector which reduces the dimensionality to "one scalar per timestep". For Yang et al., that seems to be a Dense layer which doesn't yet reduce the dimensionality (though this is a little unclear to me), so I'd expect TimeDistributed(Dense(64, activation='tanh')). For Zhou et al., they just write "tanh", so you'd probably not even need a Dense layer, just the tanh activation after the LSTM.

For the multiplication with a trained context vector/parameter vector, I believe (no longer -- see EDIT) this might be a simple Dense(1) in Keras, without the TimeDistributed wrapper, since we want to have individual weights for each timestep, but I'm not totally sure about this and haven't tested it. I'd imagine something like this, but take this with a grain of salt:

    # attention after Zhou et al.
    attention = Activation('tanh')(activations)    # Eq. 9
    attention = Dense(1)(attention)                # Eq. 10
    attention = Flatten()(attention)               # Eq. 10
    attention = Activation('softmax')(attention)   # Eq. 10
    activations = merge([activations, attention], mode='mul')  # Eq. 11

(EDIT: Nope, doesn't seem that way, they train a parameter vector with dimensionality of the embedding, not a matrix with a timestep dimension.)

My apologies; this would explain why I was not impressed with the results from my "attention" implementation.

There is an implementation here that seems to be working for people.

@mbollmann you were right about the merge; it is different from Merge (#2467).

I think this is really close:

units = 64
max_length = 50

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
                            trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0))(sent_representation)
sent_representation = Flatten()(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

but I get an error because the Lambda doesn't output the right dimensions. I should be getting [1, units], right?
What am I doing wrong?


Update: I tried explicitly passing the output_shape for Lambda and the model compiles:

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0), output_shape=(units, ))(sent_representation)
# sent_representation = Flatten()(sent_representation)

but now I get the following error:

ValueError: Input dimension mis-match. (input[0].shape[0] = 128, input[1].shape[0] = 50)
Apply node that caused the error: Elemwise{Composite{(i0 * log(i1))}}(dense_2_target, Elemwise{Clip}[(0, 0)].0)
Toposort index: 155
Inputs types: [TensorType(float32, matrix), TensorType(float32, matrix)]
Inputs shapes: [(128, 3), (50, 3)]
Inputs strides: [(12, 4), (12, 4)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Sum{axis=[1], acc_dtype=float64}(Elemwise{Composite{(i0 * log(i1))}}.0)]]

Well, I found out why it wasn't working. I was expecting the input to the Lambda to be (max_length, units), but it was (None, max_length, units), so I just had to change the axis to 1. This now works.

units = 64
max_length = 50
vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]


_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=trainable,
        mask_zero=masking,
        weights=[embeddings]
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])
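
As an aside on that (None, max_length, units) shape (a quick sketch, not from the thread): the batch axis shows up directly on the symbolic outputs, which is why the timestep axis is 1 (or, equivalently, -2).

from keras.layers import Input, Embedding, LSTM

inp = Input(shape=(50,), dtype='int32')
emb = Embedding(input_dim=1000, output_dim=300, input_length=50)(inp)  # toy sizes
act = LSTM(64, return_sequences=True)(emb)
print(act._keras_shape)  # (None, 50, 64): batch axis first, timesteps at axis 1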

I would appreciate it if someone could verify that this implementation is correct.

@baziotis Looks good to me. I re-read the description in Zhou et al. and the code looks like it does what they describe. I no longer understand how what they're doing does anything useful, since the attention model only depends on the input and applies the same weights at every timestep, but ... that's probably just my insufficient understanding (I'm used to slightly different types of attention). :)

@mbollmann I am confused about the same thing. Can you give an example of the type of attention that you have in mind? I think that I have to include the word (embedding) in the calculation of the attention.

From what I understand, the Dense layer:

  1. assigns a _different_ weight (importance) to each timestep
  2. BUT the importance is static. Essentially this means that each word position in a sentence has different importance, but the importance comes from the position of the word and not the word itself.

I plotted the weights of the TimeDistributed(Dense(1, activation='tanh'))(activations) in a heatmap:

[image: heatmap of the learned weights]
My interpretation is that the positions with big weights play a more important role, so the output of the RNN for those steps will have a bigger impact on the final representation of the sentence.

The problem is that this is _static_. If an _important_ word happens to occur in a position with a small weight, then the representation of the sentence won't be good enough.

I would like some feedback on this, and preferably a good paper with a better attention mechanism.

@baziotis Are you sure you don't have it the wrong way around?

The Dense layer takes the output of the LSTM at one timestep and transforms it. The TimeDistributed wrapper applies the same Dense layer with the same weights to each timestep -- which means the output of the calculation cannot depend on the position/timestep since the Dense layer doesn't even know about it.

So my confusion seems to be of a different nature than yours. :)

(In short: I don't see what calculating a softmax and multiplying the original vector by that gets you that a plain TimeDistributed(Dense(...)) couldn't already learn. However, I work on attentional models where the output is also a time-series, which means that I have multiple output timesteps for which the model should learn to attend to different input timesteps. I think that's not directly comparable to your situation, since you only have one output.)

@mbollmann I'm also a bit confused (but I have been from the get go). I think this blog post is fairly informative, or at least has some decent pictures.

So, @baziotis is using time series with multiple output steps (LSTM, with return_sequences=True). The first dense layer is applying weights over each individual time step output from the LSTM, which I'm not sure is accomplishing the intended behavior of looking at all the past activations and assigning weights to those, as in this picture:
[image: attention diagram from the blog post]

I'm thinking the code above is just the a_{t,T} line feeding into the attention layer at each timestep. The fallout of this is that the attention is just determining which activations are important, not which timesteps are important.

@mbollmann I thought that TimeDistributed applies different weights to each timestep...
In that case everything is wrong.
How can I make it so that I apply different weights to each timestep?
Can this be done with the available Keras layers? Any hint?

TimeDistributed applies the same weight set across every timestep.

You'd need to set up a standard Dense layer as a matrix, e.g. Dense(20) where 20 is the lookback length. You'd then feed examples of 20 timesteps to train. This is where I'm quite confused about implementing attention, as in theory it looks like this lookback is infinite, not fixed at a certain length.

Sorry for the mis-click.
So if I have inputs of constant length, let's say 50, then is this what I have to do?

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(50 , activation='tanh')(activations) 
attention = Flatten()(attention)

Actually, no, I think you would just remove the TimeDistributed wrapper and keep Dense(1) - I need to implement it real quick and check some shapes though.

So I guess that is what you are looking for.

  • 50 timesteps
  • Feeds into a regular Dense(1), which provides separate weights for the 50 timesteps
  • Calculates attention and multiplies against the 50 timesteps to apply attention
  • Sums (this reduces the 50 timesteps to 1 output; this is where this attention implementation differs from what most of what I've read describes)
  • Dense layer that produces output of shape (None, 3)

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

I think this (ugly) chart maps the above out pretty well; it's up to you to determine if it makes sense for what you are doing:
[image: diagram of the model above]

@patyork Thanks! I think this is what is described in the paper.
What they are trying to do, from what I understand, is: instead of using just the last output of the RNN, they use a weighted sum of all the intermediate outputs.

I have a question about this line:

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

Why axis=-2? How does this sum the tensors? I am using axis=1.

Continuing from my last comment, this is what is described in the blog post that you mentioned. See after the image that you posted...

The y‘s are our translated words produced by the decoder, and the x‘s are our source sentence words. The above illustration uses a bidirectional recurrent network, but that’s not important and you can just ignore the inverse direction. _The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state._ The a‘s are weights that define in how much of each input state should be considered for each output. So, if a_{3,2} is a large number, this would mean that the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The a's are typically normalized to sum to 1 (so they are a distribution over the input states).

What different kind of attention do you have in mind? In the article, attention is described in the context of machine translation. In my case (classification) I just want a better representation for the sentence.

Yeah, after thinking about this, it makes sense. The softmax multiplication will weight the timestep outputs (most will be near zero, some nearer to 1), and so the sum of those will be close to the outputs of the "near to 1" timesteps - pretty clever.

In this case, axis=-2 is equivalent to axis=1; I use the reverse indexing all the time, so that I never have to remember that Keras includes the batch_size (the None aspect) in those shapes. You ran into this gotcha earlier; using the reverse indexing means I never have to think about that aspect - and you'll see that form of indexing throughout the actual Keras code for this reason.
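
For illustration (plain NumPy, not from the thread), the equivalence is easy to check on a (batch, timesteps, units)-shaped array:

import numpy as np

x = np.ones((2, 50, 64))                           # (batch, timesteps, units)
print(np.allclose(x.sum(axis=1), x.sum(axis=-2)))  # True: both reduce the timestep axis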

I just mean that implementation seems a little limiting - you have to set T=50 or another limit; it can't be an infinite or undefined variable, which means you have to throw away the first T-1 (49) outputs/training outputs. As that image leads me to believe, T should be infinite/undefined/variable, something like what the TimeDistributed wrapper could provide. Perhaps this is a good thing, perhaps not - I haven't tried both ways (obviously).

Phew, a lot happened here, and I think I agree with most of what was written. Using Dense(1) without the TimeDistributed wrapper was what I was already trying to argue for yesterday, some dozens of posts above, so that does seem correct to me as well in this scenario.

@mbollmann I read that - it seems like you talked yourself out of that at some point though, based on the edit. I was confusing/arguing with myself to no end throughout this entire issue as well.

I learned quite a bit though, at least.

@patyork @mbollmann Thank you both! I learned a lot.

Btw, after running some tests, I am not impressed. I see no obvious improvement compared to the classic scenario (using just the last timestep). But the idea is interesting...

@patyork This may be stupid, but what do you mean by saying:

it can't be an infinite or undefined variable, which means you have to throw away the first T-1 (49) outputs/training outputs.

Why are they thrown away? They are used in the weighted sum, aren't they? *
I agree that this is limiting, as it won't work with masking (series of varying lengths).

*Do you mean the timesteps that are padded to keep a constant length?

Sorry - hard concept to explain in words, so:
[image: sliding windows of T timesteps over a sequence of 8 items]

Say you have a sequence of 8 items; you want to apply attention with a lookback of 3; by default, you have to feed the first 3 items to get the first output, so you wind up with length - T + 1 or 8-3+1=6 possible outputs (the blue arrows). Looking at the diagram, there would generally be 8 outputs/targets available, but 2 were dropped (the red arrows).

Thinking some more, I've just realized the solution: add T-1 padding inputs at the beginning, like so:
[image: the same sequence with T-1 padding inputs prepended]
...so you've now got the first output (dependent on the masked/padding input and the first input) and the second output (dependent on one pad, the first, and the second inputs), as they should be.

The only rub here is how to implement it: I don't think the Lambda or Flatten layers support masking, so those padding inputs would need to be "neutral" data, as they can't be masked out easily.

Hope that makes sense.

Edit: to make clear: the colored boxes in the diagrams are the windows of T items that produce an output. There are 6 boxes (8-3+1) in the first diagram, which means you have to drop 2 (T-1) of the outputs. The second diagram has all 8 boxes/outputs, with padding leading in.
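
A minimal sketch of that pre-padding idea using Keras' own helper (assuming integer inputs; the 0 here stands in for the "neutral" padding value mentioned above, which is problem-specific):

from keras.preprocessing.sequence import pad_sequences

T = 3
sequence = [4, 7, 1, 9, 2, 5, 8, 6]                       # 8 items, as in the diagrams
padded = pad_sequences([sequence], maxlen=len(sequence) + T - 1,
                       padding='pre', value=0)            # prepend T-1 "neutral" values
print(padded)  # [[0 0 4 7 1 9 2 5 8 6]]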

If I understand correctly what you are saying, this applies only in the case of stateful RNNs.
I see the problem, and I hadn't thought about it. The idea with the padded data is nice; the only problem is how you create/find/define this "neutral" data, which depends on the problem.

In my case I pass each observation (sentence) in one go, so I don't have to deal with this problem.

I wouldn't say it applies only to stateful RNNs; it would apply to any sequence-to-sequence problem (which, as you say, isn't what your problem is) where the output sequence length is the same as (or longer than) the input sequence length; perhaps frame classification problems or word-by-word translation problems (such as the example in that blog I linked to earlier).

And you're right, "neutral" data is very problem specific; and I said "neutral" just for the fact that masking is not implemented for those two layers as far as I know, otherwise masking would definitely be the best.

You are right.
Thank you.

I am closing this.

I think it should be better to keep this open for a while.
Maybe someone else will have something interesting to add to the discussion...

@baziotis @mbollmann Thanks a lot for your clarification and complete discussion. I am also trying to implement attention. It is mostly the same as @mbollmann's but with a different H matrix. I hope it will work. I will ask questions if I get stuck.

Coming back to it... is there a way to make it work with masking?
The main problem is that the attention = Dense(1, activation='tanh')(activations) layer computes weights for every timestep, even for the zero-padded ones, and this results in _a lot_ of wasted time. Most of the time the length of the input series is around 10, but it can go as long as 70, so I have increased max_length in order to cover these long inputs.

Is there a way to make it work using the existing layers, or with a simple workaround, or can this only be done by implementing a custom layer for attention?

@baziotis It should work with masking like every other layer, i.e. calculate everything, then apply the mask by multiplication, setting the masked timesteps to zero. I'm not sure what the alternative would be -- computations are usually performed in batches via matrix multiplication, which kind of requires padding to make all samples have the same shape. Or am I misunderstanding something?

There are (were, as of a month ago) several layers that don't support masking. I think RepeatVector and Flatten were among those that don't support it.

Even if masking were supported, it would still calculate the N=70 timesteps each batch (even all of the 0's).

I'm not sure exactly what your code looks like now, but recurrent networks allow you to pass a sequence of any length at any time; in Keras, I think you have to set the input shape to (None, ...) instead of (70, ...) and you'll be able to feed it a batch with any number of timesteps. Then you can create a random batch, and pad (or, if it works, mask) the samples in the batch up to the length of the longest sequence in the batch, which is probably <70 long.

ex:

len(batch[0]) = 14
len(batch[1]) = 21
len(batch[2]) = 10

batch[0] -> pad to length 21
batch[1] -> no padding needed
batch[2] -> pad to length 21
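
A rough sketch of that batching scheme (hypothetical helper names; it assumes the model itself accepts variable-length input, which the Embedding discussion below complicates):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def batch_generator(sequences, labels, batch_size=128):
    # pad each batch only up to its own longest sequence, not a global max_length
    while True:
        for i in range(0, len(sequences), batch_size):
            batch_x = sequences[i:i + batch_size]
            batch_y = labels[i:i + batch_size]
            longest = max(len(s) for s in batch_x)
            yield pad_sequences(batch_x, maxlen=longest), np.asarray(batch_y)

# model.fit_generator(batch_generator(train_x, train_y),
#                     samples_per_epoch=len(train_x), nb_epoch=10)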

Sorry for the late reply... To be more specific, my problem is text (sentence) classification. The lengths of the sentences in my dataset follow a normal distribution, and I have set max_length to the length of the longest sentence. But this results in a lot of wasted computation, as most of the sentences have around 20-30 words.

@mbollmann I thought that what masking was doing was preventing the RNN from processing the padded timesteps. My problem was not about the model's performance but about its efficiency. The masked timesteps won't matter in the weighted sum (attention) anyway.

@patyork You are correct that the problem with masking has to do with the Flatten, Permute, and RepeatVector layers.
As you can see from all the code chunks that I have posted above, I nowhere explicitly define the input shape to the RNN. The problem begins with the Embedding layer, where I have to fill in the input_length param.

_embedding = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=trainable,
        mask_zero=masking,
        weights=[embeddings]
    )

But as soon as I set masking to True, I get an error that Flatten does not support masking.
Also, I don't understand which input shape I have to set to None. The Embedding's?

Edit: phrasing...

I admit, I haven't used the Embedding layer at all. From the docs though:

input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).

Theoretically, since it's an LSTM that follows, perhaps input_length could be None? This layer would generally output a shape (None, 70, embedding_size), but (None, None, embedding_size) is a valid shape for an LSTM to take, I think. (None, None, ...) is the shape that the TimeDistributed wrapper gives us, with batch_size=None and sequence_length=None.
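
A small check of that point, skipping the Embedding layer entirely (a sketch, not from the thread):

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

inp = Input(shape=(None, 300))   # timestep dimension left unspecified
out = LSTM(64)(inp)
m = Model(input=inp, output=out)
m.compile(optimizer='sgd', loss='mse')

print(m.predict(np.random.random((2, 10, 300))).shape)  # (2, 64)
print(m.predict(np.random.random((2, 70, 300))).shape)  # (2, 64)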

@patyork check this out: https://github.com/fchollet/keras/issues/1047#issuecomment-158793786
I can't have _actual_ variable-length inputs. I have to use padding, in which case I have to use masking, which doesn't work with Flatten or RepeatVector.

Also, from the Embedding layer doc:

mask_zero: Whether or not the input value 0 is a special "padding"
value that should be masked out.
This is useful for recurrent layers which may take
variable length input. If this is True then _all subsequent layers
in the model need to support masking or an exception will be raised_.
If mask_zero is set to True, as a consequence, index 0 cannot be
used in the vocabulary (input_dim should equal |vocabulary| + 2).

Update: Just to make sure, can someone please clarify something? In this simple case where we use masking:

model = Sequential()
model.add(embeddings_layer(masking=True))
model.add(LSTM(128))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')

is the RNN going to process the masked timesteps or not?

Yeah, I guess when you're using Embedding the input length can't be variable, and masking isn't supported, so you're back to just setting max_len=70 and padding.

Sorry, I'm confused again. (EDIT: resolved below.) @patyork, when you use Dense(1), doesn't that reduce the full input sequence to a single number? I wouldn't be surprised if this didn't improve anything then, since you're effectively just multiplying everything by a scalar. Shouldn't it be something like Dense(max_length) instead, since we want one number for each input timestep?

My understanding is that Dense(1) is applied to each timestep independently, squeezing the timestep vector to a single number with the tanh activation, and then the softmax is applied to the max_length Dense(1) outputs.
This model:

activations = LSTM(64, return_sequences=True, consume_less='mem')(words)
activations_weights = Dense(1, activation='tanh')(activations)
activations_weights = Flatten()(activations_weights)
activations_weights = Activation('softmax')(activations_weights)
activations_weights = RepeatVector(64)(activations_weights)
activations_weights = Permute([2, 1])(activations_weights)
activations_weighted = merge([activations, activations_weights], mode='mul')
sent_representation = Lambda(lambda x: K.sum(x, axis=-2), output_shape=(64,))(activations_weighted)

probabilities = Dense(classes)(sent_representation)
probabilities = Activation('softmax')(probabilities)

gives this:

lstm_1 (LSTM)                    (None, 50, 64)        67840       embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 50, 1)         65          lstm_1[0][0]                     
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 50)            0           dense_1[0][0]                    
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 50)            0           flatten_1[0][0]                  
____________________________________________________________________________________________________
repeatvector_1 (RepeatVector)    (None, 64, 50)        0           activation_1[0][0]               
____________________________________________________________________________________________________
permute_1 (Permute)              (None, 50, 64)        0           repeatvector_1[0][0]             
____________________________________________________________________________________________________
merge_1 (Merge)                  (None, 50, 64)        0           lstm_1[0][0]                     
                                                                   permute_1[0][0]                  
____________________________________________________________________________________________________
lambda_1 (Lambda)                (None, 64)            0           merge_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 3)             195         lambda_1[0][0]                   
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 3)             0           dense_2[0][0]                    

and changing to:

activations = LSTM(64, return_sequences=True, consume_less='mem')(words)
activations_weights = Dense(max_length, activation='tanh')(activations)
...

gives the following error:

ValueError: Only layers of same output shape can be merged using mul mode. Layer shapes: [(None, 50, 64), (None, 2500, 64)]

Sorry, my bad, the weekend made me forget what I'd only looked up last week in the Keras docs...

I'm still not sure what your concern is regarding the masked timesteps, by the way. My understanding is that masked timesteps should not affect the result of the computation, and everything else is an implementation detail... but not sure if that is what your questions refer to.

@mbollmann So this means that my understanding of how the Dense connects to the timesteps is correct?

Regarding the masking... my question was not about the correctness of the computations. It was about whether it is possible to avoid calculating weights for each of the padded inputs (words). Since I use the Embedding layer, it seems that this cannot be done.

My last concern (at least for now...) has to do with the TimeDistributed. You made it clear before that for my use case the correct use for the weight calculation is the Dense layer, but reading https://github.com/fchollet/keras/issues/1029 and https://groups.google.com/forum/#!topic/keras-users/suKYo6L1bSI and http://stackoverflow.com/questions/36812351/keras-attention-layer-over-lstm, where these guys use TimeDistributed(Dense(1)) instead of Dense(1) confused me again.

Edit: in https://github.com/fchollet/keras/issues/1029 they don't talk about attention, just about TimeDistributedDense and Dense in general. It's just that different things were written there and that made me have doubts.

@baziotis How the mask is handled (and therefore, whether masked timesteps are computed and then dropped, or not computed to begin with) depends on the implementation of each layer, and I'm not totally sure how RNNs handle it (the code is in the rnn function of theano_backend.py and tensorflow_backend.py, respectively). However, the computation of n padded sequences with equal length can be much faster than the computation of n sequences with differing lengths, and the result will be the same in any case, so I'd say it's nothing to worry about.

Regarding the Dense layer, after reading the docs once more and doing some experimentation of my own, I actually don't see how Dense and TimeDistributed(Dense) differ at all -- they appear to do the exact same thing:

from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model
import numpy as np

input_shape = (10, 128)
inputs = Input(shape=input_shape)
layer = Dense(64)
x = np.array([np.random.random(input_shape)])

# without TimeDistributed
output = layer(inputs)
model = Model(input=inputs, output=output)
model.compile(optimizer='sgd', loss='mse')
y1 = model.predict(x)

# with TimeDistributed
output = TimeDistributed(layer)(inputs)
model = Model(input=inputs, output=output)
model.compile(optimizer='sgd', loss='mse')
y2 = model.predict(x)

# => "True"
print(np.array_equal(y1, y2))

This confuses me a lot. Did the functionality of Dense change at some point? Does it make a difference for training?

Dense began to handle the Time dimension about a month ago. TimeDistributedDense is now deprecated and TimeDistributed(Dense) unnecessary.

So by using Dense, are the _same_ or _different_ weights applied to each timestep? See https://github.com/fchollet/keras/issues/1029#issuecomment-270289453.

Because if I am just scaling the timesteps, there is no attention...

A quick check shows the same weight shapes are present in Dense() and TimeDistributed(Dense), meaning the same weights of shape (48, 1) are used at each time step. @mbollmann's checks above imply that as well.
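
Something like the following would do that quick check (a sketch, not necessarily the exact code used):

from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

inputs = Input(shape=(50, 48))
out_a = Dense(1)(inputs)
out_b = TimeDistributed(Dense(1))(inputs)
m = Model(input=inputs, output=[out_a, out_b])

for layer in m.layers[1:]:
    print(layer.name, [w.shape for w in layer.get_weights()])
# both report a weight matrix of shape (48, 1) plus a bias of shape (1,)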

If you want/think it is necessary, you'd need to apply a Flatten() before the Dense, I guess.

I admit, I still don't think I fully understand attention, haha. I should have tried it back when we were first discussing it.

If I downgrade to a previous version, is there a layer which will provide me with the desired functionality (Dense() or TimeDistributed(Dense))?

Also:

class TimeDistributed(Wrapper):
    """This wrapper allows to apply a layer to every
    temporal slice of an input.

    The input should be at least 3D,
    and the dimension of index one will be considered to be
    the temporal dimension.

    Consider a batch of 32 samples, where each sample is a sequence of 10
    vectors of 16 dimensions. The batch input shape of the layer is then `(32, 10, 16)`
    (and the `input_shape`, not including the samples dimension, is `(10, 16)`).

    --> You can then use `TimeDistributed` to apply a `Dense` layer to each of the 10 timesteps, _independently_:
    ```python
        # as the first layer in a model
        model = Sequential()
        model.add(TimeDistributed(Dense(8), input_shape=(10, 16)))
        # now model.output_shape == (None, 10, 8)

        # subsequent layers: no need for input_shape
        model.add(TimeDistributed(Dense(32)))
        # now model.output_shape == (None, 10, 32)
...

@fchollet Can you please clear up the confusion? Which one applies different weights to each timestep?

Thanks @patyork, I didn't know that! That basically renders everything I said above invalid...

@baziotis You're definitely applying the same weights to each timestep, either way. That's what my code above demonstrates.

In that case, flattening the input before the Dense layer (as @patyork suggested) would be closer to what I had in mind -- that would make every index at every timestep have an individual weight:

activations = LSTM(64, return_sequences=True)(words)
activations_weights = Flatten()(activations)
activations_weights = Dense(max_length, activation='tanh')(activations_weights)
activations_weights = Activation('softmax')(activations_weights)
activations_weights = RepeatVector(64)(activations_weights)
activations_weights = Permute([2, 1])(activations_weights)

(Untested though.)

If you want to downgrade, you'd have to clone keras from some time before this commit from Dec 19th and install.

((Personally, I think it's doing what you want. The Dense layer outputs 70 items (one for each timestep, using the same weights); softmax is applied, making some timesteps more important, a bit of repeating/permuting and it applies the weights to the values from the activations. This is exactly what the other people were doing with the TimeDist wrapper on it.))

((Personally, I think it's doing what you want. The Dense layer outputs 70 items (one for each timestep, using the same weights); softmax is applied, making some timesteps more important, a bit of repeating/permuting and it applies the weights to the values from the activations. This is exactly what the other people were doing with the TimeDist wrapper on it.))

But if the weighting of a timestep only depends on that timestep's input and nothing else, I still don't understand how this could learn anything that just another Dense activation (without all that softmax and multiplication in-between) couldn't. It might be my insufficient understanding of (this particular type of) attention, but I guess that's unrelated to Keras now and we won't solve that particular problem here...

1 - @patyork So if I downgrade to 1.1.2 and then do this:

activations = LSTM(64, return_sequences=True, consume_less='mem')(words)
activations_weights = TimeDistributed(Dense(1, activation='tanh'))(activations) # TimeDistributed
activations_weights = Flatten()(activations_weights)
activations_weights = Activation('softmax')(activations_weights)
activations_weights = RepeatVector(64)(activations_weights)
activations_weights = Permute([2, 1])(activations_weights)
activations_weighted = merge([activations, activations_weights], mode='mul')
sent_representation = Lambda(lambda x: K.sum(x, axis=-2), output_shape=(64,))(activations_weighted)

will I have the desired behavior?

2 - I don't understand why Flatten will do the same thing. It will just convert the 2D tensor with the timesteps to one big 1D tensor, and then by applying the Dense I will have lost the distinction between the timesteps.
Here is what I mean:

lstm_1 (LSTM)                    (None, 50, 64)        67840       embedding_1[0][0]                
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 3200)          0           lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 50)            160050      flatten_1[0][0]                  
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 50)            0           dense_1[0][0]                    
____________________________________________________________________________________________________
repeatvector_1 (RepeatVector)    (None, 64, 50)        0           activation_1[0][0]               
____________________________________________________________________________________________________
permute_1 (Permute)              (None, 50, 64)        0           repeatvector_1[0][0]             
____________________________________________________________________________________________________
merge_1 (Merge)                  (None, 50, 64)        0           lstm_1[0][0]                     
                                                                   permute_1[0][0]                  
____________________________________________________________________________________________________
lambda_1 (Lambda)                (None, 64)            0           merge_1[0][0]                    
____________________________________________________________________________________________________

@baziotis

  1. No, TimeDistributed(Dense) hasn't changed and will do the same thing in 1.1.2.

  2. Yes, that's exactly what I was trying to achieve: having a different weight for each index at each timestep, which is exactly what happens if you flatten them into one big 1D tensor. Afterwards, we reshape the result to a 2D tensor again with essentially one weight per timestep. This is what I presumed the cited literature was doing, but I may be wrong (and in that case, confused, as I explained in my previous comment).

It's just the blind leading the blind here. I haven't the foggiest idea what the desired behavior is, so I won't comment.

  • All I can say is Dense() == TimeDistributedDense() == TimeDistributed(Dense()) as of December 19th; read any code/discussions previous to that date very carefully.
  • You can downgrade Keras if you don't like the Dense behavior

First of all, I am sorry if I wasn't clear, and I thank you for helping me so far.

What I want is just:
1 - get a scalar weight for each timestep
2 - do a weighted sum of the timesteps in order to get a "better" representation of the input (instead of just taking the last timestep or a simple average of the timesteps)

Without saying anything else and confusing you, from what I just said, how would you do it?
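
For concreteness, here is that target computation spelled out in plain NumPy (a sketch; it deliberately says nothing about how the per-timestep scores should be learned):

import numpy as np

T, units = 50, 64
H = np.random.random((T, units))             # RNN outputs h_1 ... h_T for one sentence

scores = np.random.random(T)                 # one scalar per timestep (however it is computed)
a = np.exp(scores) / np.exp(scores).sum()    # softmax: weights that sum to 1

sentence = (a[:, None] * H).sum(axis=0)      # weighted sum over timesteps
print(sentence.shape)                        # (64,)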

@baziotis Totally depends on how that scalar weight should be calculated.

If the weight of one timestep should be calculated from only that same timestep's input -- your comment here https://github.com/fchollet/keras/issues/4962#issuecomment-272888992

If the weight of one timestep should be calculated from all timesteps's input -- my comment here:
https://github.com/fchollet/keras/issues/4962#issuecomment-272919449

That's how I see it, and further than that, I agree that it's just the blind leading the blind now (and we'd need someone with experience with that particular type of attention to weigh in).

@mbollmann Just to be on the same page: when you said a hundred posts back that you had a different kind of attention in mind, did you mean the calculation in your comment here: https://github.com/fchollet/keras/issues/4962#issuecomment-272919449 ?

At least this way there is no ambiguity about what is going on...

@baziotis No, I'm working on attention in an encoder/decoder architecture. I have an encoder RNN that provides the activations (just as in your example), and a decoder RNN which calculates attention weights based on the input activations and its own current hidden state. I guess it's not really applicable to your scenario.

@mbollmann I imagine you mean something like Bahdanau et al. 2014. What I want is, up to a point, what you do during the encoding phase. See equations (5) and (6).

[image: equations from Bahdanau et al. 2014]

The context vector is the weighted sum of the hidden states (timesteps), right? How do you compute the weights for each h? Whether it is h or the concatenation [h_forward; h_backward] for the BRNN doesn't make a difference.

I had mistaken the outputs for the hidden states... This one does a weighted sum of the hidden states, not the outputs.

@baziotis I compute the attention weights from a combination of the h_{1..T} and s_{t-1}, essentially following this Xu et al. paper, Sec. 3.1.2, implemented in this gist I also linked to a few hundred comments back: https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

@mbollmann thanks for the code!

1) So Keras doesn't offer the ability to get the intermediate hidden states instead of the outputs? I have to subclass/extend Recurrent or LSTM/GRU in order to do so?

2) Also, is this the distinction between soft and hard attention? Soft = using hidden states, hard = using outputs?

@mbollmann I was looking at your other gists and I found this :)))
This may be exactly what I need. By reading the example I see here that you return the hidden states.
This means I can use them instead of the outputs for calculating the weights. Is that correct?

@baziotis Erm, no, there seems to be some kind of fundamental misunderstanding here.

  1. The point is that Keras processes input layer-by-layer, i.e. first all timesteps of the first layer, then all timesteps of the second, and so on... the point of my AttentionLSTM is that I want to calculate attention weights for each timestep in the same layer, based on the hidden state of the very same AttentionLSTM after each timestep. There's no way you could do this without writing a custom layer -- I can't perform the calculations in a separate layer before calling an LSTM since they're supposed to depend on the hidden states of that LSTM, and modify the behaviour of that very same LSTM.

  2. Not at all. Hard attention uses a probability distribution. I'm actually not too familiar with it, but there's an explanation in the Xu et al. paper and also on various deep learning blogs.

  3. Re the HiddenStateLSTM gist: The main point here is really to have inputs that set the hidden states. In Keras, the outputs of an LSTM are the hidden states -- there's no difference!

@mbollmann Thanks for clearing things up.
1) So this means hidden states == outputs == return_sequences=True?

Edit:
2) So in Keras, when we stack two RNNs, the second RNN takes as input the hidden states of the previous RNN?

3) From Bahdanau et al. 2014, in this image:
[image: the hidden states h_i]
what Keras returns with return_sequences=True is the h_i's?

@baziotis Yes to all.

I can really recommend studying the Keras code for these things, too; it's a little daunting at first but very insightful. In this case, you can read it off the step functions of the SimpleRNN/GRU/LSTM layers, which return a tuple (output, states), and all of them basically return the same thing for both.
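
A toy NumPy sketch of that point (not the Keras source; the weights here are random placeholders): the per-timestep "output" of an LSTM step is the hidden state h itself, which is also what gets carried forward to the next step.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    z = np.dot(x, W) + np.dot(h_prev, U) + b
    i, f, c_hat, o = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(c_hat)
    h = sigmoid(o) * np.tanh(c)
    return h, (h, c)  # the output IS the hidden state, also part of the carried state

units, dim = 4, 8
rng = np.random.RandomState(0)
W, U, b = rng.randn(dim, 4 * units), rng.randn(units, 4 * units), np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)

outputs = []
for x_t in rng.randn(5, dim):      # 5 timesteps
    out, (h, c) = lstm_step(x_t, h, c, W, U, b)
    outputs.append(out)            # these are the h_t's that return_sequences=True gives you
print(np.array(outputs).shape)     # (5, 4)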

@mbollmann Will do. Thanks again! :)

So I moved all the attention stuff into a custom layer:

from keras import backend as K
from keras import initializations
from keras.engine.topology import Layer


class Attention(Layer):
    def __init__(self, **kwargs):
        """
        Attention operation for temporal data.
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        """
        self.supports_masking = True
        self.init = initializations.get('glorot_uniform')
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],), name='{}_W'.format(self.name))
        self.b = K.ones((input_shape[1],), name='{}_b'.format(self.name))
        self.trainable_weights = [self.W, self.b]

        super(Attention, self).build(input_shape)

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]

Here is a super simple example:

_input = Input(shape=[max_length], dtype='int32')
words = embeddings_layer(max_length=max_length, embeddings=embeddings,
                         trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(64, return_sequences=True, consume_less='mem')(words)
sentence = Attention()(activations)

probabilities = Dense(classes)(sentence)
probabilities = Activation('softmax')(probabilities)

model = Model(input=_input, output=probabilities)
model.compile(optimizer=Adam(clipnorm=5.), loss='categorical_crossentropy')

It works, but only with masking=False. I added self.supports_masking = True, but when enabling masking in the Embedding layer, the final Dense layer gives an error:

ValueError: Layer dense_1 does not support masking, but was passed an input_mask: Elemwise{neq,no_inplace}.0

There are almost no examples of creating custom layers (this doesn't really show much), and the few blog posts that I have found are too simplistic. Fortunately, my layer doesn't have to do much.

In my case, what do I have to do to support masking?

@baziotis But it's not your layer that's the problem, it's the Dense layer (according to the error message). Does it work with TimeDistributed(Dense)? If so, that would seem like an oversight when they adapted Dense for 3d inputs maybe...

@mbollmann Why do I have to use TimeDistributed(Dense)? Look at the shape of the final Dense input. It is just a 1D tensor (per sample)... The attention layer just compresses the timesteps into a single vector.

Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_1 (InputLayer)             (None, 50)            0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 50, 300)       150000900   input_1[0][0]                    
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 50, 100)       160400      embedding_1[0][0]                
____________________________________________________________________________________________________
attention_1 (Attention)          (None, 100)           150         lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 3)             303         attention_1[0][0]                
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 3)             0           dense_1[0][0]                    
====================================================================================================

Also, I tried masking=True with just the LSTM (return_sequences=False) followed by the final Dense, and it works. So this means my layer should be doing something with the mask, but I don't know what...

@baziotis Sorry, my bad. I think your layer should probably discard the mask, since you can't logically have masked timesteps after squashing the input to 2d. Look into overriding compute_mask() of your layer to return None. I'm not 100% sure of all the consequences of this, though.

@mbollmann I was about to post that! Take a look at what I did.

class Attention(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        self.init = initializations.get('glorot_uniform')
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.init((input_shape[-1],), name='{}_W'.format(self.name))
        self.b = K.ones((input_shape[1],), name='{}_b'.format(self.name))
        self.trainable_weights = [self.W, self.b]

        super(Attention, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return input_shape[0], input_shape[-1]

I tried that and it works, but I need your opinion. I read the compute_mask(x, mask) function of the base Layer I'm overriding, and it looks like what was happening was that my layer was passing the mask on to the Dense layer, which makes no sense in my case.

Is this correct?

Also should i make any change in my call function?

    def call(self, x, mask=None):
        a = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(K.dot(a, self.u))
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

Update: is there a way to debug the call function at runtime? How can I inspect the value of x and verify that what I am doing is correct? I tried setting device=cpu and a breakpoint, but this doesn't work (it must have to do with the function being symbolic; I don't know exactly how this works...). My main concern is whether I have to change anything in call(). Other than that, things look good and I think I am getting better results now...

@patyork i would like to also hear your opinion.

@baziotis Yes to the compute_mask() thing, that's what I was getting at.

As to your call() function, I'm not sure what the point of the additional K.dot(a, self.u) operation is. For a straight equivalence to your previous code, it shouldn't be there I think. But as I'm sure you know by now, there are many ways to approach the same basic idea... :)

@mbollmann oops, that is from another layer :P (look at Yang et al.), I made a mistake... This is how it is:

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W) + self.b)
        ai = K.exp(eij)
        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

My concern is masking. If the masking thing is correct i think we are good.

Update 1: Also, regarding the zeroed x's, how can I compute K.dot(x, self.W) only on the non-zero x's?
Update 2: What do you think of K.dot(K.not_equal(x, 0), self.W)? Would that be correct? It returns a bool tensor, so probably not, but I think it would be better if I could somehow compute the product only on the non-padded timesteps.

@baziotis curious - did you finally get this attention mechanism working on your classification task? I've been reading through, but I haven't seen anyone mention that it worked qualitatively :) only that it worked (as in, the code ran!).

/cc @patyork and @mbollmann - have you guys posted an attention layer anywhere as well?

@viksit I have not settled on a final version of the layer (I am having some problems with masking), but I'll tell you what I have observed so far: I see a clear but _not big_ improvement.

The most important thing I have observed is related to sequence length. For short sequences, attention and no attention give about the same results. But as I increase the length, the RNN without attention starts degrading (it cannot remember very long-term dependencies) while the RNN+attention keeps giving the same results. So this is a big plus, I think, especially in my case where sentences go up to 50 words. Basically this is the same observation as in Bahdanau et al. - see Figure 2.

My concern now is if this is correct:

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b

        ai = K.exp(eij)

        # apply mask
        if mask is not None:
            ait = K.cast(mask, 'float32')
            ait *= mask

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

I tried applying the mask directly to x, which is what I thought I should do, and I got a dimension mismatch error. I am confused about the dimensions. But this works and the network trains fine.
I see no obvious difference after including the mask in the calculation, but maybe I am doing something wrong.

I will post the 2 versions of my attention layer when i am finished.

@baziotis ait *= mask just multiplies the mask with itself, no? I.e. your code is not actually using it anywhere?

@mbollmann you are right, this is so embarrassing...

I've been staring at the screen for hours and I missed it. No wonder I was seeing no difference in the results.
This is what I meant to do:

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b

        ai = K.exp(eij)

        # apply mask
        if mask is not None:
            mask = K.cast(mask, 'float32')
            ai *= mask

        weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
        weighted_input = x * weights.dimshuffle(0, 1, 'x')
        return weighted_input.sum(axis=1)

And what happens is that after a certain point I get NaNs in the loss.

Update:
As was pointed out here, the problem with the NaNs has to do with the way I did the softmax calculation (with the exp). I replaced it with K.softmax.

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b

        # apply mask
        if mask is not None:
            mask = K.cast(mask, 'float32')
            eij *= mask

        a = K.softmax(eij)
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

This is stable and the network trains. But the problem only appeared after I applied the mask. Is this the right way to apply the mask?
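(For comparison, a common alternative, just a sketch and not tested here, is to mask the attention scores additively before the softmax, so that padded timesteps end up with effectively zero weight instead of having their scores multiplied by the mask:)

    def call(self, x, mask=None):
        # x: (samples, steps, features)
        eij = K.tanh(K.dot(x, self.W))
        if self.bias:
            eij += self.b

        # push padded timesteps towards -inf so the softmax gives them ~0 weight
        if mask is not None:
            eij += (1.0 - K.cast(mask, 'float32')) * -1e9

        a = K.softmax(eij)                    # (samples, steps)
        a = K.expand_dims(a)                  # (samples, steps, 1)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)  # (samples, features)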

Just an update. I have ended up with these two Layers:
Attention: https://gist.github.com/cbaziotis/6428df359af27d58078ca5ed9792bd6d
AttentionWithContext: https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2

Any comments are welcome...

Hey @baziotis :) Thank you so much for your work. This is just what I was looking for. I have some trouble getting your layers to work on my machine. Using Theano they compile, but in my model and system using Theano as a backend is too slow.

Trying to run it with Tensorflow results in the following crash:

  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/models.py", line 327, in add
    output_tensor = layer(self.outputs[0])
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "~/code/rorschach/prediction/layer/attention_layer.py", line 66, in call
    eij = K.tanh(K.dot(x, self.W))
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 799, in dot
    y_permute_dim = [y_permute_dim.pop(-2)] + y_permute_dim
IndexError: pop index out of range

Currently using Keras 1.2.0 and I've tried both Tensorflow 0.11.0 and 0.12.1 without luck.

@OptimusCrime I am using Theano as a backend and have not experienced any slowdowns. Are you sure that the reason for the slowdowns is the attention layers?

BTW, in TensorFlow, if you are using the AttentionWithContext layer, the dot doesn't work, as is pointed out here, so what you have to do is:

Replace this:

uit = K.dot(x, self.W)

if self.bias:
    uit += self.b

uit = K.tanh(uit)
ait = K.dot(uit, self.u) # replace this

a = K.exp(ait)

With this:

uit = K.dot(x, self.W)

if self.bias:
    uit += self.b

uit = K.tanh(uit)

mul_a = uit  * self.u # with this
ait = K.sum(mul_a, axis=2) # and this

a = K.exp(ait)

Also please look at the updated gists as i have updated them with a fix.

Hello. Thanks for the code @cbaziotis.
I was having the same problem and now it works with no errors. But there is still a problem with the output dimensions. I tried this:

inputs = [[[0,0,0],[0,0,0],[0,0,0],[0,0,0]],[[1,2,3],[4,5,6],[7,8,9],[10,11,12]],[[10,20,30],[40,50,60],[70,80,90],[100,110,120]]]

hidden_size = 6
sent_size = 4
doc_size = 3

model = Sequential()
model.add(LSTM(hidden_size,input_shape = (sent_size,doc_size),return_sequences = True))
model.add(AttentionWithContext())

print "First layer:"
intermediate_layer_model = Model(input=model.input,output=model.layers[0].output)
print intermediate_layer_model.predict(inputs)
print ""
print "Second layer:"
intermediate_layer_model = Model(input=model.input,output=model.layers[1].output)
print intermediate_layer_model.predict(inputs)

and it is giving me this result:

First layer:
[[[ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]]

 [[ 0.04093511 -0.00982957 -0.          0.25834009 -0.39604828 -0.169927  ]
  [ 0.         -0.         -0.          0.68305802 -0.73000526 -0.1271846 ]
  [ 0.         -0.         -0.          0.79648596 -0.83882242 -0.        ]
  [ 0.         -0.         -0.          0.79895407 -0.79928428 -0.        ]]

 [[ 0.          0.         -0.          0.23120573 -0.76159418 -0.32464135]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]]]

Second layer:
[[ 0.          0.          0.          0.          0.          0.        ]
 [ 0.00770082 -0.00184916  0.          0.66687739 -0.71645236 -0.06456213]
 [ 0.          0.          0.          0.68619043 -0.76159418 -0.04615331]]

Shouldn't the Attention output have dimensions (samples, features), which in this case would be (3, 4)?

No. All the attention layer does is compute a weighted sum of the outputs of the RNN.
In your case, for example:

  1. The first layer outputs 3 tensors of shape (4, 6).
  2. The weighted sum of a (4, 6) tensor is a (1, 6) tensor (a 6-dimensional vector): we compress each column, _not_ each row.
  3. So after the second layer you have a (3, 6) tensor, which is correct.
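To make the shape arithmetic concrete, here is a quick numpy check (illustrative values only, not the layer code itself):

    import numpy as np

    steps, features = 4, 6
    h = np.random.rand(steps, features)       # RNN outputs for one sample: (4, 6)
    a = np.random.rand(steps)
    a = a / a.sum()                            # attention weights over the 4 timesteps

    sentence = (h * a[:, None]).sum(axis=0)    # weighted sum over the timesteps
    print(sentence.shape)                      # (6,) -> one 6-dimensional vector per sample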

Hi guys,
I have the following model to correct an input sentence (as one-hot vectors) that is not in the standard English vocabulary. How can I introduce an attention mechanism into the model, so that the output is the relevant information that gives the sentence a meaning?

hiddenStateSize = 256
hiddenLayerSize = 256
model = Sequential()

The outputs of the LSTM layer are the hidden states of the LSTM for every time step.

model.add(GRU(hiddenStateSize, return_sequences = True, input_shape=(maxSequenceLength, len(char_2_id))))
model.add(Dense(1, activation='tanh'))
model.add(Flatten())
model.add(Activation('softmax'))

#

I got stuck from this moment

#

model.add(TimeDistributed(Dense(hiddenLayerSize)))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(Dense(len(char_2_id))))
model.add(TimeDistributed(Activation('softmax')))

----SGD-------

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

compile the model

%time model.compile(loss='categorical_crossentropy', optimizer = sgd , metrics=['accuracy'])

@cbaziotis I've been using your AttentionWithContext code at https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2

For some reason the output shape is wrong. See the model.summary() output below:

Layer (type)                 Output Shape              Param #
=================================================================
text_input (InputLayer)      (None, 100)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 100)          2361000
_________________________________________________________________
masking_1 (Masking)          (None, 100, 100)          0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 256)          175872
_________________________________________________________________
attention_with_context_1 (At (None, 100, 256)          66048
_________________________________________________________________
output (Dense)               (None, 100, 34)           8738
=================================================================

Shouldn't attention_with_context_1 have an output shape of (None, 256), as listed in the documentation for your layer? It should output a 2D tensor of shape (samples, features). The peculiar thing is that when I retrieve the layer and get its output, it shows the correct shape:

>>> att_layer.output
<tf.Tensor 'attention_with_context_1/Sum_2:0' shape=(?, 256) dtype=float32>
>>> # but this returns the wrong shape
>>> att_layer.output_shape
(None, 100, 256)

Any ideas?

@cbaziotis Found the issue. It turns out that if you write a custom layer that modifies the input shape, you need a compute_output_shape method. See here for a fork that now works.

>>> model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
text_input (InputLayer)      (None, 100)               0
_________________________________________________________________
embedding_2 (Embedding)      (None, 100, 100)          2361000
_________________________________________________________________
masking_4 (Masking)          (None, 100, 100)          0
_________________________________________________________________
bidirectional_5 (Bidirection (None, 100, 256)          175872
_________________________________________________________________
attention_with_context_4 (At (None, 256)               66048
_________________________________________________________________
output (Dense)               (None, 34)                8738
=================================================================
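(For reference, in Keras 2 the method that used to be get_output_shape_for is called compute_output_shape; a minimal sketch for a layer that collapses the timestep axis looks roughly like this:)

    def compute_output_shape(self, input_shape):
        # (samples, steps, features) -> (samples, features)
        return (input_shape[0], input_shape[-1])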

OK, I did not read the whole discussion, but Zhou, Peng, et al. say H is a matrix where every column has the dimensionality of a word vector. Why is that? I think it should be the number of units of the LSTM layer, which can of course be chosen to be the same as the word-vector dimensionality, but it does not have to be.

Hey, have a look at this repo:

https://github.com/philipperemy/keras-attention-mechanism

It shows how to build an attention module on top of a recurrent layer.

Thanks

@philipperemy I tested your approach. Indeed you can learn an attention vector, but testing across a suite of contrived problems, I see the model is just as skillful as a plain Dense + LSTM combination. Attention is an optimization that should lift skill or decrease training time for the same skill. Perhaps you have an example where your approach is more skillful than a straight Dense + LSTM setup with the same resources?

@cbaziotis After testing, I believe your attention method is something new/different inspired by Bahdanau, et al. [1]. It does not appear skillful on contrived problems either. Perhaps you have a good demonstration of where it does do well?

@mbollmann is correct as far as I can tell. The attention approach of Bahdanau, et al. requires access to the decoder hidden state (decoder output) of the last time step in order to compute the current time step (s_i-1 in the paper). This is unavailable unless you write your own layer and access it.

[1] https://arxiv.org/pdf/1409.0473.pdf
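Concretely, the alignment score in [1] depends on the previous decoder state s_{i-1} as well as the encoder annotation h_j:

    e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j), \qquad
    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}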

@jbrownlee
Would it be possible to share some of these 'test' case contrived problems? It would be extremely helpful in terms of debugging and evaluating the efficacy of various attention implementations.

@cbaziotis, how will the above attention mechanism work for the imdb example in Keras? The input size is (5000, 80) (max_length=80) and the output is (5000,). This is the model for training:
```
input_ = Input(shape=(80,), dtype='float32')
print (input_.get_shape()) #(?, 80)
input_embed = Embedding(max_features, 128 ,input_length=80)(input_)
print (input_embed.get_shape()) #(?, 80, 128)

activations = LSTM(64, return_sequences=True)(input_embed)
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(64)(attention)
attention = Permute([2, 1])(attention)  
print (activations.get_shape())                   #(?, ?, 64)
print (attention.get_shape())                     #(?, ?, 64)

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda x_train: K.sum(x_train, axis=1), output_shape=(5000,))(sent_representation)
print (sent_representation.get_shape())           #(?, 64)
probabilities = Dense(1, activation='softmax')(sent_representation)      #Expected (5000,)
model = Model(inputs=input_, outputs=probabilities)
model.summary()
```

Error: ValueError: Dimensions must be equal, but are 64 and 5000 for 'dense_2/MatMul' (op: 'MatMul') with input shapes: [?,64], [5000,1].

Hi, @cbaziotis Thanks for your code.
Since you did not apply any special treatment to the padded words, I am wondering whether the attention mechanism will assign the correct weights (close to zero) to the padded words.

If you read carefully you will see that i have posted the updated versions of the layers. Here you go:

model.add(LSTM(64, return_sequences=True))
model.add(AttentionWithContext())
# next add a Dense layer (for classification/regression) or whatever...
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...

And as I said, the layers take the mask into account.
Edit: also note that I have not tested them with Keras 2, but I imagine you will need to make some minor syntactic changes.

Does attention+LSTM improve the accuracy of text classification? On my dataset, I find that there is no difference compared with mean pooling + LSTM.

@cbaziotis I have a query regarding the attention:
activations=LSTM(neu,activation='relu',return_sequences=True,return_state=True)(inputs)
This statement applies attention to the output of the LSTM. Does that mean it applies to h (the hidden state), where h_t = o_t * tanh(c_t)?

I read somewhere, that in
activations,hh,cc=LSTM(neu,activation='relu',return_sequences=True,return_state=True)(inputs)

hh is the hidden state and cc is the cell state. Are hh and cc the final hidden and cell states?

Also, what is the difference between Attention and AttentionWithContext?

@Ravin0512 Any updates?

@Ravin0512 I recently made this tool: https://github.com/cbaziotis/neat-vision

Just make sure the attention layer returns the attention scores in addition to the final representation of the sentence.

@cbaziotis As for sharing the weights across time-steps, I think it is fine. Even Andrew Ng's Sequence Models course has a shared-weight implementation.

  1. Can one make the attention model shorter by using the dot function of keras.layers?

inputs=Input(shape=(input_len,))
embedded=Embedding(input_dim, embedding_dim)(inputs)
activation=LSTM(hidden_dim, return_sequences=True)(embedded)
attention=TimeDistributed(Dense(1,use_bias=False, activation='linear'))(activation)
attention=Flatten()(attention)
attention=Activation('softmax')(attention)
representation=dot([attention,activation],axes=1)

Isn't it the same as the long version?
attention=TimeDistributed(Dense(1,use_bias=False, activation='linear'))(activation)
attention=Flatten()(attention)
attention=Activation('softmax')(attention)
attention=RepeatVector(self.hidden_dim)(attention)
attention=Permute([2,1])(attention)
activation=multiply([attention,activation])
representation=Lambda(lambda x: K.sum(x,axis=1))(activation)

The dot function contracts the tensors along axis=1: r = sum_t a_t * h_t.

The Dense layer for the attention scores shouldn't have a bias, since according to Zhou the weights act only on the components of the hidden states; furthermore, in Zhou's model a linear activation is enough.

As far as I understood, the attention Dense layer has to be time-distributed: because the weights act on the hidden-state components, they play more or less the same role as the matrices in the recurrent layer, which all share their weights over time.

The time dependence of the attention factors arises from the differences between the hidden states (the components differ, and therefore alpha_t = softmax(w^T h_t) differs).

@Ravin0512 I just found an ugly method.
First you need to define a function that returns the output of the layer just before your attention layer (here the attention layer is the fourth layer).
sent_before_att = K.function([sent_model.layers[0].input, K.learning_phase()], [sent_model.layers[2].output])
Then take out the attention layer's weights.
sent_att_w = sent_model.layers[3].get_weights()
And use the sent_before_att function to get the vectors coming out of the layer before the attention layer.
sent_each_att = sent_before_att([sentence, 0])
In addition, you need to define a function to calculate the attention weights, here named cal_att_weights; you can use numpy to reproduce the same computation your attention layer defines.
Finally, sent_each_att is the attention weight you want.
sent_each_att = cal_att_weights(sent_each_att, sent_att_w)

@cbaziotis the best attention visualization tool I have ever seen 👍

I want a regression output with an attention LSTM.

I tried this:
def Attention_LSTM(self):

    _input = Input(shape=(self.seq_length, self.feature_length,))

    LSTM_layer = LSTM(self.n_hidden, return_sequences=True)(_input)

    # Attention layer
    attention = TimeDistributed(Dense(1, activation='tanh'))(LSTM_layer)
    attention = Flatten()(attention)
    attention = Activation('softmax')(attention)
    attention = RepeatVector(self.n_hidden)(attention)
    attention = Permute([2,1])(attention)

    #sent_representation = merge([LSTM_layer, attention], mode='mul')
    sent_representation = multiply([LSTM_layer, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation)

    probabilities = TimeDistributed(Dense(1, activation='sigmoid'))(sent_representation)

    model = Model(inputs=_input, outputs=probabilities)
    return model

but gives the following error:

assert len(input_shape) >= 3
AssertionError

my understanding may be inadequate...

Sorry, I made an error: the Activation and Flatten had to be swapped; doing Flatten first and then Activation('softmax') fixed it.

I tested my version, and it worked as far as I could see.

Here is the graph of an example with a 1-layer GRU and next-word prediction with attention, including shapes for clarification:
sequence length=20,
hidden_dim=128,
embedding_dim=32,
vocabulary_size=397

(For real language processing, stacked LSTMs instead of GRUs and larger hidden_dims and embedding_dims are typically used; this is only a toy example.)

[model graph image: next_errlog_layr1_slen20_hdim128_edim32_attn1_graph]

Hi, @stevewyl -- what is inside that cal_att_weights call?
I'm following this post to detect the weights per word in an inputted test text. It implements the attentive layer from @cbaziotis and then tacks on that cal_att_weights method to inspect the weights per word.
The dimensions of the weight array I get back are correct, but the weights themselves are crazy small -- all of them hover around 0.0000009.
Does this calculation step look correct to you?

def cal_att_weights(output, att_w):
    eij = np.tanh(np.dot(output[0], att_w[0]) + att_w[1])
    eij = np.dot(eij, att_w[2])
    eij = eij.reshape((eij.shape[0], eij.shape[1]))
    ai = np.exp(eij)
    weights = ai / np.sum(ai)
    return weights

attention = Flatten()(attention)
For this line I am getting the error:
Layer flatten_4 does not support masking, but was passed an input_mask: Tensor("time_distributed_6/Reshape_3:0", shape=(None, None), dtype=bool)

Hello all,

I am trying to use attention on top of a BiLSTM in TensorFlow 2.
Also, I am using pretrained word embeddings.

my model is the following:

units=250
EMBEDDING_DIM=310
MAX_LENGTH_PER_SENTENCE=65
encoder_input = keras.Input(shape=(MAX_LENGTH_PER_SENTENCE))
x =layers.Embedding(input_dim=len(embedding_matrix), output_dim=EMBEDDING_DIM, input_length=MAX_LENGTH_PER_SENTENCE,
                              weights=[embedding_matrix],
                              trainable=False)(encoder_input)

activations =layers.Bidirectional(tf.keras.layers.LSTM(units))(x)
activations = layers.Dropout(0.5)(activations)

attention=layers.Dense(1, activation='tanh')(activations)
attention=layers.Flatten()(attention)
attention=layers.Activation('softmax')(attention)
attention=layers.RepeatVector(units*2)(attention)
attention=layers.Permute((2, 1))(attention)

sent_representation = layers.Multiply()([activations, attention])
sent_representation = layers.Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units*2,))(sent_representation)

sent_representation = layers.Dropout(0.5)(sent_representation)

probabilities = layers.Dense(4, activation='softmax')(sent_representation)


encoder = keras.Model(inputs=[encoder_input], outputs=[probabilities],name='encoder')
encoder.summary()

Could you please let me know if my implementation is correct?
What worries me is that the results with the attention model show no improvement.

Thanks in advance.

Hey everyone. I saw that everyone adds a Dense() layer in their custom attention layer, which I think isn't needed.

[attention diagram from the tutorial]

This is an image from a tutorial here. Here, we are just multiplying two vectors and then doing several operations on those vectors only. So what is the need for a Dense() layer? Is the tutorial on 'how does attention work' wrong?
