Hi,
I'm still working on sequence to sequence learning using an encoder and decoder architecture, and I'd like to implement a different approach than what I've mostly seen with keras.
The common approach seems to be to encode the inputs using one RNN with return_sequences=False, take that vector, repeat it output_maxlen times, and stack a decoder on top of that.
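For reference, a minimal sketch of that common approach might look like this (layer sizes and variable names are just placeholders):

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

# placeholder dimensions
input_maxlen, output_maxlen = 20, 20
input_dim, num_classes, hidden_size = 50, 100, 128

model = Sequential()
# encoder: compress the whole input sequence into a single vector
model.add(LSTM(hidden_size, input_shape=(input_maxlen, input_dim)))
# repeat that vector once per output timestep
model.add(RepeatVector(output_maxlen))
# decoder: unrolls over the repeated vector
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(num_classes, activation='softmax')))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')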
This is a fine approach, but from what I understand it's a bit different from some of the literature (like Sutskever's Seq2Seq model). What I'd actually like to do is this:
This would require essentially two compiled models: for training, we input all the ground-truth outputs (shifted by one index) to train the decoder; during testing, we need the decoder to be stateful, feed the input encoding into its hidden state and a <START> tag as the first input, and then feed it each of its previous predictions one at a time.
All of this should be possible to implement with Keras from what I see, except for the part where we condition the decoder by feeding it an initial hidden state. Is there a way to do this in Keras? I could feed it the hidden state as the first input instead, but this is not ideal since the input vectors are tokens (or even characters) after that, so they sort of occupy a different space, conceptually.
Another important point: I need to specify the state symbolically, so that the entire model stays trainable.
My best guess currently is to set the decoder to be stateful even in training, set the rnn.states variable to the symbolic output of the previous layer, and reset the state after each sample. Should this work?
I'm not sure about your first post. Are you saying that the first input would just be the conditioning vector?
That could work but it's not the same as the first hidden state being the vector, because it would be addressed with a different weight matrix.
a hacky solution I've used for similar situations:
One super easy way to accomplish this is to subclass whichever RNN (LSTM, GRU, whatever) and have the call(...) take as input a tupled x. So, when you call it, it becomes rnn([x_input, x_initial]), where x_input is the actual data and x_initial is going to be used to override the initial states.
now, in the rnn's call function:
# first, unpack if you need to
if isinstance(x, (tuple, list)):
    x, custom_initial = x
else:
    custom_initial = None

# then set the custom initial
if self.stateful and custom_initial:
    raise Exception('some message here about how this is bad')
elif custom_initial:
    # the bit we care about
    initial_states = custom_initial
elif self.stateful:
    initial_states = self.states
else:
    initial_states = self.get_initial_states(x)
the thing you need to take care of with this approach: you have to adjust all other instances where the input is fed into the layer to account for the tupled input. I think compute_mask and get_output_shape_for will be the two functions you need to adjust.
the major reason behind doing it this way is that it adds the conditioning vector you want to be your initial state into Keras' computational graph. This saves a ton of headaches.
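To make that concrete, here is a rough sketch of such a subclass, modeled on the Keras 1.x Recurrent.call. The class name is made up, internals differ slightly between Keras versions, and the gists linked later in this thread are the actually tested versions:

from keras import backend as K
from keras.layers import LSTM

class ConditionalLSTM(LSTM):
    """LSTM that can be called as layer([x, h_0, c_0]) to override its initial states."""

    def build(self, input_shape):
        # when called with a list, only the first entry is the real input shape
        if isinstance(input_shape, list):
            input_shape = input_shape[0]
        super(ConditionalLSTM, self).build(input_shape)

    def call(self, x, mask=None):
        # unpack the "tupled" input
        if isinstance(x, (tuple, list)):
            x, custom_initial = x[0], list(x[1:])
            if isinstance(mask, list):
                mask = mask[0]
        else:
            custom_initial = None

        input_shape = self.input_spec[0].shape
        if self.stateful and custom_initial:
            raise Exception('Custom initial states are not compatible with stateful=True')
        elif custom_initial:
            # the bit we care about: condition on the encoder output
            initial_states = custom_initial
        elif self.stateful:
            initial_states = self.states
        else:
            initial_states = self.get_initial_states(x)

        # the rest mirrors the stock Recurrent.call (Keras 1.x)
        constants = self.get_constants(x)
        preprocessed_input = self.preprocess_input(x)
        last_output, outputs, states = K.rnn(self.step, preprocessed_input,
                                             initial_states,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             constants=constants,
                                             unroll=self.unroll,
                                             input_length=input_shape[1])
        if self.stateful:
            self.updates = []
            for i in range(len(states)):
                self.updates.append((self.states[i], states[i]))

        if self.return_sequences:
            return outputs
        return last_output

    def compute_mask(self, input, mask):
        # only the mask of the actual data input matters
        if isinstance(mask, list):
            mask = mask[0]
        if self.return_sequences:
            return mask
        return None

    def get_output_shape_for(self, input_shape):
        if isinstance(input_shape, list):
            input_shape = input_shape[0]
        return super(ConditionalLSTM, self).get_output_shape_for(input_shape)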
Cool, thanks - that looks like what I'm looking for. I notice you're raising an exception on purpose - any particular reason this wouldn't work with a stateful model? Rather, is there any way to do prediction with this _without_ a stateful model?
Also, you say this adds the conditioning vector into the computational graph. Since the conditioning vector should actually be a symbolic variable (output of the encoder), does that mean this approach lets me train the whole encoder-decoder model's weights? That would of course be important.
A stateful model would preclude having an initial hidden state because it would be the hidden state.
I didn't include the exception because I'm lazy, haha. Also, I'm presuming this would be for personal use, so the exception should be thrown in there for your own sanity, not to give guidelines to users.
edit: the nth pass of a stateful model where n > 0 would preclude it. The initial hidden state of a stateful model is the last hidden state from the last pass.
What about after a state reset? There must still be _some_ initial hidden state, why can't we set it?
my best advice is to check the code. all of those questions are in the class definition. and you'll get a far better understanding of what's going on and how to mold it to your situation.
the hidden state is currently not able to be set because no one has written the code to do it. Or they have, but haven't PRed it. It gets initialized to None when it's not stateful. Normal use cases never call reset_states.
also, you're not talking about setting the initial hidden state. you're talking about linking it into the computational graph. That is a different situation. If you want to see what that means, I would start with keras.engine.topology
and the Layer.__call__
. Keras has its own computational graph to give it a lot of the functionality it has. this is why I recommended just spoofing the inputs as a fast and easy workaround.
Fair enough, thanks for your help. I think my test-time problem can be solved by simply explicitly setting the state variable in a stateful RNN in the beginning and after a reset (no need for symbolic links there, since we're only evaluating). I'm also looking into the seq2seq package to see if I can manage to port it to Keras 1.0.4; that would also provide many more tools to play with.
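For the test-time part, something along these lines should work, assuming a stateful decoder LSTM in an already-built model (the names are placeholders):

import numpy as np
from keras import backend as K

# `decoder` is assumed to be a stateful LSTM layer from an already-built model,
# `encoding` a numpy array of shape (batch_size, hidden_dim) from the encoder
def prime_decoder(decoder, encoding):
    decoder.reset_states()
    # LSTM keeps two states (h and c); here both are seeded with the encoding
    K.set_value(decoder.states[0], encoding)
    K.set_value(decoder.states[1], encoding)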
So in seq2seq, there's this StateTransferLSTM layer which I've ported to Keras 1.0.4, and which claims to transfer its hidden state to other LSTM layers you have specified.
It does that by manipulating the target layer's updates
variable, which I tried to investigate in the Keras code, but it's still not completely clear to me what this actually does. In particular, this seems completely different from the suggestion by @braingineer, and particularly the comment about there being a difference between just _setting_ the state and making it part of the _computational graph_ makes me wonder if the seq2seq implementation is actually correctly doing what it advertises to do. (I do realize that broadcasting state from one layer to another and setting an arbitrary initial state is not exactly the same thing, though.)
(Also, I've checked out Layer.__call__
but I don't know what I'm seeing. My current idea of the Keras computational graph is that it's basically building an abstract representation of all the computations the model has to do, so the backend can thoroughly optimize them for efficient calculation, but I'm not sure if this is accurate. Also, it maybe strays a bit far from the original question...)
@mbollmann Could you point me to your port of the StateTransferLSTM? I'm trying to port it myself, but there seems to be trouble with the order in which my layers are being built inside the seq2seq model.
As to whether seq2seq works correctly: I'd say if it learns when trained, it probably does, since a missing connection in the computational graph would make backprop fail to work correctly (though there might of course just be another connection that keeps things together).
I believe I've done nothing but port the get_output
function to a call
function as described in the Keras migration docs. The code currently lives in a private repo, but here's a gist of the StateTransferLSTM:
https://gist.github.com/mbollmann/2d38bd38259a03ea83999de32dbe3466
I've successfully used this with a model similar to the SimpleSeq2seq
one, though I've not tried the more advanced models of the seq2seq repo with it.
@braingineer Can I ask you a question? You mentioned taking care of the input for functions like compute_mask() and get_output_shape_for(). I was thinking about modifying functions that have x as input, like step(self, x, states) and get_constants(self, x), etc. I am very new to Theano and Keras and I am trying to understand this. Could you explain how to know which functions to modify? If I create a class Decoder(Recurrent), do I need to write the functions the way they are used in the LSTM class, or do I also need to override functions that are only in the base class Recurrent?
Many thanks for your time!
Just a little update from me:
I managed to build a custom layer derived from LSTM that takes two inputs (actual input + initial hidden state) and returns two outputs (actual output + last hidden state). This way, I was able to build an encoder--decoder model where the only connection between encoder and decoder was the hidden state transfer, and successfully train it in a way that simply wasn't possible with the StateTransferLSTM from seq2seq (on my task, I get 3% accuracy with seq2seq's StateTransferLSTM using broadcast_state()
, and 93%+ with my own custom LSTM). I believe that adding the transferred hidden state to the computational graph is the crucial difference here.
@braingineer, your explanations in this thread were crucial for me, I probably wouldn't have managed to do this without them! Thanks a lot!
@mbollmann Any chance you could share that layer code? It could be very useful for many people, for me at the very least 馃憤
@phdowling Here you go:
https://gist.github.com/mbollmann/29fd21931820c64095617125824ea246
That implementation might be quite hacky and possibly incomplete/buggy, but it works fine for my use case. LSTM actually keeps two internal states, and right now I transfer them both -- I'm not sure if this is what you'd want, but it didn't seem to make much of a difference.
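For anyone reading along: based on the interface described here (and used further down in this thread), a hypothetical usage sketch could look like this. HiddenStateLSTM is the class from the gist above, and all sizes are placeholders:

from keras.models import Model
from keras.layers import Input, Embedding, TimeDistributed, Dense
# HiddenStateLSTM: the custom layer from the gist linked above

# placeholder sizes
in_seq_len, dec_seq_len, num_chars = 20, 20, 100

enc_in = Input(shape=(in_seq_len,), dtype='int32')
dec_in = Input(shape=(dec_seq_len,), dtype='int32')
enc_emb = Embedding(num_chars, 64)(enc_in)
dec_emb = Embedding(num_chars, 64)(dec_in)

# encoder: keep only its final hidden and cell states
_, enc_h, enc_c = HiddenStateLSTM(64)(enc_emb)
# decoder: regular input plus the encoder states as extra inputs
dec_out, _, _ = HiddenStateLSTM(64, return_sequences=True)([dec_emb, enc_h, enc_c])
preds = TimeDistributed(Dense(num_chars, activation='softmax'))(dec_out)

model = Model(input=[enc_in, dec_in], output=preds)
model.compile(optimizer='adam', loss='categorical_crossentropy')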
Also, note that I have my own (very hacky and inefficient) decoding functions built around these models, which take care of feeding the predicted output of one timestep as input to the next one. (Basically, for _n_ timesteps, I'm just doing _n_ separate predictions, each one a timestep longer than the previous one...) You'll probably have to do _something_ to that effect to efficiently use an encoder--decoder model.
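For illustration, such a decoding loop might look roughly like this (a sketch assuming index-encoded characters, a <START> index, and a two-input model like the sketch above; this is not the actual code from the gist):

import numpy as np

def greedy_decode(model, enc_input, start_idx, max_len):
    # enc_input has shape (1, in_seq_len); the decoder input is re-predicted in
    # full at every step, growing by one known token per iteration
    dec_input = np.zeros((1, max_len), dtype='int32')
    dec_input[0, 0] = start_idx
    for t in range(max_len - 1):
        preds = model.predict([enc_input, dec_input])
        # keep only the prediction for timestep t and feed it in as input t+1
        dec_input[0, t + 1] = np.argmax(preds[0, t])
    return dec_input[0, 1:]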
@mbollmann Thank you for sharing your code!
I saw your post also here: #2968: keras does not currently support feeding output into input
Since you subclassed LSTM, the step function does not change, which means you need to somehow integrate the output y_tm1 (suppose you have, say, another dense layer from h_tm1 to get that) into either h_tm1 or c_tm1. That is different from what Cho (appendix A) does, which is to use a separate weight matrix for y_tm1. Am I correct?
I did a minimal subclass myself, but differently: I changed the step function to compute that y_tm1 and process it with its own weight matrix. Below is the class:
import numpy as np
from keras import backend as K
from keras import activations, initializations
from keras.engine.topology import InputSpec
from keras.layers.recurrent import Recurrent

class DecoderLSTM(Recurrent):
    # Since the initial hidden state is replicated from the input, there should be
    # input_dim == hidden_dim
    def __init__(self, output_dim,
                 init='glorot_uniform', inner_init='orthogonal',
                 forget_bias_init='one', activation='tanh',
                 out_activation='linear', inner_activation='hard_sigmoid',
                 **kwargs):
        self.output_dim = output_dim
        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.forget_bias_init = initializations.get(forget_bias_init)
        self.activation = activations.get(activation)
        self.out_activation = activations.get(out_activation)
        self.inner_activation = activations.get(inner_activation)
        super(DecoderLSTM, self).__init__(**kwargs)

    def call(self, x, mask=None):
        # input shape: (nb_samples, time (padded with zeros), input_dim)
        # note that the .build() method of subclasses MUST define
        # self.input_spec with a complete input shape.
        input_shape = self.input_spec[0].shape
        # state format: [h(t-1), c(t-1), y(t-1)]
        h_0 = K.zeros_like(x[:, 0, :])
        c_0 = K.zeros_like(x[:, 0, :])
        y_0 = K.zeros_like(x)           # (samples, timesteps, input_dim)
        y_0 = K.sum(y_0, axis=(1, 2))   # (samples,)
        y_0 = K.expand_dims(y_0)        # (samples, 1)
        y_0 = K.tile(y_0, [1, self.output_dim])  # (samples, output_dim)
        initial_states = [h_0, c_0, y_0]
        last_output, outputs, states = K.rnn(step_function=self.step,
                                             inputs=x,
                                             initial_states=initial_states,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             constants=None,
                                             unroll=self.unroll,
                                             input_length=input_shape[1])
        if self.return_sequences:
            return outputs
        else:
            return last_output

    def build(self, input_shape):
        self.input_spec = [InputSpec(shape=input_shape)]
        self.input_dim = input_shape[2]
        self.W = self.init((self.input_dim, 4 * self.input_dim),
                           name='{}_W'.format(self.name))
        self.U = self.inner_init((self.input_dim, 4 * self.input_dim),
                                 name='{}_U'.format(self.name))
        self.A = self.init((self.output_dim, 4 * self.input_dim),
                           name='{}_A'.format(self.name))
        self.b = K.variable(np.hstack((np.zeros(self.input_dim),
                                       K.get_value(self.forget_bias_init((self.input_dim,))),
                                       np.zeros(self.input_dim),
                                       np.zeros(self.input_dim))),
                            name='{}_b'.format(self.name))
        self.V_y = self.init((self.input_dim, self.output_dim),
                             name='{}_V_y'.format(self.name))
        self.b_y = K.zeros((self.output_dim,), name='{}_b_y'.format(self.name))
        self.trainable_weights = [self.W, self.U, self.A, self.b,
                                  self.V_y, self.b_y]
        if self.initial_weights is not None:
            self.set_weights(self.initial_weights)
            del self.initial_weights

    def step(self, x, states):
        h_tm1 = states[0]
        c_tm1 = states[1]
        y_tm1 = states[2]
        z = K.dot(x, self.W) + K.dot(h_tm1, self.U) + K.dot(y_tm1, self.A) + self.b
        z0 = z[:, :self.input_dim]
        z1 = z[:, self.input_dim: 2 * self.input_dim]
        z2 = z[:, 2 * self.input_dim: 3 * self.input_dim]
        z3 = z[:, 3 * self.input_dim:]
        i = self.inner_activation(z0)
        f = self.inner_activation(z1)
        c = f * c_tm1 + i * self.activation(z2)
        o = self.inner_activation(z3)
        h = o * self.activation(c)
        y = self.out_activation(K.dot(h, self.V_y) + self.b_y)
        return y, [h, c, y]

    def get_config(self):
        config = {'output_dim': self.output_dim,
                  'init': self.init.__name__,
                  'inner_init': self.inner_init.__name__,
                  'forget_bias_init': self.forget_bias_init.__name__,
                  'activation': self.activation.__name__,
                  'out_activation': self.out_activation.__name__,
                  'inner_activation': self.inner_activation.__name__}
        base_config = super(DecoderLSTM, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
But by doing this I can only do one layer, and I have to write every layer I want to add myself. Your approach of feeding one input at a time enables stacking any structure of layers, which would be awesome.
How do you think your approach could be modified to have a dedicated weight matrix for y_tm1
? Would that require modifying the step
function?
I have not tried to _decode_ your Decoder yet (process one value at a time and feed output into next input). I am about to try that in a bit, I probably will have questions since I am not good at Keras.
Thanks again!
Since you subclassed LSTM, the step function does not change, which means you need to somehow integrate the output y_tm1 (suppose you have, say, another dense layer from h_tm1 to get that) into either h_tm1 or c_tm1. That is different from what Cho (appendix A) does, which is to use a separate weight matrix for y_tm1. Am I correct?
I'm not sure I understand what you're getting at. The way I understand appendix A in the Cho article (but I just took a quick glance) is that it's simply feeding the model's output at timestep t-1 back into the decoder as input for timestep t. Notice that the decoder equations are almost identical to the encoder equations, except that instead of x_t
they use y_t-1
. This would mean that the "separate weight matrix for y_tm1
" is just the normal input weight matrix of the decoder LSTM, no?
You are right, x(t) _is_ y(t-1). I apologize, I did not communicate well. I meant to refer to the context vector c, which is also fed as input at every time step.
Does your approach accept c as input for every timestep (I don't want to misunderstand your code, so I want to ask you directly)? If so, is it done by concatenating c with y(t-1)? If not, how could it be modified to support c?
Thank you for your time!!
Does your approach accept c as input for every timestep (I don't want to misunderstand your code, so I want to ask you directly)? If so, is it done by concatenating c with y(t-1)? If not, how could it be modified to support c?
The code I posted doesn't do that -- it just does the hidden state transfer part, but a little less sophisticated than in the Cho article, by simply setting h'_0 = h_N
(if N
is the last timestep of the encoder).
So basically, the context vector c
seems to be simply h_N
fed through another tanh activation. My custom layer already takes h_N
as its second input, so I probably wouldn't modify the inputs, but simply do the tanh activation in the decoder layer. (I'm also not quite sure why you even need it...)
However, to modify the layer to also use c
/h_N
in each hidden state update (the last part of the three decoder equations in the Cho article), you indeed have to write a custom step function, I think.
@mbollmann
Just a little update from me:
I managed to build a custom layer derived from LSTM that takes two inputs (actual input + initial hidden state) and returns two outputs (actual output + last hidden state). This way, I was able to build an encoder--decoder model where the only connection between encoder and decoder was the hidden state transfer, and successfully train it in a way that simply wasn't possible with the StateTransferLSTM from seq2seq (on my task, I get 3% accuracy with seq2seq's StateTransferLSTM using broadcast_state(), and 93%+ with my own custom LSTM). I believe that adding the transferred hidden state to the computational graph is the crucial difference here.
What kind of data / problem were you training on when you got these differences in numbers?
@sallamander
What kind of data / problem were you training on when you got these differences in numbers?
Character-based string transduction. But the task isn't important; what's important is that I used a decoder that takes as input only a constant start symbol (at timestep 0) and the model prediction at the previous timestep (everywhere else). In a nutshell:
I suspect seq2seq's StateTransferLSTM has either one or both of these problems, hence the 3% accuracy I got.
@mbollmann
Right, I agree - I was mostly just curious and wanted to play around with the example that you were using.
Yeah, I have a hunch it might be (2), but I'm working on digging into the seq2seq
implementation in the hopes of understanding where it's going wrong (which is what it looks like) and hopefully fixing it up.
For (2), do you mean the decoder? If not, then what transferred hidden state are you getting for the encoder? Or, do you mean that the _encoder_ will never learn anything because the gradients won't properly be passed back in the backprop step, or no real gradients will be passed back, or something like that?
In any case, how would you end up not backpropagating along the transferred hidden state, _unless_ you weren't setting it correctly as the initial hidden state of the decoder? There was some discussion about it being a bad idea to use stateful=True
and set an initial hidden state for the decoder - is this at all tied to your comment about back propagating along the transferred hidden state (seq2seq
does this, and I think this may be part of the issue)? I could imagine that if you set the initial state in the decoder but also set stateful=True
, then in some sense you're not backpropagating along the transferred hidden state, because after the first updates the transferred hidden state won't be seen again. Is this what you were referring to?
@sallamander
For (2), do you mean the decoder? If not, then what transferred hidden state are you getting for the encoder? Or, do you mean that the encoder will never learn anything because the gradients won't properly be passed back in the backprop step, or no real gradients will be passed back, or something like that?
Yes, the latter. In my model, the output of the encoder is never used as _input_ for the decoder, but _only_ for setting the decoder's hidden state, so if this connection isn't added to the computational graph, there effectively _is no_ connection between the two during backprop (and no gradients get passed).
Or that's how I understand it - I must confess that not all the details of how Keras constructs the computational graph are clear to me. I've never done it "manually" in Theano/Tensorflow, and Keras is really good at hiding the gory details away from you. I just went with @braingineer's comment above (https://github.com/fchollet/keras/issues/2995#issuecomment-226597313) who suggests taking the initial hidden state as _a second input_, claiming:
the major reason behind doing it this way is that it adds the conditioning vector you want to be your initial state into Keras' computational graph. This saves a ton of headaches.
Maybe he can explain better than me how exactly this happens and what the actual difference is. (I'd certainly be interested!)
Interesting, thanks!
Or that's how I understand it
That's how I understand it, too 👍
I'm hopeful that I can issue a PR on Keras and get this worked in, but yeah, the details of building it into the computational graph are a little tricky.
@mbollmann Hey again! Couple more questions:
First:
I'm trying to use your custom HiddenStateLSTM layer, but am running into shape assertion errors. I structure my data like this: the actual input is always a sequence (let's say of chars) of a certain length. This goes to the encoder. The decoder input is also a sequence of some (other) length, which at first consists only of a start symbol with padding, and in each subsequent sample one char is added (mock prediction / teacher forcing). The output, similarly, starts out with the first char to be predicted, and subsequent chars are added in each sample.
So, my input has shape (n_samples, in_seq_len,)
for encoder input, (n_samples, dec_seq_len,)
for decoder input, and output shape is (n_samples, dec_seq_len, num_chars)
. Here is the error I run into:
File "seq2seq_lm.py", line 134, in train
self.build_model()
File "seq2seq_lm.py", line 105, in build_model
dec_layer, _, _ = HiddenStateLSTM(64, dropout_W=0.5, return_sequences=True)([dec_layer] + hidden)
File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 489, in __call__
self.build(input_shapes)
File "layers/decoderlstm.py", line 18, in build
assert shape[0] == input_shape[0]
TypeError: 'NoneType' object has no attribute '__getitem__'
This happens in the build function:
def build(self, input_shape):
    if isinstance(input_shape, list) and len(input_shape) > 1:
        input_shape = input_shape[0]
        hidden_shapes = input_shape[1:]
        for shape in hidden_shapes:
            assert shape[0] == input_shape[0]  # fails
            assert shape[-1] == self.output_dim
    super(HiddenStateLSTM, self).build(input_shape)
I pretty much adopted your sample code, I'm not sure if my data is structured the same way as yours though.
Second question:
If we only "manually" feed predicted outputs to the next input, they are not in fact part of the computational graph, correct? I see how the network should still learn in general, since the correct prediction is being taught locally and state is still kept, but in theory the connection of previous output to next input should be symbolic, no? This question is a bit more general though, for now, thanks a lot for providing your code!
@mbollmann Follow-up on number one: this bug actually occurs when I run the code snippet you posted as well, so it seems to be related to the way the model is set up, not the structure of the data. Could you have another look and see if your gist is up to date?
@phdowling I just ran my code snippet again and it produces no such error. What Python/Keras versions are you using?
Re your second question:
I'm not sure if it should be symbolic actually, since for decoding, you need to make a choice which output class to select (and therefore use as input to the next timestep). This can be a simple argmax (which could be expressed symbolically I guess), but you could also do beam search (keeping several options, and only make the final decision several timesteps later) or filter the output classes via some external criteria (e.g. a lexicon of wellformed words/sequences). In that case, a symbolic connection during training couldn't really model the situation during decoding anyway.
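As an illustration of the beam-search variant, a naive sketch over the kind of two-input model discussed above might look like this (end-of-sequence handling is omitted for brevity; all names are placeholders):

import numpy as np

def beam_search_decode(model, enc_input, start_idx, max_len, beam_width=3):
    # each beam entry is (cumulative log-probability, list of token indices)
    beams = [(0.0, [start_idx])]
    for _ in range(max_len - 1):
        candidates = []
        for log_prob, seq in beams:
            dec_input = np.zeros((1, max_len), dtype='int32')
            dec_input[0, :len(seq)] = seq
            # distribution over the next token, taken at the last filled position
            preds = model.predict([enc_input, dec_input])[0, len(seq) - 1]
            for idx in np.argsort(preds)[-beam_width:]:
                candidates.append((log_prob + np.log(preds[idx] + 1e-10),
                                   seq + [int(idx)]))
        # keep only the `beam_width` best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1][1:]  # best sequence, without the start symbol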
@mbollmann I'm using Python 2.7.12 with Keras 1.1.0 (installed from git). The bug I referred to was caused by my own adjustments, sorry for bothering you with that. FYI, here is where I went wrong:
I had to slightly adjust your code to make it Python 2 compatible. Line 16:

input_shape, *hidden_shapes = input_shape

became

input_shape = input_shape[0]
hidden_shapes = input_shape[1:]  # wrong, because input_shape was just written to

which should of course be

input_shape, hidden_shapes = input_shape[0], input_shape[1:]
This fixes what I initially reported. However, I still see an exception after fixing this:
Traceback (most recent call last):
  File "layers/decoderlstm.py", line 114, in <module>
    dec_layer, _, _ = HiddenStateLSTM(64, dropout_W=0.5, return_sequences=True)([dec_layer] + hidden)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 514, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 572, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 154, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors, mask=input_masks))
  File "/Users/dowling/Development/idp/char-idp/layers/decoderlstm.py", line 68, in call
    input_length=input_shape[1])
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 832, in rnn
    initial_output = step_function(inputs[0], initial_states + constants)[0] * 0
TypeError: unsupported operand type(s) for +: 'TensorVariable' and 'list'
Here is a gist of my changed code: https://gist.github.com/phdowling/ae731089a1497ab0dd8177114ad60881
The other change I made was here: https://gist.github.com/phdowling/ae731089a1497ab0dd8177114ad60881#file-hiddenstatelstm-py-L110-L111 , but I don't think this is an incorrect translation of the Python 3 code. Does this gist throw an error for you? If so, I guess the problem might be a Python 3 / 2 issue somewhere.
This was again my mistake, I made the same error as above in another spot and forgot about it. I updated my gist, the example is working now!
Thanks also for your insight on the second point!
@mbollmann When you train your model, do you feed it "partial" prediction states in the training phase? By that I mean, for a generated output of length 5, do you feed the model 5 samples that include the iterations of the output generator, or do you only feed the final state?
During evaluation the intermediate steps are necessary of course to feed the partial results, but I'm thinking during training, since we do teacher forcing, it should be enough to feed the complete expected output with the corresponding ground-truth decoder inputs, right?
@phdowling Ah, cool, I saw you made that adjustment but didn't notice the mistake either!
Yes, I only feed the final (ground-truth) state during training. I am aware of more complex approaches to training such a model (e.g. Sequence-to-sequence learning as beam-search optimization), but have not looked into them yet.
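For concreteness, preparing the teacher-forcing arrays for one target sequence could look like this (a sketch assuming index-encoded characters with reserved <PAD> and <START> indices and one-hot targets):

import numpy as np

def make_decoder_arrays(target_indices, max_len, num_chars, start_idx=1, pad_idx=0):
    # decoder input:  <START> y_1 ... y_{n-1}  (targets shifted right by one)
    # decoder target: y_1     y_2 ... y_n      (one-hot encoded)
    # assumes len(target_indices) <= max_len
    dec_input = np.full((max_len,), pad_idx, dtype='int32')
    dec_input[0] = start_idx
    dec_input[1:len(target_indices)] = target_indices[:-1]
    dec_target = np.zeros((max_len, num_chars), dtype='float32')
    for t, idx in enumerate(target_indices):
        dec_target[t, idx] = 1.0
    return dec_input, dec_target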
@mbollmann
Also, note that I have my own (very hacky and inefficient) decoding functions built around these models, which take care of feeding the predicted output of one timestep as input to the next one. (Basically, for n timesteps, I'm just doing n separate predictions, each one a timestep longer than the previous one...) You'll probably have to do something to that effect to efficiently use an encoder--decoder model.
Would it be possible for you to share the code that does the decoding part after training the model?
@mbollmann I have read the whole discussion... passing C as input doesn't help... but can you suggest an encoder-decoder framework in Keras?
@suragnair Sorry I totally forgot to respond. Do you still need this? I'd look into extracting the relevant parts of my code then.
@harikrishnavydana What do you mean by "doesn't help"? The way I've discussed it here works for me and that's what I still (successfully) use for encoder-decoder. I don't know of a more convenient framework for implementing this, unfortunately.
@mbollmann
In an encoder-decoder framework, how do you use the output of the present state as input to the decoder in Keras?
@harikrishnavydana AFAIK it's not possible except by using a workaround like I've already explained above and in this post: https://github.com/fchollet/keras/issues/2968#issuecomment-226727979
@harikrishnavydana @mbollmann Don't know if it's worth noting, but I've got a PR open for this: https://github.com/fchollet/keras/pull/3947
It's worth doing because it's a key feature missing in Keras RNN architectures... being able to use the output of the previous time step in predicting the present time step is also a vital feature for many sequential models. @mbollmann @sallamander
@mbollmann @sallamander @fchollet @RamaneekGill
How do we get the Keras version with the broadcasting option included in the RNN?
@harikrishnavydana You can clone the Keras repository from my fork.
@sallamander By broadcasting the hidden state and using a batch size of 1, can I manually create a depth-first implementation of a Keras RNN... is this the workaround that was mentioned?
@harikrishnavydana the implementation I pointed you to doesn't require a batch size of 1, although if I understand correctly you want both hidden state broadcasting and something like input feeding (e.g. during training, you feed the predicted y_{t-1}
as input for y_t
, instead of the actual)... is that right?
@sallamander Yes, that is actually what I am looking for. Can this be done with this fork?
@harikrishnavydana My fork offers the broadcasting of hidden state, and if you restrict yourself to a batch size of 1, then I believe you could do it, yes.
The reset_states method of the recurrent layer in Keras allows passing the initial hidden state to the model. Specifically for LSTM, it allows setting the initial hidden state h(t) and the cell state C(t).
"You can specify the initial state of RNN layers numerically by
calling reset_states
with the keyword argument states
. The value of
states
should be a numpy array or list of numpy arrays representing
the initial state of the RNN layer"
as written in the layers/recurrent.py in the Keras Source Code
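In code, that numeric variant might look roughly like this (a sketch; the layer has to be stateful for reset_states to be usable this way, and the names and shapes are placeholders):

import numpy as np

# `decoder_lstm` is assumed to be a stateful LSTM layer of an already-built model
batch_size, units = 32, 256
h0 = np.random.random((batch_size, units))
c0 = np.random.random((batch_size, units))
# LSTM keeps two states (h and c), so a list of two arrays is passed
decoder_lstm.reset_states(states=[h0, c0])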
For an LSTM unit, there are two states, one being the hidden state and the other the cell state:

lstm_output = LSTM(units=units, return_sequences=False)(x, initial_state=[your_initial_state, your_initial_state])

For a GRU unit, there is only one state:

gru_output = GRU(units=units, return_sequences=False)(x, initial_state=your_initial_state)

your_initial_state should have the same shape as the output shape of the LSTM unit.
For Keras 2.0.6, it works.
@THUfl12 @samre12 So, if I am trying to implement a Seq2Seq model and pass the encoder state to the decoder, the code should look something like this?
Encoder = LSTM(units=HIDDEN_STATE_SIZE, return_state=True)(x)
Decoder = LSTM(units=HIDDEN_STATE_SIZE, return_sequences=True)(Encoder[0], initial_state=Encoder[1:])
@rohit-gupta The documentation suggests that
"You can specify the initial state of RNN layers symbolically by calling them with the keyword argument initial_state. The value of initial_state should be a tensor or list of tensors representing the initial state of the RNN layer."
I believe that this translates to what you have suggested, though I haven't tried it explicitly.
I am looking for a Seq2Seq model for encoding and decoding. How do I get the last state from the LSTM encoder and set it as the initial state for the LSTM decoder? @rohit-gupta, does your suggested code work?
@ramakrse Something like this should work:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

inputs = Input(batch_shape=(8, 16, 1024))
encoder = LSTM(256, return_state=True, return_sequences=True)
outputs = encoder(inputs)
output, state = outputs[0], outputs[1:]
decoder = LSTM(256, return_sequences=True)(output, initial_state=state)
# Dense needs an output size; num_words is a placeholder for the target vocabulary size
num_words = 10000
output_words = TimeDistributed(Dense(num_words, activation='softmax'))(decoder)
model = Model(inputs, output_words)
@rohit-gupta This would only work for target sequences with the same length as input sequences. Also this does not appear to implement canonical seq2seq.
Here's my own implementation, which as far as I can tell implements canonical seq2seq. I haven't tried checking that it works in practice, as this isn't my domain and this was just a quick aside for me. I will flesh it out as an example script later on when I have time.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed
import numpy as np

num_encoder_tokens = 100
num_decoder_tokens = 100
encoder_seq_length = None
decoder_seq_length = None
batch_size = 128
epochs = 3

# Dummy data
input_seqs = np.random.random((1000, 10, num_encoder_tokens))
target_seqs = np.random.random((1000, 10, num_decoder_tokens))

# Define training model
encoder_inputs = Input(shape=(encoder_seq_length, num_encoder_tokens))
encoder = LSTM(256, return_state=True, return_sequences=True)
encoder_outputs = encoder(encoder_inputs)
_, encoder_states = encoder_outputs[0], encoder_outputs[1:]

decoder_inputs = Input(shape=(decoder_seq_length, num_decoder_tokens))
decoder = LSTM(256, return_sequences=True)
decoder_outputs = decoder(decoder_inputs, initial_state=encoder_states)
decoder_outputs = TimeDistributed(
    Dense(num_decoder_tokens, activation='softmax'))(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Training
model.compile(optimizer='rmsprop', loss='mse')
model.fit([input_seqs, target_seqs], target_seqs,
          batch_size=batch_size, epochs=epochs)

# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
#    and a "start of sequence" token as target.
#    Output will be the next target token
# 3) Append the target token and repeat

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs = decoder(decoder_inputs, initial_state=decoder_states)
decoder_model = Model([decoder_inputs] + decoder_states, decoder_outputs)

# Dummy data
input_seqs = np.random.random((1000, 10, num_encoder_tokens))
target_seqs = np.random.random((1000, 1, num_decoder_tokens))

# Sampling loop for a batch of sequences
states_values = encoder_model.predict(input_seqs)
stop_condition = False
while not stop_condition:
    output_tokens = decoder_model.predict([target_seqs] + states_values)
    # sampled_token = ...  # sample the next token
    # target_seqs = ...    # append token to targets
    # stop_condition = ... # stop when "end of sequence" token is generated
    break
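Purely for illustration, the commented-out sampling steps could be filled in along these lines, assuming the decoder model is extended so its output also goes through the softmax Dense layer from training (i.e. output_tokens has shape (batch, t, num_decoder_tokens)); end_token_index and the length cutoff are made up:

end_token_index = 0      # hypothetical index of the "end of sequence" token
max_decoder_length = 50  # hypothetical length cutoff

states_values = encoder_model.predict(input_seqs)
stop_condition = False
while not stop_condition:
    output_tokens = decoder_model.predict([target_seqs] + states_values)
    # sample the next token greedily from the last timestep
    sampled = np.argmax(output_tokens[:, -1, :], axis=-1)
    # append the sampled token (one-hot encoded) to the target sequences
    next_step = np.zeros((target_seqs.shape[0], 1, num_decoder_tokens))
    next_step[np.arange(target_seqs.shape[0]), 0, sampled] = 1.0
    target_seqs = np.concatenate([target_seqs, next_step], axis=1)
    # stop when every sequence has produced the "end of sequence" token
    # or the length cutoff is reached
    stop_condition = (np.all(sampled == end_token_index)
                      or target_seqs.shape[1] >= max_decoder_length)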
Thanks very much for this snippet.
Note that it's implementing "teacher forcing", i.e. feeding the correct labels to the decoder during training, if I'm reading correctly. This is known to hurt generalization, and is not the canonical seq2seq, I think, where the decoder is conditioned on its own previous output, and can't see the correct labels.
Is there a way to modify the code to do that?
@antishok Well, current best practice is to use Scheduled Sampling (ref: https://arxiv.org/abs/1506.03099), which essentially means you have an additional hyperparameter that decides what fraction of inputs will be trained using the correct values (teacher forcing) and what fraction will use the inferred values. Actually, if you can't implement scheduled sampling for some reason, simply using teacher forcing works fine.
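As a very rough illustration of the idea (not the per-token procedure from the paper), one could flip a coin per batch and, with a probability that decays over epochs, replace the teacher-forced decoder inputs with the model's own greedy predictions. Everything below is a hypothetical sketch assuming index-encoded decoder inputs and a two-input encoder-decoder model as discussed earlier in this thread:

import numpy as np

def scheduled_sampling_inputs(model, enc_batch, dec_batch_true, epoch, k=10.0):
    # inverse-sigmoid decay of the teacher-forcing probability (cf. the paper)
    teacher_forcing_prob = k / (k + np.exp(epoch / float(k)))
    dec_batch = dec_batch_true.copy()
    if np.random.rand() > teacher_forcing_prob:
        # one forward pass with ground-truth inputs, then feed the model's own
        # argmax predictions (shifted by one step) back in as decoder inputs
        preds = model.predict([enc_batch, dec_batch_true])
        dec_batch[:, 1:] = np.argmax(preds, axis=-1)[:, :-1]
    return dec_batch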
@fchollet Thanks for developing the Seq-2-Seq script, I am currently working on a translation related project and I can clean it up and send as a PR (adding Seq2Seq example) over the weekend if that would be useful.
@fchollet thanks for the sample code. It is very helpful to understand how to use keras for seq2seq learning.
I noticed that you wrote a post about it as well. Can you explain why you use TimeDistributed
layer in the above code but not in the blogpost (https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)?
@jerrypaytm Does the code without TimeDistributed work for you ? I couldn't get it to work without TimeDistributed.
(Which TBH makes sense, if the output is variable length, you will need TimeDistributed to make it work)
@rohit-gupta It works for me without TimeDistributed. I haven't tried it with TimeDistributed yet.
@fchollet and @rohit-gupta can you confirm that the seq2seq model can be loaded? I saved the model as suggested in this script but it fails to load the model.
It is a ValueError
when I try loaded_model = load_model('s2s.h5')
Layer lstm_15 expects 1 inputs, but it received 3 input tensors.
Never mind, I was using keras==2.0.6; upgrading to the latest version fixed the model loading issue.
@mbollmann Thank you for sharing your code!
In my project I need to set the initial state at each batch. Is it possible to do that with your HiddenStateLSTM class?
@fchollet How can your code be changed for target sequences with a different length from the input sequence? E.g. the target being longer than the input sequence?