Hi,
I'm still working on sequence to sequence learning using an encoder and decoder architecture, and I'd like to implement a different approach than what I've mostly seen with keras.
The common approach seems to be to encode the inputs using one RNN with return_sequences=False, take that vector, repeat it output_maxlen times, and stack a decoder on top of that.
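For reference, a minimal sketch of that common approach might look like this (layer sizes and variable names are just placeholders):

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

# placeholder dimensions
input_maxlen, output_maxlen = 20, 20
input_dim, num_classes, hidden_size = 50, 100, 128

model = Sequential()
# encoder: compress the whole input sequence into a single vector
model.add(LSTM(hidden_size, input_shape=(input_maxlen, input_dim)))
# repeat that vector once per output timestep
model.add(RepeatVector(output_maxlen))
# decoder: unrolls over the repeated vector
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(num_classes, activation='softmax')))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')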
This is a fine approach, but from what I understand it's a bit different from some of the literature (like Sutskever's Seq2Seq model). What I'd actually like to do is this:
This would require essentially two compiled models: for training, we input all the ground-truth outputs (shifted by one index) to train the decoder; during testing, we need the decoder to be stateful, feed the input encoding into its hidden state and a <START> tag as the first input, and then feed it each of its previous predictions one at a time.
All of this should be possible to implement with Keras from what I see, except for the part where we condition the decoder by feeding it an initial hidden state. Is there a way to do this in Keras? I could feed it the hidden state as the first input instead, but this is not ideal since the input vectors are tokens (or even characters) after that, so they sort of occupy a different space, conceptually.
Another important point: I need to specify the state symbolically, so that the entire model stays trainable.
My best guess currently is to set the decoder to be stateful even in training, set the rnn.states variable to the symbolic output of the previous layer, and reset the state after each sample. Should this work?
I'm not sure about your first post. Are you saying that the first input would just be the conditioning vector?
That could work but it's not the same as the first hidden state being the vector, because it would be addressed with a different weight matrix.
a hacky solution I've used for similar situations:
One super easy way to accomplish this is to subclass whichever RNN (LSTM, GRU, whatever) and have the call(...) take as input a tupled x. So, when you call it, it becomes rnn([x_input, x_initial]), where x_input is the actual data and x_initial is going to be used to override the initial states.
now, in the rnn's call function:
# first, unpack if you need to
if isinstance(x, (tuple, list)):
    x, custom_initial = x
else:
    custom_initial = None

# then set the custom initial
if self.stateful and custom_initial:
    raise Exception('some message here about how this is bad')
elif custom_initial:
    # the bit we care about
    initial_states = custom_initial
elif self.stateful:
    initial_states = self.states
else:
    initial_states = self.get_initial_states(x)
the thing you need to take care of with this approach: you have to adjust all other instances where the input is fed into the layer to account for the tupled input. I think compute_mask and get_output_shape_for will be the two functions you need to adjust.
the major reason behind doing it this way is that it adds the conditioning vector you want to be your initial state into Keras' computational graph. This saves a ton of headaches.
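To make that concrete, here is a rough sketch of such a subclass, modeled on the Keras 1.x Recurrent.call. The class name is made up, internals differ slightly between Keras versions, and the gists linked later in this thread are the actually tested versions:

from keras import backend as K
from keras.layers import LSTM

class ConditionalLSTM(LSTM):
    """LSTM that can be called as layer([x, h_0, c_0]) to override its initial states."""

    def build(self, input_shape):
        # when called with a list, only the first entry is the real input shape
        if isinstance(input_shape, list):
            input_shape = input_shape[0]
        super(ConditionalLSTM, self).build(input_shape)

    def call(self, x, mask=None):
        # unpack the "tupled" input
        if isinstance(x, (tuple, list)):
            x, custom_initial = x[0], list(x[1:])
            if isinstance(mask, list):
                mask = mask[0]
        else:
            custom_initial = None

        input_shape = self.input_spec[0].shape
        if self.stateful and custom_initial:
            raise Exception('Custom initial states are not compatible with stateful=True')
        elif custom_initial:
            # the bit we care about: condition on the encoder output
            initial_states = custom_initial
        elif self.stateful:
            initial_states = self.states
        else:
            initial_states = self.get_initial_states(x)

        # the rest mirrors the stock Recurrent.call (Keras 1.x)
        constants = self.get_constants(x)
        preprocessed_input = self.preprocess_input(x)
        last_output, outputs, states = K.rnn(self.step, preprocessed_input,
                                             initial_states,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             constants=constants,
                                             unroll=self.unroll,
                                             input_length=input_shape[1])
        if self.stateful:
            self.updates = []
            for i in range(len(states)):
                self.updates.append((self.states[i], states[i]))

        if self.return_sequences:
            return outputs
        return last_output

    def compute_mask(self, input, mask):
        # only the mask of the actual data input matters
        if isinstance(mask, list):
            mask = mask[0]
        if self.return_sequences:
            return mask
        return None

    def get_output_shape_for(self, input_shape):
        if isinstance(input_shape, list):
            input_shape = input_shape[0]
        return super(ConditionalLSTM, self).get_output_shape_for(input_shape)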
Cool, thanks - that looks like what I'm looking for. I notice you're raising an exception on purpose - any particular reason this wouldn't work with a stateful model? Rather, is there any way to do prediction with this _without_ a stateful model?
Also, you say this adds the conditioning vector into the computational graph. Since the conditioning vector should actually be a symbolic variable (output of the encoder), does that mean this approach lets me train the whole encoder-decoder model's weights? That would of course be important.
A stateful model would preclude having an initial hidden state because it would be the hidden state.
I didn't include the exception because I'm lazy, haha. Also, I'm presuming this would be for personal use, so the exception should be thrown in there for your own sanity, not to give guidelines to users.
edit: the nth pass of a stateful model where n > 0 would preclude it. The initial hidden state of a stateful model is the last hidden state from the last pass.
What about after a state reset? There must still be _some_ initial hidden state, why can't we set it?
my best advice is to check the code. all of those questions are in the class definition. and you'll get a far better understanding of what's going on and how to mold it to your situation.
the hidden state is currently not able to be set because no one has written the code to do it. Or they have, but haven't PRed it. It gets initialized to None when it's not stateful. Normal use cases never call reset_states.
also, you're not talking about setting the initial hidden state. you're talking about linking it into the computational graph. That is a different situation. If you want to see what that means, I would start with keras.engine.topology
and the Layer.__call__
. Keras has its own computational graph to give it a lot of the functionality it has. this is why I recommended just spoofing the inputs as a fast and easy workaround.
Fair enough, thanks for your help. I think my test-time problem can be solved by simply explicitly setting the state variable in a stateful RNN in the beginning and after a reset (no need for symbolic links there, since we're only evaluating). I'm also looking into the seq2seq package to see if I can manage to port it to Keras 1.0.4; that would also provide many more tools to play with.
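For the test-time part, something along these lines should work, assuming a stateful decoder LSTM in an already-built model (the names are placeholders):

import numpy as np
from keras import backend as K

# `decoder` is assumed to be a stateful LSTM layer from an already-built model,
# `encoding` a numpy array of shape (batch_size, hidden_dim) from the encoder
def prime_decoder(decoder, encoding):
    decoder.reset_states()
    # LSTM keeps two states (h and c); here both are seeded with the encoding
    K.set_value(decoder.states[0], encoding)
    K.set_value(decoder.states[1], encoding)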
So in seq2seq, there's this StateTransferLSTM layer which I've ported to Keras 1.0.4, and which claims to transfer its hidden state to other LSTM layers you have specified.
It does that by manipulating the target layer's updates
variable, which I tried to investigate in the Keras code, but it's still not completely clear to me what this actually does. In particular, this seems completely different from the suggestion by @braingineer, and particularly the comment about there being a difference between just _setting_ the state and making it part of the _computational graph_ makes me wonder if the seq2seq implementation is actually correctly doing what it advertises to do. (I do realize that broadcasting state from one layer to another and setting an arbitrary initial state is not exactly the same thing, though.)
(Also, I've checked out Layer.__call__
but I don't know what I'm seeing. My current idea of the Keras computational graph is that it's basically building an abstract representation of all the computations the model has to do, so the backend can thoroughly optimize them for efficient calculation, but I'm not sure if this is accurate. Also, it maybe strays a bit far from the original question...)
@mbollmann Could you point me to your port of the StateTransferLSTM? I'm trying to port it myself, but there seems to be trouble with the order in which my layers are being built inside the seq2seq model.
As to whether seq2seq works correctly: I'd say if it learns when trained, it probably does, since a missing connection in the computational graph would make backprop fail to work correctly (though there might of course just be another connection that keeps things together).
I believe I've done nothing but port the get_output
function to a call
function as described in the Keras migration docs. The code currently lives in a private repo, but here's a gist of the StateTransferLSTM:
https://gist.github.com/mbollmann/2d38bd38259a03ea83999de32dbe3466
I've successfully used this with a model similar to the SimpleSeq2seq
one, though I've not tried the more advanced models of the seq2seq repo with it.
@braingineer Can I ask you a question? You mentioned taking care of the input for functions like compute_mask() and get_output_shape_for(). I was thinking about modifying functions that have x as input, like step(self, x, states) and get_constants(self, x), etc. I am very new to Theano and Keras and I am trying to understand this. Could you explain how to know which functions to modify? If I create a class Decoder(Recurrent), do I need to write the functions the way they are used in the LSTM class, or do I also need to override functions that are only in the base class Recurrent?
Many thanks for your time!
Just a little update from me:
I managed to build a custom layer derived from LSTM that takes two inputs (actual input + initial hidden state) and returns two outputs (actual output + last hidden state). This way, I was able to build an encoder--decoder model where the only connection between encoder and decoder was the hidden state transfer, and successfully train it in a way that simply wasn't possible with the StateTransferLSTM from seq2seq (on my task, I get 3% accuracy with seq2seq's StateTransferLSTM using broadcast_state()
, and 93%+ with my own custom LSTM). I believe that adding the transferred hidden state to the computational graph is the crucial difference here.
@braingineer, your explanations in this thread were crucial for me, I probably wouldn't have managed to do this without them! Thanks a lot!
@mbollmann Any chance you could share that layer code? It could be very useful for many people, for me at the very least 馃憤
@phdowling Here you go:
https://gist.github.com/mbollmann/29fd21931820c64095617125824ea246
That implementation might be quite hacky and possibly incomplete/buggy, but it works fine for my use case. LSTM actually keeps two internal states, and right now I transfer them both -- I'm not sure if this is what you'd want, but it didn't seem to make much of a difference.
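For anyone reading along: based on the interface described here (and used further down in this thread), a hypothetical usage sketch could look like this. HiddenStateLSTM is the class from the gist above, and all sizes are placeholders:

from keras.models import Model
from keras.layers import Input, Embedding, TimeDistributed, Dense
# HiddenStateLSTM: the custom layer from the gist linked above

# placeholder sizes
in_seq_len, dec_seq_len, num_chars = 20, 20, 100

enc_in = Input(shape=(in_seq_len,), dtype='int32')
dec_in = Input(shape=(dec_seq_len,), dtype='int32')
enc_emb = Embedding(num_chars, 64)(enc_in)
dec_emb = Embedding(num_chars, 64)(dec_in)

# encoder: keep only its final hidden and cell states
_, enc_h, enc_c = HiddenStateLSTM(64)(enc_emb)
# decoder: regular input plus the encoder states as extra inputs
dec_out, _, _ = HiddenStateLSTM(64, return_sequences=True)([dec_emb, enc_h, enc_c])
preds = TimeDistributed(Dense(num_chars, activation='softmax'))(dec_out)

model = Model(input=[enc_in, dec_in], output=preds)
model.compile(optimizer='adam', loss='categorical_crossentropy')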
Also, note that I have my own (very hacky and inefficient) decoding functions built around these models, which take care of feeding the predicted output of one timestep as input to the next one. (Basically, for _n_ timesteps, I'm just doing _n_ separate predictions, each one a timestep longer than the previous one...) You'll probably have to do _something_ to that effect to efficiently use an encoder--decoder model.
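For illustration, such a decoding loop might look roughly like this (a sketch assuming index-encoded characters, a <START> index, and a two-input model like the sketch above; this is not the actual code from the gist):

import numpy as np

def greedy_decode(model, enc_input, start_idx, max_len):
    # enc_input has shape (1, in_seq_len); the decoder input is re-predicted in
    # full at every step, growing by one known token per iteration
    dec_input = np.zeros((1, max_len), dtype='int32')
    dec_input[0, 0] = start_idx
    for t in range(max_len - 1):
        preds = model.predict([enc_input, dec_input])
        # keep only the prediction for timestep t and feed it in as input t+1
        dec_input[0, t + 1] = np.argmax(preds[0, t])
    return dec_input[0, 1:]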
@mbollmann Thank you for sharing your code!
I saw your post also here: #2968: keras does not currently support feeding output into input
Since you subclassed LSTM, the step function does not change, which means you need to somehow integrate the output y_tm1 (suppose you have, say, another dense layer from h_tm1 to get that) into either h_tm1 or c_tm1. That is different from what Cho (appendix A) does, which is to use a separate weight matrix for y_tm1. Am I correct?
I did a minimal subclass myself, but differently: I changed the step function to compute that y_tm1 and process it with its own weight matrix. Below is the class:
import numpy as np
from keras import backend as K
from keras import activations, initializations
from keras.engine.topology import InputSpec
from keras.layers.recurrent import Recurrent

class DecoderLSTM(Recurrent):
    # Since the initial hidden state is replicated from the input, there should be
    # input_dim == hidden_dim
    def __init__(self, output_dim,
                 init='glorot_uniform', inner_init='orthogonal',
                 forget_bias_init='one', activation='tanh',
                 out_activation='linear', inner_activation='hard_sigmoid',
                 **kwargs):
        self.output_dim = output_dim
        self.init = initializations.get(init)
        self.inner_init = initializations.get(inner_init)
        self.forget_bias_init = initializations.get(forget_bias_init)
        self.activation = activations.get(activation)
        self.out_activation = activations.get(out_activation)
        self.inner_activation = activations.get(inner_activation)
        super(DecoderLSTM, self).__init__(**kwargs)

    def call(self, x, mask=None):
        # input shape: (nb_samples, time (padded with zeros), input_dim)
        # note that the .build() method of subclasses MUST define
        # self.input_spec with a complete input shape.
        input_shape = self.input_spec[0].shape
        # state format: [h(t-1), c(t-1), y(t-1)]
        h_0 = K.zeros_like(x[:, 0, :])
        c_0 = K.zeros_like(x[:, 0, :])
        y_0 = K.zeros_like(x)           # (samples, timesteps, input_dim)
        y_0 = K.sum(y_0, axis=(1, 2))   # (samples,)
        y_0 = K.expand_dims(y_0)        # (samples, 1)
        y_0 = K.tile(y_0, [1, self.output_dim])  # (samples, output_dim)
        initial_states = [h_0, c_0, y_0]
        last_output, outputs, states = K.rnn(step_function=self.step,
                                             inputs=x,
                                             initial_states=initial_states,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             constants=None,
                                             unroll=self.unroll,
                                             input_length=input_shape[1])
        if self.return_sequences:
            return outputs
        else:
            return last_output

    def build(self, input_shape):
        self.input_spec = [InputSpec(shape=input_shape)]
        self.input_dim = input_shape[2]
        self.W = self.init((self.input_dim, 4 * self.input_dim),
                           name='{}_W'.format(self.name))
        self.U = self.inner_init((self.input_dim, 4 * self.input_dim),
                                 name='{}_U'.format(self.name))
        self.A = self.init((self.output_dim, 4 * self.input_dim),
                           name='{}_A'.format(self.name))
        self.b = K.variable(np.hstack((np.zeros(self.input_dim),
                                       K.get_value(self.forget_bias_init((self.input_dim,))),
                                       np.zeros(self.input_dim),
                                       np.zeros(self.input_dim))),
                            name='{}_b'.format(self.name))
        self.V_y = self.init((self.input_dim, self.output_dim),
                             name='{}_V_y'.format(self.name))
        self.b_y = K.zeros((self.output_dim,), name='{}_b_y'.format(self.name))
        self.trainable_weights = [self.W, self.U, self.A, self.b,
                                  self.V_y, self.b_y]
        if self.initial_weights is not None:
            self.set_weights(self.initial_weights)
            del self.initial_weights

    def step(self, x, states):
        h_tm1 = states[0]
        c_tm1 = states[1]
        y_tm1 = states[2]
        z = K.dot(x, self.W) + K.dot(h_tm1, self.U) + K.dot(y_tm1, self.A) + self.b
        z0 = z[:, :self.input_dim]
        z1 = z[:, self.input_dim: 2 * self.input_dim]
        z2 = z[:, 2 * self.input_dim: 3 * self.input_dim]
        z3 = z[:, 3 * self.input_dim:]
        i = self.inner_activation(z0)
        f = self.inner_activation(z1)
        c = f * c_tm1 + i * self.activation(z2)
        o = self.inner_activation(z3)
        h = o * self.activation(c)
        y = self.out_activation(K.dot(h, self.V_y) + self.b_y)
        return y, [h, c, y]

    def get_config(self):
        config = {'output_dim': self.output_dim,
                  'init': self.init.__name__,
                  'inner_init': self.inner_init.__name__,
                  'forget_bias_init': self.forget_bias_init.__name__,
                  'activation': self.activation.__name__,
                  'out_activation': self.out_activation.__name__,
                  'inner_activation': self.inner_activation.__name__}
        base_config = super(DecoderLSTM, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
But by doing this I can only do one layer, and I have to write every layer I want to add myself. Your approach of feeding one input at a time enables stacking any structure of layers, which would be awesome.
How do you think your approach could be modified to have a dedicated weight matrix for y_tm1
? Would that require modifying the step
function?
I have not tried to _decode_ your Decoder yet (process one value at a time and feed output into next input). I am about to try that in a bit, I probably will have questions since I am not good at Keras.
Thanks again!
Since you subclassed LSTM, the step function does not change, which means you need to somehow integrate the output y_tm1 (suppose you have, say, another dense layer from h_tm1 to get that) into either h_tm1 or c_tm1. That is different from what Cho (appendix A) does, which is to use a separate weight matrix for y_tm1. Am I correct?
I'm not sure I understand what you're getting at. The way I understand appendix A in the Cho article (but I just took a quick glance) is that it's simply feeding the model's output at timestep t-1 back into the decoder as input for timestep t. Notice that the decoder equations are almost identical to the encoder equations, except that instead of x_t
they use y_t-1
. This would mean that the "separate weight matrix for y_tm1
" is just the normal input weight matrix of the decoder LSTM, no?
You are right, x(t) _is_ y(t-1). I apologize, I did not communicate well. I meant to refer to the context vector c, which is also fed as input at every time step.
Does your approach accept c as input for every timestep (I don't want to misunderstand your code, so I want to ask you directly)? If so, is it done by concatenating c with y(t-1)? If not, how could it be modified to support c?
Thank you for your time!!
Does your approach accept c as input for every timestep (I don't want to misunderstand your code, so I want to ask you directly)? If so, is it done by concatenating c with y(t-1)? If not, how could it be modified to support c?
The code I posted doesn't do that -- it just does the hidden state transfer part, but a little less sophisticated than in the Cho article, by simply setting h'_0 = h_N
(if N
is the last timestep of the encoder).
So basically, the context vector c
seems to be simply h_N
fed through another tanh activation. My custom layer already takes h_N
as its second input, so I probably wouldn't modify the inputs, but simply do the tanh activation in the decoder layer. (I'm also not quite sure why you even need it...)
However, to modify the layer to also use c
/h_N
in each hidden state update (the last part of the three decoder equations in the Cho article), you indeed have to write a custom step function, I think.
@mbollmann
Just a little update from me:
I managed to build a custom layer derived from LSTM that takes two inputs (actual input + initial hidden state) and returns two outputs (actual output + last hidden state). This way, I was able to build an encoder--decoder model where the only connection between encoder and decoder was the hidden state transfer, and successfully train it in a way that simply wasn't possible with the StateTransferLSTM from seq2seq (on my task, I get 3% accuracy with seq2seq's StateTransferLSTM using broadcast_state(), and 93%+ with my own custom LSTM). I believe that adding the transferred hidden state to the computational graph is the crucial difference here.
What kind of data / problem were you training on when you got these differences in numbers?
@sallamander
What kind of data / problem were you training on when you got these differences in numbers?
Character-based string transduction. But the task isn't important; what's important is that I used a decoder that takes as input only a constant start symbol (at timestep 0) and the model prediction at the previous timestep (everywhere else). In a nutshell:
I suspect seq2seq's StateTransferLSTM has either one or both of these problems, hence the 3% accuracy I got.
@mbollmann
Right, I agree - I was mostly just curious and wanted to play around with the example that you were using.
Yeah, I have a hunch it might be (2), but I'm working on digging into the seq2seq
implementation in the hopes of understanding where it's going wrong (which is what it looks like) and hopefully fixing it up.
For (2), do you mean the decoder? If not, then what transferred hidden state are you getting for the encoder? Or, do you mean that the _encoder_ will never learn anything because the gradients won't properly be passed back in the backprop step, or no real gradients will be passed back, or something like that?
In any case, how would you end up not backpropagating along the transferred hidden state, _unless_ you weren't setting it correctly as the initial hidden state of the decoder? There was some discussion about it being a bad idea to use stateful=True
and set an initial hidden state for the decoder - is this at all tied to your comment about back propagating along the transferred hidden state (seq2seq
does this, and I think this may be part of the issue)? I could imagine that if you set the initial state in the decoder but also set stateful=True
, then in some sense you're not backpropagating along the transferred hidden state, because after the first updates the transferred hidden state won't be seen again. Is this what you were referring to?
@sallamander
For (2), do you mean the decoder? If not, then what transferred hidden state are you getting for the encoder? Or, do you mean that the encoder will never learn anything because the gradients won't properly be passed back in the backprop step, or no real gradients will be passed back, or something like that?
Yes, the latter. In my model, the output of the encoder is never used as _input_ for the decoder, but _only_ for setting the decoder's hidden state, so if this connection isn't added to the computational graph, there effectively _is no_ connection between the two during backprop (and no gradients get passed).
Or that's how I understand it - I must confess that not all the details of how Keras constructs the computational graph are clear to me. I've never done it "manually" in Theano/Tensorflow, and Keras is really good at hiding the gory details away from you. I just went with @braingineer's comment above (https://github.com/fchollet/keras/issues/2995#issuecomment-226597313) who suggests taking the initial hidden state as _a second input_, claiming:
the major reason behind doing it this way is that it adds the conditioning vector you want to be your initial state into Keras' computational graph. This saves a ton of headaches.
Maybe he can explain better than me how exactly this happens and what the actual difference is. (I'd certainly be interested!)
Interesting, thanks!
Or that's how I understand it
That's how I understand it, too 👍
I'm hopeful that I can issue a PR on Keras and get this worked in, but yeah, the details of building it into the computational graph are a little tricky.
@mbollmann Hey again! Couple more questions:
First:
I'm trying to use your custom HiddenStateLSTM layer, but am running into shape assertion errors. I structure my data like this: the actual input is always a sequence (let's say of chars) of a certain length. This goes to the encoder. The decoder input is also a sequence of some (other) length, which at first consists only of a start symbol with padding, and in each subsequent sample one char is added (mock prediction / teacher forcing). The output, similarly, starts out with the first char to be predicted, and subsequent chars are added in each sample.
So, my input has shape (n_samples, in_seq_len,)
for encoder input, (n_samples, dec_seq_len,)
for decoder input, and output shape is (n_samples, dec_seq_len, num_chars)
. Here is the error I run into:
File "seq2seq_lm.py", line 134, in train
self.build_model()
File "seq2seq_lm.py", line 105, in build_model
dec_layer, _, _ = HiddenStateLSTM(64, dropout_W=0.5, return_sequences=True)([dec_layer] + hidden)
File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 489, in __call__
self.build(input_shapes)
File "layers/decoderlstm.py", line 18, in build
assert shape[0] == input_shape[0]
TypeError: 'NoneType' object has no attribute '__getitem__'
This happens in the build function:
def build(self, input_shape):
    if isinstance(input_shape, list) and len(input_shape) > 1:
        input_shape = input_shape[0]
        hidden_shapes = input_shape[1:]
        for shape in hidden_shapes:
            assert shape[0] == input_shape[0]  # fails
            assert shape[-1] == self.output_dim
    super(HiddenStateLSTM, self).build(input_shape)
I pretty much adopted your sample code, I'm not sure if my data is structured the same way as yours though.
Second question:
If we only "manually" feed predicted outputs to the next input, they are not in fact part of the computational graph, correct? I see how the network should still learn in general, since the correct prediction is being taught locally and state is still kept, but in theory the connection of previous output to next input should be symbolic, no? This question is a bit more general though, for now, thanks a lot for providing your code!
@mbollmann Follow-up on number one: this bug actually occurs when I run the code snippet you posted as well, so it seems to be related to the way the model is set up, not the structure of the data. Could you have another look and see if your gist is up to date?
@phdowling I just ran my code snippet again and it produces no such error. What Python/Keras versions are you using?
Re your second question:
I'm not sure if it should be symbolic actually, since for decoding, you need to make a choice which output class to select (and therefore use as input to the next timestep). This can be a simple argmax (which could be expressed symbolically I guess), but you could also do beam search (keeping several options, and only make the final decision several timesteps later) or filter the output classes via some external criteria (e.g. a lexicon of wellformed words/sequences). In that case, a symbolic connection during training couldn't really model the situation during decoding anyway.
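As an illustration of the beam-search variant, a naive sketch over the kind of two-input model discussed above might look like this (end-of-sequence handling is omitted for brevity; all names are placeholders):

import numpy as np

def beam_search_decode(model, enc_input, start_idx, max_len, beam_width=3):
    # each beam entry is (cumulative log-probability, list of token indices)
    beams = [(0.0, [start_idx])]
    for _ in range(max_len - 1):
        candidates = []
        for log_prob, seq in beams:
            dec_input = np.zeros((1, max_len), dtype='int32')
            dec_input[0, :len(seq)] = seq
            # distribution over the next token, taken at the last filled position
            preds = model.predict([enc_input, dec_input])[0, len(seq) - 1]
            for idx in np.argsort(preds)[-beam_width:]:
                candidates.append((log_prob + np.log(preds[idx] + 1e-10),
                                   seq + [int(idx)]))
        # keep only the `beam_width` best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1][1:]  # best sequence, without the start symbol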
@mbollmann I'm using Python 2.7.12 with Keras 1.1.0 (installed from git). The bug I referred to was caused by my own adjustments, sorry for bothering you with that. FYI, here is where I went wrong:
I had to slightly adjust your code to make it Python 2 compatible. Line 16:

input_shape, *hidden_shapes = input_shape

became

input_shape = input_shape[0]
hidden_shapes = input_shape[1:]  # wrong, because input_shape was just written to

which should of course be

input_shape, hidden_shapes = input_shape[0], input_shape[1:]
This fixes what I initially reported. However, I still see an exception after fixing this:
Traceback (most recent call last):
  File "layers/decoderlstm.py", line 114, in <module>
    dec_layer, _, _ = HiddenStateLSTM(64, dropout_W=0.5, return_sequences=True)([dec_layer] + hidden)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 514, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 572, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/engine/topology.py", line 154, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors, mask=input_masks))
  File "/Users/dowling/Development/idp/char-idp/layers/decoderlstm.py", line 68, in call
    input_length=input_shape[1])
  File "/Users/dowling/anaconda/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 832, in rnn
    initial_output = step_function(inputs[0], initial_states + constants)[0] * 0
TypeError: unsupported operand type(s) for +: 'TensorVariable' and 'list'
Here is a gist of my changed code: https://gist.github.com/phdowling/ae731089a1497ab0dd8177114ad60881
The other change I made was here: https://gist.github.com/phdowling/ae731089a1497ab0dd8177114ad60881#file-hiddenstatelstm-py-L110-L111 , but I don't think this is an incorrect translation of the Python 3 code. Does this gist throw an error for you? If so, I guess the problem might be a Python 3 / 2 issue somewhere.
This was again my mistake, I made the same error as above in another spot and forgot about it. I updated my gist, the example is working now!
Thanks also for your insight on the second point!
@mbollmann When you train your model, do you feed it "partial" prediction states in the training phase? By that I mean, for a generated output of length 5, do you feed the model 5 samples that include the iterations of the output generator, or do you only feed the final state?
During evaluation the intermediate steps are necessary of course to feed the partial results, but I'm thinking during training, since we do teacher forcing, it should be enough to feed the complete expected output with the corresponding ground-truth decoder inputs, right?
@phdowling Ah, cool, I saw you made that adjustment but didn't notice the mistake either!
Yes, I only feed the final (ground-truth) state during training. I am aware of more complex approaches to training such a model (e.g. Sequence-to-sequence learning as beam-search optimization), but have not looked into them yet.
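For concreteness, preparing the teacher-forcing arrays for one target sequence could look like this (a sketch assuming index-encoded characters with reserved <PAD> and <START> indices and one-hot targets):

import numpy as np

def make_decoder_arrays(target_indices, max_len, num_chars, start_idx=1, pad_idx=0):
    # decoder input:  <START> y_1 ... y_{n-1}  (targets shifted right by one)
    # decoder target: y_1     y_2 ... y_n      (one-hot encoded)
    # assumes len(target_indices) <= max_len
    dec_input = np.full((max_len,), pad_idx, dtype='int32')
    dec_input[0] = start_idx
    dec_input[1:len(target_indices)] = target_indices[:-1]
    dec_target = np.zeros((max_len, num_chars), dtype='float32')
    for t, idx in enumerate(target_indices):
        dec_target[t, idx] = 1.0
    return dec_input, dec_target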
@mbollmann
Also, note that I have my own (very hacky and inefficient) decoding functions built around these models, which take care of feeding the predicted output of one timestep as input to the next one. (Basically, for n timesteps, I'm just doing n separate predictions, each one a timestep longer than the previous one...) You'll probably have to do something to that effect to efficiently use an encoder--decoder model.
Would it be possible for you to share the code that does the decoding part after training the model?
@mbollmann I have read the whole discussion... passing C as input doesn't help... but can you suggest an encoder-decoder framework in Keras?
@suragnair Sorry I totally forgot to respond. Do you still need this? I'd look into extracting the relevant parts of my code then.
@harikrishnavydana What do you mean by "doesn't help"? The way I've discussed it here works for me and that's what I still (successfully) use for encoder-decoder. I don't know of a more convenient framework for implementing this, unfortunately.
@mbollmann
In an encoder-decoder framework, how do you use the output of the present state as input to the decoder in Keras?
@harikrishnavydana AFAIK it's not possible except by using a workaround like I've already explained above and in this post: https://github.com/fchollet/keras/issues/2968#issuecomment-226727979
@harikrishnavydana @mbollmann Don't know if it's worth noting, but I've got a PR open for this: https://github.com/fchollet/keras/pull/3947
It's worth doing because it's a key feature missing in Keras RNN architectures... being able to use the output of the previous time step in predicting the present time step is also a vital feature for many sequential models. @mbollmann @sallamander
@mbollmann @sallamander @fchollet @RamaneekGill
How do we get the Keras version with the broadcasting option included in the RNN?
@harikrishnavydana You can clone the Keras repository from my fork.
@sallamander By broadcasting the hidden state and using a batch size of 1, can I manually create a depth-first implementation of a Keras RNN... is this the workaround that was mentioned?
@harikrishnavydana the implementation I pointed you to doesn't require a batch size of 1, although if I understand correctly you want both hidden state broadcasting and something like input feeding (e.g. during training, you feed the predicted y_{t-1}
as input for y_t
, instead of the actual)... is that right?
@sallamander Yes, that is actually what I am looking for. Can this be done with this fork?
@harikrishnavydana My fork offers the broadcasting of hidden state, and if you restrict yourself to a batch size of 1, then I believe you could do it, yes.
The reset_states method of the recurrent layer in Keras allows passing the initial hidden state to the model. Specifically for LSTM, it allows setting the initial hidden state h(t) and the cell state C(t).
"You can specify the initial state of RNN layers numerically by
calling reset_states
with the keyword argument states
. The value of
states
should be a numpy array or list of numpy arrays representing
the initial state of the RNN layer"
as written in the layers/recurrent.py in the Keras Source Code
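In code, that numeric variant might look roughly like this (a sketch; the layer has to be stateful for reset_states to be usable this way, and the names and shapes are placeholders):

import numpy as np

# `decoder_lstm` is assumed to be a stateful LSTM layer of an already-built model
batch_size, units = 32, 256
h0 = np.random.random((batch_size, units))
c0 = np.random.random((batch_size, units))
# LSTM keeps two states (h and c), so a list of two arrays is passed
decoder_lstm.reset_states(states=[h0, c0])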
For an LSTM unit, there are two states, one being the hidden state and the other the cell state:

lstm_output = LSTM(units=units, return_sequences=False)(x, initial_state=[your_initial_state, your_initial_state])

For a GRU unit, there is only one state:

gru_output = GRU(units=units, return_sequences=False)(x, initial_state=your_initial_state)

your_initial_state should have the same shape as the output shape of the LSTM unit.
For Keras 2.0.6, it works.
@THUfl12 @samre12 So, if I am trying to implement a Seq2Seq model and pass the encoder state to the decoder, the code should look something like this?
Encoder = LSTM(units=HIDDEN_STATE_SIZE, return_state=True)(x)
Decoder = LSTM(units=HIDDEN_STATE_SIZE, return_sequences=True)(Encoder[0], initial_state=Encoder[1:])
@rohit-gupta The documentation suggests that
"You can specify the initial state of RNN layers symbolically by calling them with the keyword argument initial_state. The value of initial_state should be a tensor or list of tensors representing the initial state of the RNN layer."
I believe that this translates to what you have suggested, though I haven't tried it explicitly.
I am looking for a Seq2Seq model for encoding and decoding. How do I get the last state from the LSTM encoder and set it as the initial state for the LSTM decoder? @rohit-gupta, does your suggested code work?
@ramakrse Something like this should work:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

inputs = Input(batch_shape=(8, 16, 1024))
encoder = LSTM(256, return_state=True, return_sequences=True)
outputs = encoder(inputs)
output, state = outputs[0], outputs[1:]
decoder = LSTM(256, return_sequences=True)(output, initial_state=state)
# Dense needs an output size; num_words is a placeholder for the target vocabulary size
num_words = 10000
output_words = TimeDistributed(Dense(num_words, activation='softmax'))(decoder)
model = Model(inputs, output_words)
@rohit-gupta This would only work for target sequences with the same length as input sequences. Also this does not appear to implement canonical seq2seq.
Here's my own implementation, which as far as I can tell implements canonical seq2seq. I haven't tried checking that it works in practice, as this isn't my domain and this was just a quick aside for me. I will flesh it out as an example script later on when I have time.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed
import numpy as np

num_encoder_tokens = 100
num_decoder_tokens = 100
encoder_seq_length = None
decoder_seq_length = None
batch_size = 128
epochs = 3

# Dummy data
input_seqs = np.random.random((1000, 10, num_encoder_tokens))
target_seqs = np.random.random((1000, 10, num_decoder_tokens))

# Define training model
encoder_inputs = Input(shape=(encoder_seq_length, num_encoder_tokens))
encoder = LSTM(256, return_state=True, return_sequences=True)
encoder_outputs = encoder(encoder_inputs)
_, encoder_states = encoder_outputs[0], encoder_outputs[1:]

decoder_inputs = Input(shape=(decoder_seq_length, num_decoder_tokens))
decoder = LSTM(256, return_sequences=True)
decoder_outputs = decoder(decoder_inputs, initial_state=encoder_states)
decoder_outputs = TimeDistributed(
    Dense(num_decoder_tokens, activation='softmax'))(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Training
model.compile(optimizer='rmsprop', loss='mse')
model.fit([input_seqs, target_seqs], target_seqs,
          batch_size=batch_size, epochs=epochs)

# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
#    and a "start of sequence" token as target.
#    Output will be the next target token
# 3) Append the target token and repeat

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs = decoder(decoder_inputs, initial_state=decoder_states)
decoder_model = Model([decoder_inputs] + decoder_states, decoder_outputs)

# Dummy data
input_seqs = np.random.random((1000, 10, num_encoder_tokens))
target_seqs = np.random.random((1000, 1, num_decoder_tokens))

# Sampling loop for a batch of sequences
states_values = encoder_model.predict(input_seqs)
stop_condition = False
while not stop_condition:
    output_tokens = decoder_model.predict([target_seqs] + states_values)
    # sampled_token = ...  # sample the next token
    # target_seqs = ...    # append token to targets
    # stop_condition = ... # stop when "end of sequence" token is generated
    break
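Purely for illustration, the commented-out sampling steps could be filled in along these lines, assuming the decoder model is extended so its output also goes through the softmax Dense layer from training (i.e. output_tokens has shape (batch, t, num_decoder_tokens)); end_token_index and the length cutoff are made up:

end_token_index = 0      # hypothetical index of the "end of sequence" token
max_decoder_length = 50  # hypothetical length cutoff

states_values = encoder_model.predict(input_seqs)
stop_condition = False
while not stop_condition:
    output_tokens = decoder_model.predict([target_seqs] + states_values)
    # sample the next token greedily from the last timestep
    sampled = np.argmax(output_tokens[:, -1, :], axis=-1)
    # append the sampled token (one-hot encoded) to the target sequences
    next_step = np.zeros((target_seqs.shape[0], 1, num_decoder_tokens))
    next_step[np.arange(target_seqs.shape[0]), 0, sampled] = 1.0
    target_seqs = np.concatenate([target_seqs, next_step], axis=1)
    # stop when every sequence has produced the "end of sequence" token
    # or the length cutoff is reached
    stop_condition = (np.all(sampled == end_token_index)
                      or target_seqs.shape[1] >= max_decoder_length)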
Thanks very much for this snippet.
Note that it's implementing "teacher forcing", i.e. feeding the correct labels to the decoder during training, if I'm reading correctly. This is known to hurt generalization, and is not the canonical seq2seq, I think, where the decoder is conditioned on its own previous output, and can't see the correct labels.
Is there a way to modify the code to do that?
@antishok Well, current best practice is to use Scheduled Sampling (ref: https://arxiv.org/abs/1506.03099), which essentially means you have an additional hyperparameter that decides what fraction of inputs will be trained using the correct values (teacher forcing) and what fraction will use the inferred values. Actually, if you can't implement scheduled sampling for some reason, simply using teacher forcing works fine.
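As a very rough illustration of the idea (not the per-token procedure from the paper), one could flip a coin per batch and, with a probability that decays over epochs, replace the teacher-forced decoder inputs with the model's own greedy predictions. Everything below is a hypothetical sketch assuming index-encoded decoder inputs and a two-input encoder-decoder model as discussed earlier in this thread:

import numpy as np

def scheduled_sampling_inputs(model, enc_batch, dec_batch_true, epoch, k=10.0):
    # inverse-sigmoid decay of the teacher-forcing probability (cf. the paper)
    teacher_forcing_prob = k / (k + np.exp(epoch / float(k)))
    dec_batch = dec_batch_true.copy()
    if np.random.rand() > teacher_forcing_prob:
        # one forward pass with ground-truth inputs, then feed the model's own
        # argmax predictions (shifted by one step) back in as decoder inputs
        preds = model.predict([enc_batch, dec_batch_true])
        dec_batch[:, 1:] = np.argmax(preds, axis=-1)[:, :-1]
    return dec_batch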
@fchollet Thanks for developing the Seq-2-Seq script, I am currently working on a translation related project and I can clean it up and send as a PR (adding Seq2Seq example) over the weekend if that would be useful.
@fchollet thanks for the sample code. It is very helpful to understand how to use keras for seq2seq learning.
I noticed that you wrote a post about it as well. Can you explain why you use TimeDistributed
layer in the above code but not in the blogpost (https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)?
@jerrypaytm Does the code without TimeDistributed work for you ? I couldn't get it to work without TimeDistributed.
(Which TBH makes sense, if the output is variable length, you will need TimeDistributed to make it work)
@rohit-gupta It works for me without TimeDistributed. I haven't tried it with TimeDistributed yet.
@fchollet and @rohit-gupta can you confirm that the seq2seq model can be loaded? I saved the model as suggested in this script but it fails to load the model.
It is a ValueError
when I try loaded_model = load_model('s2s.h5')
Layer lstm_15 expects 1 inputs, but it received 3 input tensors.
Never mind, I was using keras==2.0.6; upgrading to the latest version fixed the model loading issue.
@mbollmann Thank you for sharing your code!
In my project I need to set the initial state at each batch. Is it possible to do that with your HiddenStateLSTM class?
@fchollet How can your code be changed for target sequences with a different length from the input sequence? E.g. the target being longer than the input sequence?