Assume we are trying to learn a sequence-to-sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no option to pass a mask to the objective function. Won't this bias the cost function?
We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values?
You could use a mask to hide your padded values from the network, and then discard the masked values in your sequence output. Currently masking is only supported via an initial Embedding layer, though. See: http://keras.io/layers/recurrent/
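For reference, a minimal sketch of that Embedding-based masking, assuming the 0.x-era Keras API used elsewhere in this thread and that max_features, embedding_size and hidden_size are already defined:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU

model = Sequential()
# mask_zero=True makes downstream layers skip timesteps whose input id is 0,
# so 0 has to be reserved for padding and never used as a real word id
model.add(Embedding(max_features, embedding_size, mask_zero=True))
model.add(GRU(embedding_size, hidden_size, return_sequences=True))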
I'm a little new to recurrent networks. When Eder talked about the sequence-to-sequence map, it only reminded me of the char-level LSTM (http://karpathy.github.io/2015/05/21/rnn-effectiveness/). In this case, even if we can discard the masked values in the sequence output, the padded values still have an effect on the parameters of the model itself. So is it enough to just discard the masked values? Again, as Eder has asked, won't this bias the cost function?
Maybe this issue is of your interest #382
This worked for me. Padding the inputs and then the outputs, and adding special sequence start and sequence stop symbols to book-end each sequence, then the following model structure:
embedding_size = 64
hidden_size = 512
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size))
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))
model.compile(loss='mse', optimizer='adam')
If you have a sequence stop symbol, it should learn when to stop outputting non-zero values, and will output zeros thereafter. It may not be ideal, but it works within the current framework. I also tried replicating the output to the maxlen width (including the stop symbol) during training, and then just took the first valid sequence at test time.
btw, JZS1 is an RNN like a GRU or an LSTM, each of which could be used here instead in both the encoder and decoder.
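A rough sketch of that preprocessing (the reserved ids, MAX_LEN, and the input_sequences / output_sequences lists are illustrative assumptions): reserve 0 for padding, book-end with start/stop markers, then pad to a fixed length with Keras's pad_sequences helper.
from keras.preprocessing.sequence import pad_sequences

PAD, START, STOP = 0, 1, 2      # reserved symbols; real word ids start at 3
MAX_LEN = 20

def book_end(seq):
    # wrap a sequence of word ids with the start/stop markers
    return [START] + list(seq) + [STOP]

X = pad_sequences([book_end(s) for s in input_sequences], maxlen=MAX_LEN)
Y = pad_sequences([book_end(s) for s in output_sequences], maxlen=MAX_LEN)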
Sounds like a good idea, but note that you are forcing your model to learn something other than your original problem.
I think we can solve this problem with a masking layer after each regular layer, plus allowing the loss to be a custom function. That way, instead of averaging the cost with .mean(), we divide it by the number of non-zero elements in each time series. This indiscriminate averaging is where a lot of the bias comes from as well.
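To illustrate the idea (plain NumPy only, not the actual Keras objective API): average the per-timestep error over the non-padded positions of each sequence instead of over everything.
import numpy as np

def masked_mse(y_true, y_pred):
    # mask is 1 where the target timestep contains any non-zero value
    mask = (np.abs(y_true).sum(axis=-1) > 0).astype('float32')        # (samples, timesteps)
    per_step = ((y_pred - y_true) ** 2).mean(axis=-1)                 # (samples, timesteps)
    # divide each sequence's summed error by its number of non-padded timesteps
    per_seq = (per_step * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)
    return per_seq.mean()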
@fchollet is there any chance that we can get "loss" to check if its input is callable? I really didn't want to have to add new stuff to my code repo every time I needed something custom made. I can write a PR for that if there is a general interest. Let me know.
PR #446 is relevant here. Now all we need is for the cost functions to use that mask in their calculations.
@simonhughes22 Hello, I have been working with your code snippet (from a previous discussion, with example/toy data). While the code works, I am confused if it is doing what it is supposed to. I am new to Keras and deep learning, so please do bear with me.
As far as I understand, the idea is to have an 'encoder' process sequence X, and after the last time step of the sequence is processed, a 'decoder' starts to predict the new sequence Y.
In the code you provided, how does the model know that sequence X is complete and now it should start predicting Y?
@gautamb85 the inputs are padded; the encoder RNN simply goes along the entire input, updating its hidden state accordingly until it hits the end of the array, and then outputs a vector. That is then fed into the decoder. Btw, they just added masking to the loss function (see the recent commit history), so I'd make sure you are masking the loss function (you'll have to dig through the code or documentation to figure that out).
@simonhughes22 Thanks for the reply. I had a couple more questions.
I would greatly appreciate clarification regarding these points.
@simonhughes22 @gautamb85 Did you guys try out the cost function masking proposed by #451?
I just mentioned this over in #451, but can't you use the sample_weight parameter to fit() and pass in 0 weight to the meaningless outputs?
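A hedged illustration of that suggestion: build per-timestep weights that are zero at padded positions and pass them to fit(). Whether per-timestep weights require sample_weight_mode='temporal' in compile() depends on the Keras version, so treat this as a sketch rather than a recipe (Y_ids is assumed to be the zero-padded integer target matrix, Y_onehot its one-hot version).
import numpy as np

weights = (Y_ids != 0).astype('float32')       # (samples, timesteps): 0.0 at padding, 1.0 elsewhere

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')   # needed for per-timestep weights in later Keras versions
model.fit(X, Y_onehot, sample_weight=weights)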
Hey,
I'm new to keras and I have a simple question:
Why do you use the mse objective in model.compile(loss='mse', optimizer='adam')?
Wouldn't it be more appropriate to use categorical_crossentropy, since you are using softmax in model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))?
@benjaminklein for a while there was an issue in theano + keras with binary cross entropy and this kind of model, so I used MSE instead. Now I use binary cross entropy, as I have a multi-class, multi-label classification problem (each word can have 0 to many classes). If you have a more vanilla multi-class problem then yes, categorical_crossentropy should work better. However, I haven't tested that on a TimeDistributedDense output, so you'd need to verify it is expecting a single category per output token, not across all tokens: the output is 3D, not 2D, so you have len(tokens) categorical cross entropy calculations. Binary cross entropy just works per label; it doesn't compute a distribution over labels, so for that the output shape doesn't matter.
@simonhughes22 Thank you! Also could we change your code to use Graph instead of Sequential such that we'll have two inputs. One for the first sequence and another one for the second sequence?
The motivation is that in many papers about encoder-decoders, the decoding phase uses both the last hidden layer from the first sequence and the previous word of the second sequence. In your code we are only using the last hidden layer from the previous sequence.
@EderSantana @simonhughes22 Have not yet had a chance to try it. Will let you know as soon as I do.
@fchollet I have been reading some keras posts about 'stateful' RNNs. If I understand correctly, the hidden state of the recurrent layer is reset for every time_step of a sequence. (this appears to be the case based on outputs_info for various classes in recurrent.py)
@gautamb85 I thought the issue is that it is reset for every row of input, meaning you can't test it easily by feeding its output predictions back in as inputs (you can, but you are re-feeding the entire input sequence plus the latest prediction for every subsequent time step, which is a little slower than if it were stateful). If the RNN reset itself between timesteps it wouldn't work; AFAIK it maintains state across a row (a sequence of timesteps) but resets itself once the row is processed. You can make it take its output as input as I described, it's just slower than models that allow you to remember the state following a prediction.
Note for the example above, I am reading in one sequence, converting to a hidden state, and then predicting a whole second sequence, so you don't have the issue I mention here. However, you may get better results by training a model to predict the next word instead of the next sentence, and feeding each predicted word in as input to predict the next word to generate a sentence.
@benjaminklein as it's predicting a full sequence as output, it is remembering the previous word as it predicts the output sequence, by retaining its hidden state across the input sequence. What is repeated is the encoded representation of the entire previous sentence, but for each word it is predicting, it is also feeding in the hidden state from the previous word, as that's how RNNs work. What you are describing is a slightly different type of model where you are predicting the next word and not the next sentence, and then adding that to the existing input and making a new prediction. You can do this with keras too very easily: just remove the last 3 layers from the model above and train it to predict the next word (or character). You'll have to write a bit of code to feed the output back in as input though.
@simonhughes22 Thank you for the clarification. That was my impression. (How could it even work if it was being reset between time steps?) I am still a little confused about testing it in generative mode.
You touched on exactly what I am trying to do, and I am hoping you could help me on some of my concepts.
Taking the machine translation problem in Sutskever's paper as an example, first an English sentence (sequence 1) is converted to a hidden state, then the 'decoder' starts to predict each word of the French sentence (sequence 2). Thus the translated sentence is generated word by word. Is this correct?
In this case, the 'decoder' is essentially a conditional language model (word level), conditioned on the English sentence, i.e. sequence 1. Thus the target for a given timestep is the input for the next one.
For training such a model, after reading in sequence 1, the decoder is provided with the 'true' French word at each timestep (not the prediction). At test time (as you mentioned), in order to feed the prediction back in to predict the next French word, the entire sequence (English + French?) would need to be read in order to predict the next word.
If all that is correct, would I need a graph structure? A pseudo code / flowchart description of the network architecture would be much appreciated.
Thanks in advance!
@gautamb85 No, you can use the model I listed above with the English sentence as input and the entire French sentence as output. The RNN will maintain state across each timestep as it predicts the output sentence, no extra work required on your behalf. You will however need to one-hot encode and zero pad the output sequence (the French sentence) and have it do a softmax over all possible words for the output at each time step. The ys then are 3D: each row is a matrix whose height is the number of French words and whose width is the number of time steps.
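A NumPy sketch of building those 3D targets, using the (samples, timesteps, vocab) ordering that later posts in this thread note Keras expects (Y_ids is assumed to be a zero-padded (samples, timesteps) array of word ids, with 0 reserved for padding):
import numpy as np

def to_one_hot_3d(Y_ids, vocab_size):
    n_samples, n_steps = Y_ids.shape
    Y = np.zeros((n_samples, n_steps, vocab_size), dtype='float32')
    for i in range(n_samples):
        for t in range(n_steps):
            # padded timesteps become a one-hot on index 0, the reserved padding symbol
            Y[i, t, Y_ids[i, t]] = 1.0
    return Y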
When I mentioned feeding the output as input, that's only if you want to train a language model to predict a word at a time, and then use that to generate text as a generative model. However, as in the example above, you can have it generate whole sequences for you from each input sequence.
@simonhughes22 I have a model training (graphemes to phonemes). I am writing a beam search to see if it's actually learning something useful.
In Sutskever's paper, the decoder is described as a conditional language model (conditioned on the previous sentence). In the model you proposed, what is the input to the decoder RNN at each time step?
If I wanted to design a model where, at every time step of the decoder, the input is the target of the previous timestep (and the output sequence is generated word by word), what modifications to the model would I need to make?
@gautamb85 I think you may be misunderstanding how that model is built. Each time step is a word (although it could be a character, a phoneme or whatever). However, each input row is an entire sequence, zero-padded to the left to the length of the longest sequence, and each output row is also a sequence, zero padded.
If you want a more traditional RNN like model, look at the Passage repo, however, that has less functionality and can't be used for tagging models (unless you predict word by word as described). When I say tagging models, I mean a one to one mapping of word or character to a tag.
@simonhughes22 When running your code I'm getting:
/usr/local/lib/python2.7/dist-packages/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
rval = __import__(module_name, {}, {}, [module_name])
Any ideas?
Thank you!
It's a warning.... I've had that before, but I haven't noticed any issues. Theano is very hard to debug, so I am not able to figure out if it is serious, but every model that gave me that warning was still able to learn effectively on the data. Maybe @fchollet can shed more light on it.
@simonhughes22 Firstly, thank you again for all your help so far. I was confused about the model as I thought that training was being done by teacher forcing, i.e. the actual targets were being fed to the decoder at each time step. I guess the model could be trained that way, but it would not generalize as well.
I am trying to replicate the approach (standard sequence to sequence task) from here : http://arxiv.org/abs/1506.00196
I took your model and replaced the last layer:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size, mask_zero=True))
model.add(GRU(embedding_size, hidden_size))
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(GRU(hidden_size, hidden_size, return_sequences=True))
model.add(Dense(hidden_size, phn_output))
model.add(Activation('softmax'))
To get the prediction at each time step.
I was able to train the model. However, after a certain number of training iterations the loss started oscillating (not sure how to interpret / prevent that), even though for the first few iterations the accuracy and loss on the validation set were behaving properly.
I think it may have to do with the size of my hidden layers etc. I also only took 30000 sequences for training.
I am now trying to figure out how to do a beam search to find the best phonetic transcription of a test grapheme sequence.
I am a little confused on how to go about setting this up. Any advice would be much appreciated.
@gautamb85 if you want to combine this approach with a re-ranking solution (for instance using beam search), I would have the model just output the probabilities over the phonemes for every time step. You don't need to feed the predictions in at each step: for each input, the model will output a probability distribution over every phoneme for every time step. You can then use that table of probabilities to do something like beam search (although I'd recommend also trying a dynamic programming approach - see how CRF models work; you can fit this model into that framework). However, as the model is keeping track of its previous prediction across each time step, this should not be necessary. Note that for each input, the model will iterate over each timestep and output a prediction, giving you a list of predictions per timestep without you needing to feed them back in. However, doing so may or may not improve matters.
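A rough beam-search sketch over that per-timestep probability table (probs is assumed to be the (timesteps, n_symbols) softmax output for one input sequence). Note that with purely independent per-step probabilities this reduces to a per-step argmax; beam search only pays off once you add a sequence-level term (e.g. a phoneme language model or transition scores), which is where the CRF / dynamic programming comparison above comes in.
import numpy as np

def beam_search(probs, beam_width=5):
    beams = [([], 0.0)]                        # (partial symbol sequence, log-probability)
    for t in range(probs.shape[0]):
        candidates = []
        for seq, score in beams:
            # only expand with the top beam_width symbols at this timestep
            for sym in np.argsort(probs[t])[::-1][:beam_width]:
                candidates.append((seq + [int(sym)], score + np.log(probs[t, sym] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                         # best-scoring symbol sequence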
In terms of the oscillating errors, that normally means the learning rate is too high and the model is having trouble converging. I'd advise using something like adam or adagrad to optimize, as these approaches are very good at setting and adjusting the learning rate for you, and I've never had an unstable model with them. That said, I've hit points where the test set performance oscillates, and that often means the accuracy may not improve much further. At that point the training accuracy is normally still improving, but the model is starting to overfit.
The oscillating may be due to dropout also. I'd advise not using dropout until you've got a model that does very well on the training data, then experimenting with it to reduce over-fitting. My dataset is pretty noisy, which I think is regularizing the model to some degree, so dropout seemed to hurt more than help, but normally it is advantageous. First you want to get your model to overfit very well on the training data; that confirms the model can learn effectively on the data. Then start to address over-fitting.
Hi Simon,
You seem to have had success with a sequence to sequence NLP task. I'm struggling and wondered if you could help.
I have a data set of sentences, which are sequences of Socher's pre-trained word vectors. These sequences map 1-to-1 to another series of vectors for each sentence, which I have calculated to describe some aspects of the sentences. These vectors are such that their values add to one. For instance, an example mapping would be,
wordvectors["headache"] (50 dim vector) -->> 20 dim vector, [0.0, ..., 0.0, 0.25, 0.25, 0.25, 0.25, 0.0... 0,0]
I am training on a small subset of my data (10,000 points) to get things working, but I cannot overfit even with this small data. In fact, I cannot even get accuracy above 0.2. For instance, running a training epoch on a version of the model you posted above (with the embedding layer removed, as my inputs are already vector embeddings),
embedding_size = 50
hidden_size = 512
output_size = 20
maxlen = 60
model = Sequential()
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="softmax"))
model.compile(loss='mse', optimizer='adam')
I have padded my inputs and outputs with zeros to fit the maxlen dimension, but I have not added these "stop symbols" you wrote about. What kind of vector would I be adding as a stop symbol? Are they essential to train correctly?
Training with the model above does yield a set of probabilities for each word, so that they sum to one, but the Keras accuracy measure will not go past 0.15. I don't think it is able to fit the data I have.
What do you think is my bottleneck for training towards these vectors? Any ideas?
Thanks!
@cjmcmurtrie Sorry for the late reply, but I've gotten a lot of requests on this posting, and I've also been out of the country. Note that while I've gotten it to learn something, I haven't tried using it to solve any problems, and I had to run it for a long time to get it to learn something useful. So it may not be the best approach for your particular problem (which I don't know enough about to suggest alternatives, although you could try the skip-thought sentence vectors from a pretty recent paper: https://github.com/ryankiros/skip-thoughts).
Regarding the above - the stop symbol in my approach was just a special word, which would then get mapped to its own embedding via the embedding layer, and the network would learn that this meant to stop outputting symbols. Sending in a zero-length vector might be sufficient for this, or some randomly initialized vector. Why are you using Socher's vectors? Those are likely fine-tuned for sentiment analysis (or are you using the GloVe vectors?). You are probably going to get the best performance by using a graph model and combining fixed vectors from the GloVe or word2vec vectors (the latter I've had some success with using RNNs in keras) with an embedding layer that's able to learn its own vectors. I haven't had a chance to try that yet, but in theory that should work best from what I've read. In keras that would be achieved by having one input layer read hard-coded, fixed, pre-trained vectors, and merging that (concat) with an embedding layer where it can learn its own vectors.
Also, if you are learning a 1-1 mapping, then I would use a different model. The model above is meant for cases where you have different length inputs and outputs. If you have a tagging problem, where the length of the inputs matches the outputs, you can use a much simpler model:
model = Sequential()
model.add(JZS1(embedding_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="sigmoid"))
model.compile(loss='mse', optimizer='adam')
You might have to check that some of the sizes are correct, I just wrote that. If you are mapping to a second sequence of vectors that should work fine. Note that the output is 3D (#rows, vector_len, max sequence length), and you'll want to think about an appropriate loss function. If it's mapping to another set of vectors, mse should be fine. However, if you have some target symbols, you would be better off mapping to those and using a softmax with categorical cross entropy. This model is much simpler and should work better (and I have gotten a similar model to do quite well on a supervised task).
Get that working then experiment with the Y-shaped model where you merge pre-trained fixed vectors with ones the model can learn from scratch.
Firstly, this thread has been a major help for me. Thank you everyone! Shout out to @simonhughes22
One thing I have read extensively about (at least for NLP) is that for your input, you want to reverse the order of your sequence. You can read more here: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
For example, if you are coding in "the dog barked loudly" as an input vector, it is better to write in "loudly barked dog the". This may help learning. I think the idea is that the next sentence that you predict has more to do with the END of the input sentence rather than the BEGINNING of the input sentence.
Forgive me as I am a beginner, but I almost always use 200- to 300-dimensional vectors for Word2Vec. Fifty or 20 dimensions for a word seems incredibly low to me. Thoughts on this? I am aware of the curse of dimensionality.
My goal is to predict the sentence given the first sentence:
Input: Give First Sentence as Sequence --> Output: Yield next sentence as sequence
Here are a few more questions:
Question 1: We do not need to worry about the output mask on the compile layer, correct? Following this thread: https://github.com/fchollet/keras/pull/451, it seems that fchollet made the output layer automatically masked (if we initially mask the embedding layer). Therefore, if we pad the y output with zeros, we should be good, correct?
Question 2: I really struggle with formatting the y_train data. Right now, I cluster my words and assign unique ids to each word within each cluster. Therefore, each word has two numbers associated with it. (I clustered the words by applying Word2Vec plus k-means -- more info here: https://redd.it/3psqil.)
I know it's been mentioned that y_train is formatted in 3 dimensions, but how is this possible? Currently this is what I input as my 2D X_train into the embedding layer in Keras. I'm hoping to do something similar for the y_train as well:
Dimension 1: number of samples
Dimension 2: number of timesteps
12 3 4 5 6 3 0 0 0 0 0
6 5 4 23 3 5 1 4 2 0 0
Dimension 3: one-hot encode each of the integers? I just thought there would be a more efficient way to do this. Or another thought is that you guys are using the third dimension to vectorize each word? This would explain why you're not doing 200 or 300 word dimensions but rather 20 or 50? But then the RNN would have to predict vectors? This is what confuses me.
Question 3: If we are indeed masking the zeros, then why do we need to tell the sequence where it stops and starts? I don't mind appending a "sentence start" and a "sentence end" number to each sequence. I just don't understand _why_ we should do this if we are indeed masking?
For reference, this is my current model:
max_features = (len(chars)) #this number (always an integer) is either a cluster number or a word id number, both of them have a maximum of ~400
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=512, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_variables))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(max_features, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
@LeavesBreathe I've actually found smaller vectors to work better (64 and 128 - note the binary sizes; I think this helps performance by assisting theano with copying chunks of data to and from the GPU), likely as that's fewer parameters to learn, and too many can make learning hard for a neural network. Cross validation would help you determine a good size; try varying it in magnitude (32, 64, 128, 256) rather than on a linear scale, as the relationship between these values and performance tends to be more exponential than linear in nature.
Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this. Another common strategy is to mirror the inputs, so you duplicate the input in reverse too, allowing the network to get the best of both worlds. There's a name for that approach, but it escapes me right now. Again, be wary of the zero padding: you'd want it on the outside of the mirrored input, not the middle, I believe, otherwise the RNN hidden state may get reset once it hits the zeros, depending on what it learns to do in this situation. That may not happen, you'd have to see.
Qu 1. - I haven't played with this, but AFAIK if you zero pad your y's you should be good. You seem to be writing a lot of custom code. Keras has utilities for zero padding, as well as for determining ids for words, so I'd rely on those rather than hand rolling it all, as they've likely been well tested by either unit tests or the community, and are known to work with the package.
Qu 2. The number and format of the Y's for a number of these networks is, I think, one of the biggest sources of confusion when doing deep learning with one of these libraries, although I think that's more to do with the complexity of the problem; keras makes it about as easy as it could be. The output is 3D - (number of samples, size of each vector, length of the sequence). This may seem confusing, but think about it like this. Using the time distributed dense above, we are making a prediction for each time step. For each time step you are predicting a word, so that's a one-hot vector of length == |Vocab|. But we have 'max sequence length' number of time steps. So the dimensions have to be the number of rows, the number of word labels to choose from at each time step, and finally the number of time steps (max sequence length). Anything less would not be possible, as for each row you are predicting a word for each time step (which is the length of the output sentence in this case). HTH. Put another way, say you have a vocab of 10k words and the max sentence length is 100. For each row, you need to make 100 predictions (some zero if less than the max sequence length), and for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k x 100, so overall it's 3D.
Qu 3. - You don't absolutely need the start and stop symbols. However, when you are doing sequential prediction, each prediction for each timestep is conditioned on the previous inputs and outputs. When you are at the start of the sequence, you have no previous inputs to condition on. The distribution of start symbols is not random, however: certain words or labels tend to be more likely at the start of a sentence, such as the word 'The' or 'However', and so do their labels, such as POS tags if that's what you are trying to predict. It's unlikely a sentence will start with a noun, such as 'antelope'. The same is also true for the last word in a sentence or label in a sequence. Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences. It's late for me, so I hope I am making sense. The other reason is that with this sort of sequential prediction, as you have a fixed-size sentence with zero-padding, you need to help the model know when to start outputting zeros. Without outputting the stop character I was seeing it have problems learning this: it would just keep repeating the same word sequence, or always output zeros. LSTMs and variants are stateful, and so this sort of signal can help them learn to switch states.
Looking at your model structure, I'd recommend keeping it simple and having one LSTM in the decoding layer. Although I've heard of people having success with stacking RNNs, I haven't personally found that to do better; it usually under- or over-fits for me when I try it. Technically you also don't need the Activation layer, as you can specify the activation function in the dense layer, but use whatever is easier to read and understand.
I tend to find better performance from the simpler GRU and JZS1 RNNs than from LSTMs. These models are a little simpler than the LSTM.
First of all, a huge thanks for all the details. This response had a lot of useful insights.
I've actually found smaller vectors to work better (64 and 128 - note the binary sizes, I think this helps performance by assisting theano with copying chunks of data to and from GPU)
That's a useful tidbit.
Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this
Which side would you normally pad the zeros? From my experience, I also pad the zeros on the _right_ side (as does keras's pad sequence function). When you reverse the input, are you suggesting you pad on the left side? What would be the advantage of that if the network expects to always receive input in reverse order?
Let me clarify that when I say reverse the input, you reverse _all_ of the inputs you give it. You never give the network the input in its original order. Therefore, the network always expects to receive the input in reverse order. I kinda think of it as you reading a book in reverse, and learning to read in reverse. You're always fed books that have words in reverse order.
Interesting that there's a mirroring strategy, but I feel that it would confuse the network. I'll go digging for some papers on it and understand it better.
Regarding Qu1:
I'm actually not writing any custom code for Keras. I do my own significant pre-processing to cluster the words and assign each word a unique integer id within the cluster. This allows me to get an 80k vocab down to ~400 different integers, where it takes two integers to represent each word (a cluster id and a word id). I chain the cluster id and word id together. So to represent 15 words, it takes a string of 30 numbers.
But I do heed your suggestion! Keras code is usually much more durable than custom code. I go with Keras code when possible.
Regarding Qu2:
This was my biggest question, so I appreciate all of your details.
for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k X 100, so it's 3D.
Good to know it has to be 3D; in retrospect, I feel a little foolish for asking that. So to clarify, let's suppose you have a vocab of 10k words. Are you saying that you must one-hot this 10k vocab? That seems super inefficient to me.
I guess what I'm proposing is that instead of one-hotting, you give an integer and then apply some sort of embedding (much like you did in your embedding layer for x_train). I'm cool with doing one-hot, but it just seems inefficient computationally and RAM-wise.
The output is 3d - (number of samples, size of each vector, length of the sequence).
Not to be a detail jerk, but from Keras's docs, I've seen the usual _order_ to be:
(x: number of samples, y: number of timesteps, z: vector size of softmax output)
So in our case:
(x: number of samples, y: max number of words per sentence, z: one-hot of the 10k words)
Regarding Qu 3:
Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences.
This makes a lot more sense. It gives the network a clear understanding of when there is a start and a stop. I will definitely be adding these in.
This is late for me, so I hope I am making sense.
You're making a lot more sense than textbooks.
Although i've heard of people having success with stacking RNN's, I haven't personally found that to do better, it usually under or overfits for me when I try that.
I'll definitely start with just one RNN for my decoding layer. One thing I've learned from my experience is that stacking RNNs does perform significantly better, but _only_ when you have big data. Many of my experiments have shown that to me. Many Google papers tend to use these big nets as well.
I plan on training on at least 30 million samples (sentences). If I did it on 3 million or less, I could see a single layer performing better. If you're doing better with just one LSTM, that's a red flag to me that you don't have enough data, IMHO.
I do have one more question from what you've written:
Qu 4: Why are you vectorizing words for your y_train if you're one hotting them in the end?
I apologize if this is an obvious answer, but as I understand it, you're one-hotting each of your words. If you have a 10k vocab, why don't you assign an integer per word, and then one hot each of the words respectively? How are you incorporating these 32 or 64 length word vectors?
I'm sorry if I really missed the boat on this one. I use word vectors to cluster words into groups as I mentioned above. But in the end, I assign each word an integer (word id). From that integer, I can then one hot. Wouldn't you all be doing the same? I'm talking strictly about the y_train data.
Apologies for the essay response, but this is such a good conversation. If you're interested in Skype chatting anytime, my Skype name is the same as my username here. Hopefully I can help you at least a little bit in return for the help you have given me.
I think you'd want it left padded. Which is what I thought keras does, but I don't have time to check. The important point is to pad from the same side regardless of whether the input is reversed or not. I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative. That way you have a baseline you can compare against to ensure you are improving things iteratively.
The reason I think you want to left pad is how this works. You have an encoding layer and a decoding layer. The encoding layer processes the input left to right to produce a vector representation of the entire sentence. That vector is then replicated (just due to how keras is built) so that the input is replicated for every predicted output label. Then the decoder runs an RNN over that repeated encoding and its internal state (which has a feedback loop) to produce the output. So if your input is right padded, the right-most tokens are zeros, which normally causes the network to reset its state, so your encoding is not great. I could be wrong on this, but that is my understanding. It's been a few months, so I could be missing something. Let me know if keras right pads, as I am not overriding this.
Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters. That way you can have a 300-element encoding without the need for a one-hot representation. That's how most people handle this problem these days: they use a language model to learn word embeddings. You don't even need to do that yourself unless you have very domain-specific data; you can use the ones Google pre-trained on a massive corpus. That actually worked better for me than learning my own embeddings, although the optimal strategy is a combination of pre-trained vectors and tuning with supervised embeddings.
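A hedged sketch of that setup with gensim (the model path, maxlen and the tokenized_sentences list are illustrative assumptions): look the pre-trained vectors up yourself and build the 3D (samples, timesteps, vector_size) input you would feed the RNN directly, with no Embedding layer.
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec.load('my_domain_word2vec.model')    # or a model trained on your own corpus
dim = w2v.vector_size
maxlen = 60

def sentence_to_matrix(tokens):
    mat = np.zeros((maxlen, dim), dtype='float32')          # zero rows act as left padding
    vecs = [w2v[w] for w in tokens[-maxlen:] if w in w2v]   # skip out-of-vocabulary words
    if vecs:
        mat[-len(vecs):] = np.array(vecs)
    return mat

X = np.array([sentence_to_matrix(s) for s in tokenized_sentences])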
Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that. The word2vec embeddings and similar are taken from a neural network like model that is trained on a one-hot encoding, the embeddings are the weights from each of the inputs to the hidden layer, as I understand it. There are many variants on this, but that's the basic idea. You can emulate that using a dense vector before you one hot the output. That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
Regarding the ordering- I think keras' changed somewhat from when I last used this (before you specified named dimensions). Make sure it matches whatever the docs say and you'll be good.
That's cool about stacking. I've gotten really good performance from small data on this, which is very unusual. The more model params, the more powerful the model, and so by adding more layers you are giving it more degrees of freedom. If you have big data then you can take advantage of a much larger model. I'd start small and simple though, get reasonable performance and then start making it more complex. Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.
I think I discussed the last question above. To be honest, for performance I'd have it predict the pre-trained word2vec vectors as outputs. Again, in my domain of smallish data, my vocab size is actually small enough that doing a one-hot is not an issue for me. The best thing from the literature is probably a hierarchical softmax, but you'd have to hand roll that.
Gensim's word2vec runs on theano now I think (at least I've been getting theano errors from it, so I am assuming so). You may be able to take their hierarchical softmax and plug it into keras. If you get that working, please submit a pull request and give back to the community.
I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative.
Words of wisdom. I'll do regular input first.
Let me know if keras right pads as I am not overriding this.
Keras does right pad. I can understand, as it's been a while for you. I've run it several times, and it always pads on the right. To clarify, it places the zeros on the right side. Again, I appreciate the detailed explanation.
Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters.
In the end, I think you're right. The reason why I did the clustering is that my words are very specific to my domain. Pretrained vectors aren't very good at high-level biology and astronomy terms (my interest).
By clustering, I essentially train the net to know that certain terms are related. The word id within each cluster is ordered by word frequency. So the most frequent words in that specific cluster are used more often. Thus, if the net has to guess, it will guess a word id of 0, or 1, and it will choose a word that is more used.
When strictly talking about the x_train input, I do not do any one-hotting. I simply submit each word with its cluster id first, followed by its word id. Thus my 2D x input looks like this:
[34, 2, 45, 0, 23, 1, 34, 4, 45, 3]
These ten numbers represent 5 words. In this way, the maximum integer used is ~400 (400 clusters with at most 400 words per cluster). Therefore, when I do a softmax, I only have to do it over 400 options. This was the whole motivation behind clustering words. This 2D x input is fed into the embed layer.
The meaning of the 2nd and 3rd dimensions only matter if you are using certain loss functions (like categorical cross-entropy) that do a soft-max like operation over all classes,
Yes, I plan on using categorical cross-entropy which is why I asked. Good to know!
Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.
Ahh I see. To add to the stacking idea, have you tried using a series of dense layers afterwards? They sometimes improve performance without the need for more data.
You've probably already considered this, but have you tried looking into unsupervised learning so that you can acquire far larger data sets? No idea what you're doing, but from my experience: always go with unsupervised if you can get at least 100x more data.
Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that.
That's what I figured. The line you said about the dense layer is really good to know.
That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
I'll stick with the word clustering strategy for now so that the softmax is down to 400 choices. I read some threads on hierarchical softmax in keras, but it seems a bit painful to implement.
If you get that working, please submit a pull request and give back to the community.
I definitely want to be as helpful as I can to you guys. It will take me at least a week or two to fully implement these ideas. If I find anything interesting, I'll report my findings on this thread so that hopefully you guys can benefit from it. I'll also be experimenting with using different learning rates and adding dense layers at the end of the decoder. If I do any keras modifying, I'll be sure to submit a pull request!
Instead of the clustering I'd try training word2vec or GloVe (from Stanford) on your biology/astronomy dataset, if it's large enough. It doesn't work well on small data; on my small PhD dataset it didn't perform well. But for work I have a much larger dataset, and the vectors learned there were very good, and that's a very domain-specific dataset - I wasn't able to use the pre-trained vectors for that either. Then you can use those vectors, if they're any good, as your target outputs. You can ask word2vec for the top 10 matching terms; running that for some important keywords in your domain will tell you if it's worked well. Make sure you train for more than one iteration though - one is the default and it sucked for me - but at 10-20 iterations I was getting good results.
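A sketch of that with gensim (parameter names may vary by gensim version; tokenized_sentences is assumed to be a list of token lists from your corpus): train on your own domain text, run several passes, then sanity-check the nearest neighbours of a few key domain terms. gensim.models.Phrases can be run beforehand to merge common bigrams/trigrams into single tokens, per the phrase-extraction point below.
from gensim.models import Word2Vec

model = Word2Vec(tokenized_sentences,
                 size=128,        # embedding dimensionality
                 window=5,
                 min_count=5,
                 iter=15)         # several passes over the corpus, per the advice above
print(model.most_similar('pulsar', topn=10))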
I corrected my comment about the meaning of the 2nd and 3rd dimensions; it does matter in this case, I believe, as we're using an RNN to generate the output sequentially, so the dimension that refers to time is important. Apologies.
In terms of unsupervised learning, yeah, I think it's very useful. Using the pre-trained Google word2vec vectors has helped somewhat, and that is unsupervised, at least in the manner I'm using it (it wasn't trained for the purposes I am using it for). It's supervised in the sense that it's training a language model, but if you are using it for some other task, then I'd argue that in that context it's not. That's the easiest thing I can do right now in this regard, and the top 10 similar terms when I query that model are very good, even for my specific domain (science essays). You can think of it as a form of soft clustering, where each element in the 300-dimensional vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic. It's also important in word2vec to extract common phrases and treat those as words; I've found that helps tremendously. I've used a variant of association rules (applying the downward closure principle) to build a common phrase extractor to detect commonly occurring phrases in my domain.
Instead of the clustering i'd try trainining word2vec or GLOVE (from Stanford) on your biology astronomy dataset, if it's large enough.
Yes, I completely agree with you that the word vectors do indeed beat the clustering. I think the main reason I didn't do the word vectors initially is that I simply don't know how to incorporate them into the input for my x_train.
At this point, I just feel so bad because you have already helped me a ton, and I haven't given much to you. So if you're busy, no need to reply to what I'm saying below:
I guess what confused me about each word is that there are 300 numbers (for a 300 length vector), so how do you input 300 numbers on a 2d input for your x_train? I assumed you were using an embedding layer. But even if you do use an embedding layer, the vectors are not integers. Keras requests integers to be used in the embedding layer: http://keras.io/layers/embeddings/#embedding.
The clustering I was doing was nice because it gave me two integers per word, so I could embed them really easily (and understand what I'm doing! But the clustering strategy feels so amateur.)
Don't get me wrong, I know how to use word2vec and glove to generate word vectors for each word. I just didn't know how to format the inputs into the x_train or for that matter, the y_train either. Let me try to do some reading to figure it out.
But for work I have a much larger dataset and the vectors learned there were very good,
Yes, I have experienced the same when I use word2vec on my training data. When I ask for the top 10 matching terms, they are incredibly close. For example:
["neutron_star", "pulsar", "neutrons", "high_density", "high_rotation", "aftermath"]
Make sure you train more than one iteration though, that's the default and that sucked for me, but at 10-20 iterations, I was getting good results.
You are referring to the Word2vec training correct? Not the actual seq to seq model? If so, I didn't even think of this and I _should_ do this regardless!
You can think of it as a form of soft clustering, where each element in the 300 dimension vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic.
Yes, this is what lead me to the idea of doing a k-means cluster in the first place. But like I said, inputting direct word vectors into the model does beat the clustering idea.
It's also important in word2vec to extract common phrases and treat those as words, i've found that helps tremendously.
Yes, I do bi-gram, tri-gram, and quad-gram, and stop there. I figure penta-gram is just overkill, but heck I might try it sometime. But I have had major success with using the phrase extractor as well. Always good to hear that someone else is doing the same thing as you.
As a side note, something that helped cut down on extraneous words was pyenchant https://pythonhosted.org/pyenchant/api/enchant.html. Basic idea is to make your input text a list of words, and fix spelling errors (or recorrect words that shouldn't belong). First tokenize all the words of your input into a list using nltk. I don't know if this will be of any use to you, but if can help you, I'll feel better!
Here's my code. I apologize as it's really messy right now:
words_not_vectorized = set()
all_words_untouched = set(tokenized_words_untouched)
print 'applying frequency distribution to original text'
word_freq = nltk.FreqDist(tokenized_words_untouched)
for eachword7 in all_words_untouched:
    if word_freq[eachword7] < 2: # we choose 2 because a word is rarely misspelled twice in the same way
        words_not_vectorized.add(eachword7)
words_that_are_common = all_words_untouched - words_not_vectorized
print 'creating personal spelling dictionary'
with open('listofspelledwords.txt', 'w+') as listofspelledwords:
    for eachword12 in words_that_are_common: # add to dictionary for spelling corrector
        listofspelledwords.write(eachword12 + '\n')
del words_that_are_common
d = enchant.DictWithPWL("en_US","listofspelledwords.txt")
spelled_tokenized_words_untouched =[]
number_of_corrected_spelling_errors = 0
start_time = time.time()
print 'correcting spelling errors -- this will take a while'
for eachword13 in tokenized_words_untouched:
    if d.check(eachword13): # word spelled correctly
        spelled_tokenized_words_untouched.append(eachword13)
    else: # word not spelled correctly
        try:
            spelled_tokenized_words_untouched.append(d.suggest(eachword13)[0])
            # print 'changed '+eachword13+' to '+(d.suggest(eachword13)[0])
            number_of_corrected_spelling_errors = number_of_corrected_spelling_errors + 1
        except IndexError:
            spelled_tokenized_words_untouched.append(eachword13)
print 'the time for spell checking is below: '
print("--- %s seconds ---" % (time.time() - start_time))
print 'number of corrected words from spell check is: '+str(number_of_corrected_spelling_errors)
del number_of_corrected_spelling_errors
With the words that are left over, I find their respective hypernyms and replace them with those hypernyms. Note that I'm doing all of this _before_ I apply word2vec. This helps word2vec results tremendously because it increases the frequency of the words:
'''-------------------------------HYPERNYM CLASSIFICATION SCHEME-------------------------------------'''
total_words = set(spelled_tokenized_words_untouched)
number_of_unclassified_words = 0
number_of_hypernyms_found = 0
# my_regex = r"\b" + re.escape(eachword4) + r"\b"
# text = re.sub(my_regex, hypernym_of_word, text)
hypernyms_replaced_one_text =spelled_tokenized_words_untouched
start_time = time.time()
print 'you are now finding and replacing uncommon words with hypernyms'
for eachword4 in words_not_vectorized:
    use_hypernym = 0
    use_synonym = 0
    try:
        synonym_set_of_word = (Word(eachword4)).synsets[0]
        hypernym_set_of_word = synonym_set_of_word.hypernyms()[0]
        hypernym_of_word = hypernym_set_of_word.name().partition('.')[0]
        for n, eachword10 in enumerate(spelled_tokenized_words_untouched):
            if eachword10 == eachword4:
                hypernyms_replaced_one_text[n] = hypernym_of_word
                number_of_hypernyms_found = number_of_hypernyms_found + 1
    except IndexError:
        number_of_unclassified_words = number_of_unclassified_words + 1
print 'you completed the hypernym process in time below'
print("--- %s seconds ---" % (time.time() - start_time))
print 'below is the number of words originally not vectorized:'
print len(words_not_vectorized)
print 'below is the number of words with no hypernyms found'
print number_of_unclassified_words
print 'Below is the number of different words within original text'
print len(total_words)
print 'total number words in the original text below'
print len(spelled_tokenized_words_untouched)
del spelled_tokenized_words_untouched
print 'total number words in the hypernymed text below'
print len(hypernyms_replaced_one_text)
# print 'total number of words in hypernym filtered list'
# print len(replaced_uncommon_with_hypernyms_one_list)
print 'total number of hypernyms found and replaced:'
print number_of_hypernyms_found
print 'the number of different words BEFORE ANY PREPROCESSING WAS DONE including punctuation is '+str(len(all_words_untouched))
I'll keep this short. To use pre-trained word embeddings, just lop off the embedding layer. The input to the LSTM is 3D if I recall correctly. Different example (and not seq to seq) but here's a conv net that is using the pre-trained embeddings that I have used:
print('Build model...')
# input: 2D tensor of integer indices of characters (eg. 1-57).
# input tensor has shape (samples, maxlen)
nb_feature_maps = 32
n_ngram = 5 # 5 is good (0.7338 on Causer) - 64 sized embedding, 32 feature maps, relu in conv layer
embedding_size = emb_shape[0]
model = Sequential()
model.add(Convolution2D(nb_feature_maps, 1, n_ngram, embedding_size))
model.add(Activation("relu"))
model.add(MaxPooling2D(poolsize=(maxlen - n_ngram + 1, 1)))
model.add(Flatten())
model.add(Dense(nb_feature_maps, 1))
model.add(Activation("sigmoid"))
# NOTE: add in repeat layer and decoder here
You can adapt that to be the decoder part, or you can keep the RNN structure too, as I mentioned remove the embedding layer and load in your own embeddings in place of it.
Another option is doing convolutions over embeddings (allowing bi-gram, tri-gram, etc to be extracted as convolutions).
The iterations comment was for Word2Vec, correct.
Awesome, thanks again @simonhughes22, I really appreciate your help, and I'll definitely do some digging around and experimentation! Removing the embedding layer was the part I was missing. It will take me about 3 weeks to really test the ideas you suggested. If I find something interesting or helpful, I'll post it back here. Thanks again man!
One other thing: you could train a character RNN. That way you have only a small number of inputs and outputs. Like this: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. Google "character RNN" for more papers. LeCun's team did some work on this with conv nets if I recall, and Karpathy (I think) with RNNs.
Thanks for the pointer, but that's actually where I started! I started by predicting chars after inputting 30 chars. I didn't try seq to seq on it. But I do appreciate the suggestion! I think I'm gonna stick with words and try to make as much progress as I can. Currently building the seq to seq model. Hopefully training will start tomorrow.
Just as a side note, the predicting chars works well when you throw about 300mb to 3gb of text at it, and ramp your lstm layers to 8 or 10. If you give it 100 chars, and ask it to predict the next one, it will predict phrases of sentences pretty well. It takes forever to train though, so I suggest changing the learning rate on adam to 0.02 and decreasing it when the loss goes crazy.
Just as an update for anyone reading this thread -- I retested the padding_sequences, and it pads on the _left_ side. So I was completely wrong, and Simon's memory is better than mine!
:). @LeavesBreathe I only remember that because it's important to how the LSTM RNN works. As you run the input left to right, it outputs a vector based on that whole sequence, but that vector is a reflection of its internal state, which is more sensitive to the more recent values (which is why you often want to output full sequences, as those aren't). Once it hits the zeros it learns to reset its state, so if the padding were at the end, it wouldn't work very well, as those are the last inputs processed, wiping its state. Which is why I said it's very important to consider, especially when reversing the inputs. If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side. It 'may' not learn to reset its state, but once it hits a zero it just needs to learn to predict more of them, so it doesn't need to remember anything else about the rest of the sentence at that point, and so in practice it will use that to trigger the forget gate. Hope that makes sense.
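For anyone wanting to check or control the padding side themselves, pad_sequences takes a padding argument ('pre' puts the zeros on the left and is the default in current Keras versions):
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[7, 8, 9]], maxlen=5))                   # zeros on the left ('pre' is the default)
print(pad_sequences([[7, 8, 9]], maxlen=5, padding='post'))   # zeros on the right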
If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side.
Thank you, this makes much more sense. I didn't realize it reads them left to right, and that if the zeros were on the right side, it would wipe the state. Makes complete sense.
Sometime, I want to buy you a latte (grande)
@LeavesBreathe I realize you could be anywhere in the world so this is a long shot. But if you happen to be anywhere near the Chicago area, or San Jose (I visit there a few times a year), I'd take you up on that.
hahaha I'm actually in Cincinnati (looks like 4.5hr drive for me), so this could be a possibility =p If you want to add me on skype, my username is "leavesbreathe"
Hi guys,
I went through the whole conversation but I still have a simple question about the Embedding Layer.
Assume I have a vocabulary of size nvocab, and all my sequences have length nseq. Assume I have only one sample for simplicity. I want to find an embedding for each word with nembed dimensions.
1) Should the input to the embedding layer, xtrain, be a sequence of integers or one-hot encodings? Currently, my xtrain has length nseq, with each element an integer that can take a value up to nvocab.
2) If xtrain is a list of integers (rather than one-hot encodings), what does the weight matrix in the embedding layer look like? Let us look at one word in the sequence. What I understand so far is that the embedding layer internally converts this word (a single integer) to a one-hot encoded vector of size nvocab (let's call this vector Vonehot). Then, the network learns an embedding weight matrix, Wemb, of size [nembed x nvocab], and computes an embedding vector Vembed = Wemb x Vonehot. Vembed is a vector of length nembed. Is that true?
Thanks a lot.
@kg07
1). If you use an embedding layer, feed it a list of integers. Without it you'd need a one-hot encoding.
2). This is correct. An embedding layer is just a linear NNet layer mapping a one-hot encoding to a hidden layer. The 'embedding' is simply the weights associated with each input node. As there is a separate input node for every word, you get a separate embedding for each word. This is represented by a matrix of size (nembed x nvocab), although the 2 dimensions may be switched depending on implementation details. At least that's my understanding. Of course the literature never explains it that simply, which is a shame, but that's what I've inferred after much digging. I'd love for someone to correct me if that is wrong.
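To make that concrete, here's a tiny numpy sketch (not Keras internals, just the arithmetic) showing that multiplying a one-hot vector by the weight matrix is the same as picking out one column of it, using the nvocab/nembed names from the question above:

```python
import numpy as np

nvocab, nembed = 10, 4                   # toy sizes
W_emb = np.random.randn(nembed, nvocab)  # embedding weights: one column per word

word_id = 7                              # the integer you feed the Embedding layer
v_onehot = np.zeros(nvocab)
v_onehot[word_id] = 1.0

v_embed = W_emb.dot(v_onehot)            # the nembed-dimensional embedding
assert np.allclose(v_embed, W_emb[:, word_id])  # identical to a plain column lookup
```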
@kg07
Give the embedding layer your integers -- the embedding layer _only_ accepts integers -- and your input should be 2D: (nb_samples, sequence_length).
sorry just saw simon's comments -- nvm mine
Thanks a lot guys!
Note that often they use a dictionary for performance purposes when implementing those embedding layers. But that is equivalent to what I described AFAIK. For any pedantic types :)
Hey @simonhughes22 , thanks for the pointers again. I've made some good headway so far.
I wanted to revisit one issue we discussed earlier: What is the best way to format the Y_train so that we can predict words? Which of these ideas do you think is the best?
I've read in several places that doing a softmax over 2k terms is just a very bad idea. You face the curse of dimensionality, meaning it gets exponentially harder to predict words. So if you have a vocab of 100k words you would have to do a 100k softmax. This seems like the option of last resort.
I've read in some papers doing 100k softmaxes but only with 8 Titans or so. Here's an example: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
I think this is the best idea, but the hardest to implement because of the heavy writing. Some work has been done here: https://github.com/fchollet/keras/issues/438.
I feel that I'm not experienced enough to fully pull this off yet. But maybe in the future I will attempt to do this and submit a PR.
Another idea is to abandon the categorical softmax altogether and simply predict vectors. Obviously the neural net is not going to predict the exact vectors, so you take the vector it does produce for each word and find the closest matching word. I don't know if you can do this in word2vec, but I imagine you could. So for each word, you would predict, let's say, a 32-dimensional vector that describes the word.
I think this is complicated to implement. For each sequence of, let's say, 20 words, you're asking the neural net to produce 32 x 20 = 640 numbers. This seems like a nightmare to me. I guess you would use a linear/tanh activation, mse objective, and RMSprop optimizer?
Not to bring back bad ideas, but I do think the clustering discussed earlier would work well here. The reason being that you do a softmax, but it is only over ~400 terms for 80k words. I've run this to predict individual words (not sequences of words), and it always gets the cluster id and word id right after epoch 1 (a sketch of the mapping is below).
Advantage: You only have to softmax over 400 terms. You get the added bonus that the word id will be near 0 or 1 (since all of the words in a cluster are ordered by frequency).
Disadvantage: You have to predict two separate integers per word. You also have to one-hot encode, but it's only over 400 numbers.
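Here's roughly how I build that (cluster_id, word_id) mapping -- just a sketch, assuming `vocab` is a frequency-sorted word list and ~400 clusters:

```python
import math

vocab = ["the", "of", "and"]  # assumed: ~80k words sorted by frequency, most common first
n_clusters = 400
cluster_size = int(math.ceil(len(vocab) / float(n_clusters)))

word_to_pair = {}
for rank, word in enumerate(vocab):
    cluster_id = rank // cluster_size   # target of the ~400-way softmax
    word_id = rank % cluster_size       # second small target, frequency-ordered within the cluster
    word_to_pair[word] = (cluster_id, word_id)
```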
I don't mean to bother too much, but I just wanted to hear your thoughts on these options. It will take me at least a few days to implement/test each idea, so I would rather start with the best one and see what happens. Thanks!
Howdy,
I've been doing something like your idea 3 using pre-trained vectors as both inputs and outputs:
word_vector_size = 300 # this is the dimensionality of the word vectors I already have
dense_size = 512 # (not actually used in this snippet)
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (1, max_len, word_vector_size) # max_len = padded input sequence length
model.add(M)
model.add(GRU(word_vector_size, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(word_vector_size, activation="linear"))
#optimizer = rmsprop(lr=0.001, clipnorm=10) # another option
optimizer = adam(lr=0.001, clipnorm=10) # works for me
model.compile(optimizer=optimizer, loss='mse')
(Sorry for the ugly Masking hack - there is currently some bug such that just doing model.add(Masking) doesn't work without an embedding layer at the moment.)
The inputs have shape: (n_samples, max_len, 300) and the outputs are (n_samples, 300). The vectors are dependency-based pre-trained word vectors from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
I don't have concrete results yet, but it does learn, and is a much smaller output space than the one-hot idea (your idea 1). Before this I tried that with 10k word classes and it was also learning, but VERY slowly (as I only have a laptop GPU - GTX 870m); I found you need to use really heavy gradient clipping with rmsprop (clipnorm=0.1), or no learning would take place. Also, the output space was much larger -> 10k by the number in the mini-batch, whereas with regression your output space is only 300 by the number in the mini-batch.
My goals are maybe different from yours - I want smart encodings of sentences such that nearest neighbors give sensible results, and better than TFIDF BoW nearest neighbors. It seems like something in this general RNN direction should work, but I probably need a bigger machine to do the training =)
@sergeyf Thanks for the tips!
(Sorry for the ugly Masking hack - there is currently some bug such that just doing model.add(Masking) doesn't work without an embedding layer at the moment.)
Thanks for the mask hack -- I was trying to figure this out this morning!
(as I only have a laptop GPU - GTX 870m)
I gotta tell you man, getting a modern Maxwell GPU is so worth it. I can't even imagine trying to do this on a laptop. You can try to get decent GPUs on eBay. I regularly see 980 Tis going for $600 and Titan Xs going for $850. Just make sure they weren't used for bitcoin mining. This post helped me a lot: http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/
Also, the output space was much larger -> 10k by the number in the mini-batch, whereas with regression your output space is only 300 by the number in the mini-batch.
Right. This is the whole idea with regression. You only have 300 numbers per word (or however many you choose to use with word2vec).
Currently my goal is to take a sentence, and predict the next sentence that makes somewhat logical sense. I think this is similar to what you're doing? It is similar to translation but the "translation" is the next sentence that should come.
Also, how are you converting your 300-number outputs to a word (when you do model.predict)? Is there a function in word2vec that does that? (I did a little searching and couldn't find one.)
Happy to help!
I am not converting the 300-number output to a word. I just leave it as is. Once the training is done, I feed entire sentences into the network, but take the representation that comes out of the RNN (before the Dense layer) and use that as my representation of the sentence. Then I do NN (nearest neighbors) in the sentence space. The idea being that I just fed a sequence of words into an RNN, so its final representation should be a sentence. Does that make sense? This is why I was pointing out that we may have different goals.
@sergeyf apologies for my misunderstanding. I understand what you're getting at. I guess I'm still kinda stuck, though, as to which of the four ideas I mentioned above would be best for next-sentence prediction. =/
I definitely think what you're doing is smart, and could potentially work really well!
No worries!
I am not sure why you wouldn't just predict the next sentence as represented by word2vec vectors?
So input is "Horses run." as represented by [x_horses, x_run] and output is "They run quickly." as represented by [x_they, x_run, x_quickly]. Why ever convert things into categories instead of just leaving them always as vector embeddings?
@LeavesBreathe instead of predicting one-hot vectors, replace those one-hots with the pre-trained vectors, word2vec or those dependency embeddings. Either way your output will be a matrix; you just drop one of the 2 dimensions from the size of your vocabulary to the size of the embedding. If that makes sense.
@sergeyf @simonhughes22 Alright, I just feel stupid. Both of you are telling me the answer, and I can't understand it.
I completely understand x_train. The 3d matrix of (nb_samples, timesteps, word_vectors). And I can do the same with y_train as well. But when I do model.predict, won't the model have to predict word vectors for each word?
Suppose you use 32 scalars per word (when you set the word2vec size = 32). This means that for each word in the sentence, you must predict 32 scalars, correct? Not only that, you can't use a softmax. And what are the odds that the network is going to predict the exact 32 scalars that correspond to a word? This is why I'm so lost.
IIRC you stack the one hot or the embeddings vertically, and then you have one column per output step up to max sequences. I may have the rows and cols reversed, but that's the idea
make sure, if predicting embeddings, you use RMSE error. I don't think any of the other error metrics are correct for that. At prediction time (once trained) do a cosine sim search on most similar vectors
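A minimal sketch of that lookup step, assuming `embeddings` is a (vocab_size x dim) matrix of the same pre-trained vectors used for the targets and `vocab` is the matching word list:

```python
import numpy as np

def nearest_word(predicted_vec, embeddings, vocab):
    # cosine similarity between the predicted vector and every word embedding
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(predicted_vec)
    sims = embeddings.dot(predicted_vec) / np.maximum(norms, 1e-8)
    return vocab[int(np.argmax(sims))]

# decoded = [nearest_word(v, embeddings, vocab) for v in model.predict(x)[0]]
```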
At prediction time (once trained) do a cosine sim search on most similar vectors
Thank you! This is what I thought you had to do. So the final network should look like this correct?
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (1, maxlen, word2vec_dimension)
model.add(M)
model.add(JZS1(hidden_variables, input_shape=(maxlen, word2vec_dimension), return_sequences=False)) # note for the input shape that you did not put the number of samples
model.add(Dropout(dropout))
model.add(Dense(hidden_variables)) # consider adding another dense here
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
for z in range(0, number_of_decoding_layers):
    model.add(JZS1(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(max_features, activation="linear")) # consider adding another timedistributeddense here
model.compile(loss='mean_squared_error', optimizer='rmsprop')
Particularly:
Loss = mse (or are you saying I should do rmse here instead?)
Optimizer = rmsprop or adam
Activation = Linear (or is there a better option?)
I think so
I've also had it work with binary cross-entropy, as that doesn't technically do a softmax, but it's not really meant to be used like that. fchollet recommended I do RMSE, but you could try bce if desired. Outputs need to be in the range 0-1 to use bce I think, so your word vectors would need to be in that range.
Does RMSE vs MSE make much of a difference? It would change the steepness of curvature near the minima and/or saddle points, but not sure which one is preferred. Probably depends on the problem (as with everything)...
@sergeyf that's probably an empirical question. RMSE is a popular error metric because the errors in the training data are assumed to be normally distributed (per the CLT), and so RMSE is the 'best' metric to minimize under those assumptions when you have real numbers rather than ordinals - at least in theory.
Outputs need to be in the range 0-1 to use bce I think, so your word vectors would need to be in that range.
I think word2vec does this, so I'll try that! Interesting that you can use bce -- I would not have thought of that, but hopefully it will perform better than mse. If I end up trying RMSE, I'll submit a pull request for it.
There's a discussion about this happening on r/machinelearning: https://www.reddit.com/r/MachineLearning/comments/3qyn0m/sequence_to_sequence_mapping_via_lstm/
Their claim is that this doesn't work as well as classification!
@sergeyf this link from that discussion would back that up: https://github.com/yandex/faster-rnnlm. In summary, hierarchical softmax is used for speed, but you are sacrificing some accuracy for that efficiency gain. Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems. NCE seems to be the way to go for these models, but I am unsure how you'd do that in a sequence learning model.
You 'could' try learning to predict a bag of words (BOW) representation instead of a sequence, i.e. a single vector the same length as your vocabulary, with a binary indicator for each word. Then train a second model, a simple language model, to translate this into the most probable word sequence. But you've thrown away any word order in passing the predicted BOW between the two models, so this probably wouldn't work as well, particularly as you can often re-order the words in a sentence and completely alter the meaning. But it very much depends on the problem you are solving. If it's not a linguistics problem but some other sequence, the ordering may be very easy to determine from a BOW-type output.
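If anyone wants to try that, the BOW target is just a multi-hot vector per sample; a rough sketch (`word2idx` is assumed to be your own vocabulary index):

```python
import numpy as np

def to_bow(sentences, word2idx):
    # binary bag-of-words targets: one row per sentence, one column per vocab word
    Y = np.zeros((len(sentences), len(word2idx)), dtype='float32')
    for i, sent in enumerate(sentences):
        for w in sent.split():
            if w in word2idx:
                Y[i, word2idx[w]] = 1.0
    return Y

# pair this with a sigmoid output layer and binary cross-entropy, since each word
# is an independent yes/no rather than one class out of the whole vocabulary
```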
@simonhughes22 thanks for that link - looks really interesting. My particular goal is to do NN queries for various sentences. The sentences tend to be on the shorter side, so it may indeed be not a big deal to lose the ordering. I've tried some models that seem to suggest that, but nothing conclusive yet.
@sergeyf @simonhughes22, this reddit discussion is really good. I'll post the findings I've had so far over there.
Also, if you're interested, I submitted a PR for root mean squared error, so it's in Keras now if you want to use it.
Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems.
I completely agree. Doing a softmax over 100k words seems like the wrong direction, even though that Google seq to seq paper did it (it needed 8 Titans though). This is why I suggested the clustering: so that you could use a softmax (and therefore categorical_crossentropy) and not have to resort to mse or rmse. The reddit discussion seems to criticize mse pretty hard.
In the meantime, I've been directly inserting vectors and then using cos distance to predict the next sentence. Haven't had much luck yet, though.
I'll comment back here when I have more info
@sergeyf this might be of use to you, although given how fast the field moves it's relatively old: http://www.utstat.toronto.edu/~rsalakhu/papers/topics.pdf - Hinton and Salakhutdinov paper
Thanks @simonhughes22 - I've seen a number of similar papers that all seem to use additional noise and stacked denoising autoencoders. It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)
I also found two non neural-network approaches.
One that makes use of pretrained word vectors and then does a vectorized word mover's distance between sentences - http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf Python code for the first link is here: https://github.com/mkusner/wmd
And another that marginalizes out the noise that is added for the denoising autoencoder, yielding a closed-form solution: http://arxiv.org/pdf/1301.6770v1.pdf
It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)
I asked about this about a month ago and @EderSantana said it would be easy to implement, but I'm not sure if it would be worth it right now. RNN's might still be better than DBNs?
@sergeyf thanks, cool, i'll check it out
I'm not an expert, but seems like they are fundamentally different enough to both be worthwhile?
RBM's are pretty simple to code up. Not all that different to an autoencoder though, which you could do in keras as is.
Hey guys, just as an update. I used about 250MB of text to train. Basically, I'm getting results where it repeats the same word 4 or 5 times, then moves on to the next word. This was inputting words as 128-dimensional vectors and then doing cosine distance on the output vectors.
This was with rmse, adam, and linear activation. The good news is that it nails "sentence start" and "sentence end" every time. Found the best results with 2 LSTM encoders (hidden = 128) and 3 JZS1 decoder layers (hidden = 256). Also, I tried 2 time-dist-dense layers and it made it slightly better.
From the reddit discussion yesterday, I'm going to try inputs as vectors, and outputs as clusters + word id. It will probably take me a week to set this up properly and test. Will report results back here when I get them!
@LeavesBreathe you normally need to train it for a long time to get past that. I was never able to get great results, as it was more of an intellectual exercise, but it did seem to improve over time, and I didn't leave it running for a long period of time.
Good to know. I was training each model configuration for 50 epochs, adjusting the learning rate when loss was rising.
What I want to do is compare:
50 epochs of using cos distance with linear + rmse
to
50 epochs of using clustering with softmax + categorical_cross
And see which one performs better. Then do 200 epochs on the one that performs better =). With the cos distance, it takes me about 16 hrs for 50 epochs.
Crikey. Are you using the GPU?
Yeah, a 980 Ti... why, do you think that's too slow? Most of my time is wasted just loading matrices... I'm gonna upgrade from 16GB to 24GB of RAM in a few days... eventually, I'm thinking of going to 64GB of RAM (but that's like 4 or 5 months out).
I'm not a hardware guy, but that sounds fast. 16GB of RAM is good for me. I am assuming you just have a lot of data. If you haven't already, check that Theano is utilizing the GPU; I had to jump through a few hoops to ensure that.
Wait, I'm still confused -- are you saying training 50 epochs in 16 hrs is too slow or too fast?
If you haven't already, check that Theano is utilizing the GPU, I had to jump through a few hoops to ensure that.
It took me a solid day when I was starting out to make sure the GPU was being used, but I assure you that it is (it gets pretty hot, 60C).
Keep in mind I'm using 250MB of text, which translates to approximately 2 million samples. I train with a batch size of 4096... but like I said, most of the time is spent loading matrices (which is why I want more RAM).
Got it. Yeah you just have a lot of data. Mine is much faster, 50 epochs doesn't take me too long.
@simonhughes22
I've been following this thread and trying the network structure you shared on some conversational text data, but couldn't get it to learn anything useful. It looks like the only thing it learns is outputting sentence start and sentence end. The only difference is I'm using categorical_crossentropy for the loss function. I've added special tokens for sentence start/end, left padding, and masking. My sequence_len = 100, nb_words = 10000.
BTW, for sentence prediction, is a greedy approach the right thing to do (argmax on the final one-hot output layer for each step in the sequence), or should I sample a word according to the predicted probability?
Could you shed some insight on what could go wrong?
@oleole I can only speak to what I've tried. I have much shorter sequence lengths (inputs and outputs) and a pretty small vocabulary. @LeavesBreathe - also relevant to your question: when I trained it, it would start predicting the start and stop tokens, as those are the easiest to learn and most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting. I never got great results, as I've mentioned a number of times, but it did start to predict sequences. So you may just need to let it train for a really long time. In the academic papers on this subject, the training times are pretty long from what I've read; it's a really hard problem. Also, the longer the sequence, the harder it will be to learn relationships over the length of that sequence.
Is it possible to shorten the sequences somehow (e.g. predict the first 5 or 10 words in the sequence) or just predict the next word only? Predicting the next word would undoubtedly be easier, and that would then be a language model. You can even feed the prediction in at the end of the existing input and iterate that way to create the full output sequence, although you'll have to write a little bit of extra code to do that (a rough sketch is below). I can't remember if I mentioned that here or in another issue. I'd try that first for people who are having problems learning anything. I'd also start with a smallish dataset so you can iterate faster and test out the model structure before letting rip on the full dataset. The other thing to try is to predict the output word vectors as opposed to the one-hot encodings. In that case you would use RMSE or bce as your error metric (please see the discussion above).
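The 'little bit of extra code' might look something like this sketch: `model` is assumed to output a softmax over the vocabulary for the next word, and `start_id`/`end_id` are your own special tokens (pad or truncate `x` however your model expects):

```python
import numpy as np

def generate_sequence(model, seed_ids, start_id, end_id, max_len=50):
    # greedy generation: predict the next word, append it, and feed the longer
    # sequence back in, stopping at the end token or at max_len words
    sequence = list(seed_ids) + [start_id]
    for _ in range(max_len):
        x = np.array([sequence])        # shape (1, current_length)
        probs = model.predict(x)[0]     # softmax over the vocabulary
        next_id = int(np.argmax(probs))
        sequence.append(next_id)
        if next_id == end_id:
            break
    return sequence
```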
@simonhughes22 Thanks a lot for your long reply. I do feel it's learning slowly, but probably need much more time and data to move forward.
I agree learning the full sequence is a very challenging problem. From the "A Neural Conversational Model" paper, it's actually doing what you suggested, greedily learning the next word. I'll definitely give it a try.
The other idea of making the word vector the target is also very interesting, but it's kind of discouraging hearing the results from other people on reddit. A couple of things I'm still not quite clear on:
@sergeyf For your experiment using pre-trained vectors as both inputs and outputs. What are the targets you are trying to predict? It looks like a word vector. Is it the next word given input sequence?
Yes, exactly. I am predicting the next word vector from a sequence of word vectors. It didn't go very well!
Have you tried skip-thoughts vector https://github.com/ryankiros/skip-thoughts?
But there is no vector representation for the predicted sequence (decoder), so how do you get it to work @LeavesBreathe?
@oleole, I haven't really gotten much to work yet either. I too get sentence start and end tokens. However, my biggest problem (with doing y targets as word vectors) is that you get repeated words over and over again.
I don't fully understand your question. Your encoding layer will create a vector rep of your input sequence. The RepeatVector layer repeats that vector rep for all timesteps of your y output. I personally like to use two LSTM layers for encoding because I feel it captures more salient features. But this might be because I have a huge dataset (about 2 million sentences).
Apologies if I didn't answer your question.
The skip-thoughts paper is also of interest, and something I want to look into eventually. Right now, as mentioned above, I'm working on the clustering output. It's taking longer than expected to get the matrices right.
@LeavesBreathe - also relevant to your question: when I trained it, it would start predicting the start and stop tokens, as those are the easiest to learn and most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting.
This is really interesting to me @simonhughes22. I feel this is a strong sign that you need more training data (I know you have a limited set). Usually, if more epochs keep improving results, that tells you that more data would improve the model even more.
Imagine you doubled your data. Then your model would get to the same level (loss) in _half_ the epochs, assuming your data is perfect. Still really good to know that patience is key in this.
@oleole skip-thoughts looks promising, I've been meaning to try it out. The output varies as you are predicting a different output sequence for each input sequence, so I don't get the 'the output is fixed' comment. The start and end vectors can just be all 0's (or all 1's).
Hey guys,
So as an update, I've tried going from word-vector inputs (16-dimensional vectors) to cluster outputs. I've only run it for 3 days, so there are plenty of different hyperparameters left to try. Getting to 50 epochs takes about 2 days, so it's a slow process. However, I can tell you this:
Softmax + cross_entropy is much much better than linear regression/cos distance
This might look lame, but here's some sample output. Notice the word diversity!:
_posthumous respectively acute support bleeding association enough pregabalin gluten but hot into : in some of cocaine provides regular medical herrerasaurus review control the widespread coat upper_airway conclusively after glucose such_as linking the number and have cns may nucleus inflammation include respectively psa , patient trauma than occurs vascular evidence medical pseudounipolar dichotomy_between_
But I'm wondering about something a little more novel: changing the design of sequence to sequence.
Before, we were using a repeat vector for each timestep like in the model shown:
model = Sequential()
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables_encoding))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
But now, let's consider that we do _not_ use the repeat vector. Rather, we simply use a TimeDistributedDense layer instead to get the right number of timesteps for our target:
model = Sequential()
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=True))
model.add(TimeDistributedDense(y_maxlen)) #consider adding another timedistributeddense here
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Before trying this for 4 or 5 days, I wanted to get your take on it. Thanks a lot!
I think you would have to have return_sequences=True for the encoding LSTM for this to work (second line).
So, if I understand, the major difference would be that instead of feeding the decoder the final encoding at its every time step, you would be feeding intermediate encodings at every time step. It's hard to say whether it would be better or worse. Presumably the final encoding with return_sequences=False has more info in it than the intermediate ones, and the decoder having access to the final encoding _should_ be better by intuition, but who knows what the truth is. RNNs surprise me often. =)
I think you would have to have return_sequences=True for the encoding LSTM for this to work (second line).
My mistake, you're absolutely right. Thanks for saving me the compile error.
Presumably the final encoding with return_sequences=False has more info in it than the intermediate ones, and the decoder
Why is this necessarily true? For your encoding layer, couldn't you stack more LSTM/TimeDistributedDense layers (in the 'encoding' portion) to make it just as sophisticated? Forgive my lack of understanding if this is an obvious answer.
I just don't like the RepeatVector part of the original model. It seems to me that you're repeating the same vector for each timestep. But wouldn't it be useful if you were giving a _different_ vector for each timestep?
After all, you expect to see a different word at each timestep, so wouldn't it make more sense to give a different vector input for each timestep?
First, let me say that I don't know what actually works or doesn't - the following is my rationale for what I believe should work better.
Let's say you have a sequence of words to encode: 'the cat is dancing'
In the first encoder-decoder type, you get an encoding(the cat is dancing) that is repeated to every step of the first decode layer.
In the second type that you are proposing, you get the following encodings:
encoding(the)
encoding(the cat)
encoding(the cat is)
encoding(the cat is dancing)
But! Presumably encoding(the cat is dancing) has strictly more encoded info than encoding(the cat is) (or the others). So we are providing as much info as possible to the decoding layer's every step. In a way, this lets the decoding layer know about the whole sequence at every step, not just what has been seen up until now. It should be an advantage. Not sure if it actually is one...
@sergeyf what you're saying makes sense. I like that you're providing as much information as possible. I feel kind of foolish for proposing this in the first place.
If my GPU is ever free, I'll just try this idea just for fun.
In the meantime, I'm going to try reversing the input and see if I get better results (like the Google seq to seq paper did). I'm also going to try adding more TimeDistributedDense and LSTM/JZS1 layers. If I get anything better, I'll let you guys know.
Please don't feel foolish! Sometimes things that sound reasonable are wrong. And we have no idea if they are unless we just try random stuff :) That's my favorite way to learn.
Hey guys, I just want to draw some attention to thread https://github.com/fchollet/keras/issues/957. The bottom line is that our outputs are _not_ masked, meaning the cost function is biased.
I think this explains why the results from linear regression tended to be "stretched", i.e. they slowly drifted from word to word. I'm gonna talk to Eder, but his explanations were quite clear to me.
@LeavesBreathe that sounds like a really interesting idea. When I want to test something that takes a long time to run, I run it on a small subset of the data. If that works well, or better than some other approach I am comparing it with on that same small subset, then I try it on the full dataset or a larger subset. I'd advise that here. Smaller data should be easier for it to learn from too, although what it learns won't generalize nearly as well. Hope that helps.
RNN encoder-decoders do take a really long time to converge to good solutions... everybody seems to be reporting that. Another thing: in the final decoder layer you may want an RNN with a readout (the final generated word is sent back to the inner RNN). I wrote a GRU with readout here: https://github.com/EderSantana/seya/blob/master/examples/imdb_readout.py#L53-L67
First of all, I can't believe I'm only finding out about Seya now. It's so exciting to me that you have stateful GRUs and bidirectional RNNs. This is just amazing.
I need to read up a little on readouts to understand exactly what the advantage of this is, but it sounds very exciting. Thanks a lot man.
@simonhughes22 I agree with you. Start small and grow bigger when testing anything major. Though I must say that the Sutskever RNN has much more promise.
C'mon, you didn't know about Seya :D ??? That is the place where I'm cooking things up before I push them to Keras. Some advanced examples that would crowd this repo are also there, like Spatial Transformer Networks and DRAW. If more people use them and make suggestions, we could move them up here to main Keras.
To understand what I mean by readout, see this figure by Cho et al.
See the difference between the encoder and the decoder??? The generated symbol is sent back to the RNN in the decoder.
Ahhhh, I gotcha. So basically, if I understand it correctly, let's say your decoding layer produces sentences. For y1 it produces: "the"
For y2 it produces: "cow"
Since it is a readout GRU, for y3 it sees that you have written "the" and "cow", so it is more likely to pick "jumped" as y3?
Of course, y1, y2, and y3 are each a distribution of probabilities (assuming you're using a softmax), and it would see those probabilities. I usually apply a temperature after the probabilities are produced (so I'm not always picking the highest-probability choice).
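For anyone curious, the temperature trick is just a re-scaling of the softmax output before sampling; a quick numpy sketch:

```python
import numpy as np

def sample_with_temperature(probs, temperature=0.7):
    # temperature < 1 sharpens the distribution, > 1 flattens it;
    # plain argmax is the limit as temperature goes to 0
    logits = np.log(np.asarray(probs) + 1e-8) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return int(np.argmax(np.random.multinomial(1, scaled)))
```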
Anyways, I look forward to the tutorial you mentioned in the other thread. There's just so much to try now!
I was able to have a chat with K. Cho. I believe that the readout is used at test time; the model is trained by 'teacher forcing', i.e. the decoder is fed the true label (from the previous time step) as input, which is replaced by the readout at test time.
Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state -- for example, if you wanted to use the model to score a pair of sequences (and not generate the target sequence).
Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state.
I'm trying to generate text, which is why this is so critical to me. I believe @simonhughes22 is as well.
@gautamb85 I thought I was cheating when I teacher forced during training xD good to know!!!
But note that not all the information needed is present in the hidden state, especially if you are using a deep readout. Several works report feeding back the readout.
Heh. We have had this debate before (on this thread somewhere I think).
The way I think about it is: the input to your decoder RNN (if you don't feed in the prediction) is going to be the summary vector produced by the encoder, at every timestep.
Now, I know for a fact that this does work (at least on simple tasks like numbers -> number strings). However, intuitively, if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.
I am using the approach as a generative model to score a pair of audio sequences, so I can't be confident about the text generation problem.
@simonhughes22 can the sequence-to-sequence model you proposed in keras be used to score two sequences?
I don't think so, because there is no way to input the second sequence. I guess it would need a Graph() model. It was actually easier to just write my own code (not easy, but easier ;) )
@EderSantana nope! You're good :)
Are you doing this with Keras? Would it need a Graph() model?
I wrote my own, but I only have SGD and momentum going, and it would be nice to have fancier optimization.
Thanks for the tip on the readout.
@gautamb85 - use a Y-shaped graph model, or concatenate the two sequences. I suspect that'll be really tough to learn, although hopefully the changes @EderSantana is suggesting will work better. I don't have time to check (busy with work, PhD, and Kaggle :) ).
Yes, I did use a Graph, inputting the "input" and the "teacher" (the input delayed one step), then I pass both to the decoder GRU with merge_mode=concat.
@EderSantana seya looks awesome. Please add to keras!
Off topic, but is anyone coming to NIPS? I say coming since I live in Montreal :)
However, intuitively if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.
This is exactly what I thought when @EderSantana explained the GRU readout. Knowing where it is in the readout would, I think, be incredibly powerful. I'll try to integrate this readout GRU and report back if I get better val loss with text generation. It will take me some time to integrate it correctly.
I'll also graph my results publicly if it helps anyone. I've started graphing here: https://plot.ly/~oxygen123/folder/home
@gautamb85 I've been sitting out for much of this chat, but I'll be at NIPS, we should organize a Keras meetup! We're doing a lot of sequence to sequence stuff as well, would be good to compare notes.
I will also be at NIPS!
@wxs It would really be good to compare notes! I know that another Keras contributor lives in Montreal; I will try to get in touch with him. Maybe let's do a post on the google-group for a meetup?
Ps. Get ready for some serious cold :)
I will be at NIPS as well. I would be interested in the meetup.
Best regards, Mariano
@LeavesBreathe You might want to check out this ipython notebook
https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
He has set up an encoder-decoder and one with an attention mechanism. Note that he is doing it in the same way (not feeding in the prediction), but the attention model can kind of compensate for this.
There is also a toy-problem (text prediction :)) setup.
You would have to learn lasagne (also a really good package), and use more theano.
Wow, that NIPS meetup will be fun. I'll be going home in December, but you guys should write blog posts or something to let us know what happened.
I'll be at NIPS -- I'd love a meetup.
@LeavesBreathe You might want to check out this ipython notebook
https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
This is a really good find. You're right that I would need to learn lasagne, but it may be worth it if it allows more capabilities.
I think for right now, I'm gonna wait for @EderSantana's tutorial and go from there. Keras has been a huge help to me, and I hope I can start to contribute to it. In the meantime, I'm gonna try to start implementing the readout GRU.
Moving NIPS discussion to #962!
I have done a seq2seq implementation, based on http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
https://github.com/farizrahman4u/seq2seq
It has stateful LSTM encoder and decoder, hidden state transfer from encoder to decoder, feedback decoder (output at step t is input at step t+1), depth and all the fancy stuff.
@farizrahman4u I saw your code; it looks pretty interesting. @LeavesBreathe I think you'll want to check that out.
One little thing: I saw that you use two LSTMs, one for the encoder and another for the decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite? Did it give you better results than using a single RNN? Thanks for sharing.
wow @farizrahman4u , I'm overwhelmed. This Keras community is ridiculous. A huge, huge thanks. I have a few questions if you have time:
I see Seq2seq takes input_length and output_length. But if we add a masking layer _before_ the seq2seq layer, can we mask all zeros? This would be for both input _and_ output, the output being really important (so the cost function is not biased):
```
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension, hidden_dim=hidden_variables_encoding,
                  output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (x_maxlen, word2vec_dimension)
model.add(M)
model.add(seq2seq)
model.add(GRU(hidden_variables_decoding, return_sequences=True))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```
One little thing, I saw that you use two LSTMs, an encoder and another one for decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite?
@EderSantana, from my understanding of Google's seq to seq paper, they got the best results using _four_ LSTMs to encode and _four_ LSTMs to decode, giving them a total of 8 LSTMs.
This is why I asked the question above. Ideally, you would want to use a lot of data along with multiple RNNs. The idea is that each one captures more salient features within your data.
Interestingly, one thing that improves my results from previous prediction experiments is to use not just one type of RNN. Instead, for my decoder layers, using something like:
LSTM, JZS1, GRU (in that order)
gives me better results. I always lead with an LSTM because I feel it captures the largest number of features (a rough sketch of what I mean is below). Anyways, my two cents for what it's worth. Like I said earlier, I'll be publicly graphing all my experiments (and labelling them as best as I can) with @EderSantana's and @farizrahman4u's mods on Keras. Tonight, I'm gonna try leading with a bidirectional LSTM and see what happens.
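Here's the rough shape of the decoder stack I mean -- just a sketch with made-up sizes, using the same RepeatVector-style structure as the earlier snippets:

```python
from keras.models import Sequential
from keras.layers.core import Dense, Activation, RepeatVector, TimeDistributedDense
from keras.layers.recurrent import LSTM, GRU, JZS1

x_maxlen, y_maxlen = 30, 30      # example sequence lengths
word2vec_dimension = 128         # example input vector size
y_matrix_axis = 400              # example output size (e.g. the cluster softmax)

model = Sequential()
model.add(LSTM(256, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))
# decoder: mix recurrent layer types, leading with an LSTM
model.add(LSTM(256, return_sequences=True))
model.add(JZS1(256, return_sequences=True))
model.add(GRU(256, return_sequences=True))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```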
Hi @LeavesBreathe. You have to mask only after the seq2seq, not before.
["How are you <EOL> <EOL> <EOL> <EOL> <EOL>",
"I am fine <EOL> <EOL> <EOL> <EOL> <EOL>"]
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Apart from the ready-made Seq2seq model, try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models (e.g. multiple dense layers between encoder and decoder, one-encoder-many-decoders models, etc.). It's fun! @EderSantana In Cho et al, the output from the encoder is fed to the decoder at every time step, and a readout is also present. But in seq2seq, the output from the encoder is fed to the decoder in the first time step only. Also, the hidden state is copied from encoder to decoder. So my model is more similar to seq2seq.
I have added a new decoder: LSTMDecoder2. It is similar to Cho et al: the output from the encoder is fed to the decoder at every time step, along with the output of the previous time step. You may or may not enable hidden state copying when using this decoder. (It should work better when not enabled in the case of a conversational model, as it could remember not only what was said by the human in the previous time steps, but also what it said itself in the previous time steps.)
You can also transfer a trained encoder to a new language pair by copying its weights:
EnglishToFrench = Seq2seq()
EnglishToFrench.compile()
EnglishToFrench.train()
encoder_data = EnglishToFrench.encoder.get_weights()
EnglishToSpanish = Seq2seq()
EnglishToSpanish.encoder.set_weights(encoder_data)
EnglishToSpanish.compile()
EnglishToSpanish.train()
You can also train multiple language pairs simultaneously (Encode in English, decode in other languages):
EnglishEncoder = LSTMEncoder()
FrenchDecoder = LSTMDecoder()
SpanishDecoder = LSTMDecoder()
GermanDecoder = LSTMDecoder()
dense = Dense()
EnglishEncoder.decoders = [FrenchDecoder, SpanishDecoder, GermanDecoder] #Multiple decoders. Wow!
model = Graph()
model.add_input(EnglishEncoder, "english")
model.add_node(dense,"dense", input="english")
model.add_output(FrenchDecoder, "french", input="dense")
model.add_output(SpanishDecoder, "spanish", input="dense")
model.add_output(GermanDecoder, "german", input="dense")
model.compile()
model.train()
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Huh. I always thought you wanted to mask your input, but what you're saying makes sense. I guess the question then is: how do you mask the output given the seq2seq model? Would it be something like this?
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension, hidden_dim=hidden_variables_encoding,
                  output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
model.add(seq2seq)
model.add(Masking(mask_value=0))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models
Yes! I definitely want to do that. From last night, I actually got better results with using strictly bidirectional lstms from seya.
Once I have things working, I'll make a BidirectionalLSTMEncoder and Decoder. I'll either submit a pull request to you or @EderSantana. It will take me at least a few weeks to get there though. I have a lot of matrix setup and testing I need to do.
I have added a new decoder: LSTMDecoder2, it is similar to Cho et al, the output from encoder is fed to the decoder at every time step, along with output of previous time step
This is really cool. I can't wait to try all of this out. I also saw you updated the conversational.py -- Thanks!
Thanks for pointing out that the optimal value is 4 and not 5.
It's not necessarily that the _optimal_ value is 4. I actually went to med school for a while and studied neurology heavily. You'll find that the brain appropriates different numbers of neurons for different tasks (along with different types of neurons), and different types of conversations require different numbers of neurons.
The bottom line is that, for different areas of conversation (or topics), there is a sweet spot in the number of neurons you want, and the brain automatically optimizes to that level. Talking about the weather takes fewer neurons and watts than theoretical physics discussions. This is also why people who speak 4 or 5 languages have more neurons appropriated for language processing and production.
Right now, we do that optimizing manually by trying different numbers of layers/hidden states. So in summary, it's not that 4 is the most optimal for seq to seq; it's that for that dataset, with the task of _translating_ English to French, 4 happens to be optimal. If you tried translating English to Chinese, I bet the optimal depth would be 6 or 7.
Didn't mean to ramble, but all I'm trying to say is: experiment with different depths. Sometimes I get better results with fewer hidden states but _more_ layers. Sorry if I'm beating a dead horse.
@LeavesBreathe Thanks for clearing up the layer depth thing. Regarding masking:
One note: if anybody ever has to mask outputs, use sample_weight, not masking, to choose which values affect the cost function (rough sketch below). In practice it doesn't matter what the network outputs after EOL, so we don't actually need to make it learn to output zeros. But if you are not familiar with sample_weight, do as @farizrahman4u says and just use the model as he suggested.
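A rough sketch of the sample_weight route, with the caveat that the exact argument names (sample_weight_mode='temporal', nb_epoch) are assumptions that depend on your Keras version -- check your compile/fit signatures:

```python
import numpy as np

# y_train: (nb_samples, y_maxlen, vocab_size) one-hot targets; padded timesteps are all zeros
weights = (y_train.sum(axis=-1) != 0).astype('float32')  # (nb_samples, y_maxlen): 0 on padding, 1 elsewhere

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X_train, y_train, batch_size=128, nb_epoch=10,
          sample_weight=weights)
```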
@farizrahman4u Thanks for such a detailed description in regards to masking. I completely understand everything you're saying in regards to input. Fortunately, I do not have any 'out-of-vocab' words, so I won't be masking any input.
Thank you also for clarifying that EOL is a word. And is the first part to be predicted correctly. Makes complete sense.
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
I feel like it's an additional thing your network has to learn. If you're predicting sentences, your network has to learn that there is a length of 50 for each predicted sentence. Suppose the predicted sentence has 30 words. Then the network has to learn to predict EOLs for the remaining 20 timesteps.
I'm wondering if this comes at a cost to the network. It might not, and I might just be worrying about nothing.
At this point, I feel bad asking more questions, so if you don't have time, don't bother addressing below:
There is another important aspect that I'm still cloudy on: transferring hidden states
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
Suppose for the decoder layer we stack 4 LSTMs (depth = [4,4]). Does the hidden state transfer from layer to layer?
If so, what is the difference between transferring the hidden states, and just regularly stacking 4 LSTM Keras layers? Why is transferring the hidden state advantageous?
Thanks a lot again.
Is there any attention-based model example with Keras?
Thank you very much.
@LeavesBreathe
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
Not much.
Then the network has to learn to predict EOL's for the rest of the 20 timesteps.
This is not as tough as it sounds. Your network DOES NOT learn like this:
After 1 EOL, I should output EOL
After 2 EOLs, I should output EOL
After 3 EOLs, I should output EOL
......
After 19 EOLs I should output EOL
Instead, it just learns:
If my previous output is EOL:
    output EOL
else if I do not have anything more to say:
    output EOL
This rule is very simple to learn when compared to the complex stuff your seq2seq model learns, like translation, conversation, etc. So don't worry.
Thanks @farizrahman4u for the clarification. The if statement makes much more sense. I'll get working on this and report back here if I find anything interesting that may help you guys.
I was having some trouble with masking my data a while back, and I was hoping someone could clarify a few things for me before I try a large experiment.
Q1. As my data is a 3D tensor, I pad the zeros at the BOTTOM of each individual feature matrix. Is this ok? (It's a matrix of zeros.)
I know the padding function in keras pads to the right, but in my case it has to be either above or below.
Otherwise I have a masking layer after my input layer with mask value set to (0.)
Thanks
@LeavesBreathe
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
I have added a new function to StatefulRNN, called broadcast_state. So you can send your hidden state from any StatefulRNN to another. For example, here is a Seq2seq model with depth 2. The hidden state of encoder1 is propagated throughout the model.
encoder1 = LSTMEncoder(.........return_sequences=True)
encoder2 = LSTMEncoder(..........)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder3 = LSTMEncoder(.....return_sequences=True)
encoder4 = LSTMEncoder(.....return_sequences=True)
#Connect hidden layers
encoder1.broadcast_state(encoder2)
encoder2.broadcast_state(decoder)
decoder.broadcast_state(encoder3)
encoder3.broadcast_state(encoder4)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(encoder2)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder3)
seq2seq.add(encoder4)
I will be updating Seq2seq and Conversational shortly, stay tuned!
Wow Fariz... seriously man, you're awesome. This broadcast_state will be incredibly useful, and I definitely hope the main Keras gets it eventually.
I am a little confused as to the seq2seq model you built in your snippet of code.
You go from encoder -> dense -> decoder.
How are you going from the 2D output of the dense layer to the 3D input required by the decoder? I would have thought you need a RepeatVector for it, but it is better to transfer hidden states instead.
Also: small technicality. For your encoder1 and decoder, you should have return_sequences = True, correct?
In the weeks to come, I really hope to give back to you (and everyone else) by testing a ton of seq2seq models. Hopefully I can give you some insight as I imagine you are doing seq to seq as well.
@gautamb85, I'm not familiar with mfcc, but couldn't you simply do a variation of what @farizrahman4u suggested to me? Couldn't you, instead of masking, place some sort of 'SILENT' token where the zeros are? In this way, you don't have to worry about affecting the seq2seq model or its output data.
@LeavesBreathe hmm, worth a shot, but it's trickier. I can't add a symbol; it has to be a floating point number (I hate real-valued data, why didn't I do NLP, lol). I'll keep you posted if something works.
@farizrahman4u First off.. stellar job on the seq2seq model!! :)
Q. Can I use the model to SCORE a pair of sequences?
In Cho's original paper, they use the model to re-score english-french translation pairs.
So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x)
Doing something like this would not give me the correct data likelihood, will it? (As I want the log likelihood and not the negative log likelihood, I would multiply by -1.)
objective_score = -1*model.evaluate(X_test, Y_test, batch_size=32)
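Alternatively, I guess I could pull the per-timestep probabilities out of predict and sum the logs myself. A rough sketch of what I have in mind (assuming a softmax over the vocabulary at each timestep, and Y_test one-hot with all-zero rows for padding):
import numpy as np

def log_likelihood(model, X_test, Y_test):
    # model.predict is assumed to return (N, timesteps, vocab) softmax outputs
    probs = model.predict(X_test)
    token_probs = (probs * Y_test).sum(axis=-1)      # probability of the true token, (N, timesteps)
    mask = Y_test.sum(axis=-1) > 0                   # ignore padded timesteps
    token_probs = np.where(mask, token_probs, 1.0)   # log(1) contributes 0
    return np.log(token_probs + 1e-12).sum(axis=-1)  # (N,) per-pair log p(y|x)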
I can't add a symbol, it has to be a floating point number
It doesn't necessarily have to be a symbol; it can be a real number, just keep it consistent. Suppose you use "3" as your silent token. Assuming you're doing speech to text, your model will learn that 3 is associated with the silent token output.
As an aside, I would choose a real number that is far away from the other numbers that represent your data. That way, it is treated as a completely separate entity from the rest of your data. As an FYI, this is how the brain does it in your auditory cortex: it represents silence as a certain value, and you consciously recognize that value as silence. This is why conditions like chronic tinnitus can't be cured: the brain never hears the "silence" value and continues to output the "ringing-in-my-ears" value.
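Something like this is what I have in mind (a sketch; the value 3.0 is arbitrary, just keep it consistent and well outside your feature range):
import numpy as np

SILENCE = 3.0   # arbitrary "silence" value, far from the real data range

def pad_with_silence(x, maxlen):
    # pad a (timesteps, feat_dim) feature matrix with constant silence frames
    out = np.full((maxlen, x.shape[1]), SILENCE, dtype=x.dtype)
    out[:x.shape[0]] = x
    return out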
@LeavesBreathe
How are you going from the 2D output of the dense layer to the 3d input required by the decoder?
The decoder's input is 2D (Even though it inherits from LSTM). The output from each time step then becomes the input for the next time step.
Also: Small technicality. For your encoder1 and decoder , you should have return_sequences = True correct?
A decoder always has return_sequences = True by default. And yes, return_sequences = True for encoder1.
@LeavesBreathe I have added a new class called DeepLSTM. It has built in hidden state propagation.
Example:
deep = DeepLSTM(input_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
Notice the inner_return_sequences argument is False, which means the inner LSTMs will behave like Cho's encoder-decoder (a RepeatVector in between non-sequence-returning RNNs).
On a side note, you can also use broadcast_state to send the hidden state from one LSTM to multiple LSTMs (simply pass them as a list).
@farizrahman4u If you get a chance to answer my question, I would be most grateful :)
(pasted from above)
Q. Can I use the model to SCORE a pair of sequences?
In Cho's original paper, they use the model to re-score english-french translation pairs.
So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x)
Doing something like this would not give me the correct data likelihood, will it? (As I want the log likelihood and not the negative log likelihood, I would multiply by -1.)
@gautamb85, don't mean to post over yours, but I do want to get back to Fariz.
@farizrahman4u ...I'm just running out of ways to compliment and thank you.
The output from each time step then becomes the input for the next time step.
Clever. Really clever.
It's great that you can list LSTMs with broadcast_state -- this is gold.
So the idea with DeepLSTM is that it saves you lines of code, right? You _could_ technically build the deep LSTM with the code snippet you gave above using broadcast_state, correct?
To incorporate the DeepLSTM in the seq2seq model, would it be something like this?
encoder1 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
#Connect hidden layers
encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder2)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')
@LeavesBreathe @EderSantana @farizrahman4u @gautamb85 this is really interesting. I've had some work to do so missed the party somewhat. Can everyone post the model and library that they end up with as being the optimal approach for their dataset when all's said and done, and hopefully issue some pull requests to keras so we can get all this in one place? Great work guys
Can everyone post the model that they end up with as being the optimal approach when all's said and done
Glad you're back. I will post my best models along with the training graph as they come in! All my graphs are here: https://plot.ly/~oxygen123/folder/home
@LeavesBreathe Yes, saving lines of code is the idea. And it does all the RepeatVector stuff automatically. Also, for encoder1, return_sequences=False. The depth of encoder2 should be 3, so that for decoding you have 1 decoder + 3 encoders = 4 layers deep.
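Putting those corrections together, your snippet would look roughly like this (same classes and placeholder arguments as in your version; just a sketch, not tested):
encoder1 = DeepLSTM(input_dim=100, output_dim=100, depth=4, return_sequences=False,
                    inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, output_dim=100, depth=3, return_sequences=True,
                    inner_return_sequences=False, remember_state=False, batch_size=32)
#Connect hidden layers
encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder2)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')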
@gautamb85 I saw your comment just now. I will come up with a detailed answer + code in a few hours.
@farizrahman4u You Sir, are a deep learning angel ! :)
I did have a question related to a previous post. I think I read/saw that you are using 2 LSTMs in the encoder.
The idea behind this is a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.
So if the lower RNN is fed words, its last time step would represent a sentence. Consequently, the last timestep of the upper LSTM would encode the whole document.
Of course, as you mentioned, many seq2seq architectures can be experimented with.
PS. It's really late and I might be imagining all this. If that is the case, please ignore.
@farizrahman4u I don't know if this helps, but I can confirm (for sure) at least in Cho et al.'s approach, that the decoder is trained by teacher forcing (though I guess you don't have to). That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction.
If we want to score a pair of sequences, we still have the true labels at test time, so those can be used. I am not sure if you need the readout or if it helps in this case, since you have the true label.
@farizrahman4u You Sir, are a deep learning angel ! :)
Amen
That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction.
Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled.
I guess what I'm asking is: how else would you do this? You have to do teacher forcing?
The idea behind this is a hierarchical RNN? I saw a paper recently, so the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.
I believe the whole idea is this:
You start off with a basic encoder LSTM and decoder LSTM:
words --> Embedding --> Encoder LSTM --> Dense --> Decoder LSTM --> TimeDistributedDense--> Softmax
However, to make this neural net capture more salient features, we add _another_ encoding level after the decoder:
words --> Embedding --> Encoder1 LSTM --> Dense --> Decoder LSTM --> Encoder2 LSTM --> TimeDistributedDense--> Softmax
Lastly, we want to ensure that our encoder1 and our encoder2 are big enough to capture all levels of abstraction, so we add _multiple_ layers of LSTMs _within_ encoder1 and encoder2:
words --> Embedding --> Encoder1 LSTM (4 LSTMs) --> Dense --> Decoder LSTM (1 LSTM) --> Encoder2 LSTM (3 LSTMs) --> TimeDistributedDense--> Softmax
As a side note, there are Dense + RepeatVector layers in between _each_ of the LSTMs within the encoder1 and encoder2 levels.
All hidden states are transferred (propagated) from each LSTM to the next. To do this, you need Fariz's broadcast_state, so you can't do this with Keras alone. I did not include broadcast_state below because it would take up too much space. All laid out, it looks like this:
model = Sequential()
#Encoder 1 layer
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
#Decoder Layer
model.add(LSTMDecoder2(.........))
#Encoder 2 Layer -- notice the change from x_sent_len to y_sent_len
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = True))
#softmax
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Anyone feel free to correct me if I'm wrong. I know this might be overkill, but I figure being as clear as possible is best so we're all on the same page.
@LeavesBreathe
Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled.
I guess what I'm asking is: how else would you do this? You have to do teacher forcing?
The idea is that a RNN is being used as a generative model of your data. So I want to find the log likelihood (negative cross entropy) of seq-2 given seq-1 or p(seq-2 | seq-1). So its like I have the correct label (say it was produced by some other system) and I want to score how good it is.
Yes, you would teacher force, like a language model. Say you have a sequence and a probability model P(seq) (given by the RNN); if you wanted to find how 'likely' the sequence is given the model, you would feed in the current time-step and ask it to predict the next one. You wouldn't (perhaps additionally) feed the prediction to the next step, because you might end up finding the likelihood of a different sequence. (I'm not sure about that last sentence, lol.)
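In terms of data prep, teacher forcing is just a shift of the target sequence (a toy numpy sketch; the token ids are made up, with 0 = padding, 1 = start symbol, 2 = end symbol):
import numpy as np

y = np.array([[1, 45, 12, 99, 2, 0, 0]])   # one target sequence: <start> w w w <end> <pad> <pad>

decoder_input = y[:, :-1]    # what the decoder is fed at each step (the TRUE previous token)
decoder_target = y[:, 1:]    # what it is asked to predict (the next token)
# For generation at test time you would replace decoder_input[t] with the model's
# own prediction from step t-1; for scoring a given pair you keep the true tokens.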
I need to look at the code again, but this model seems a little strange. Not to mention, I might be wrong.
So the decoder produces an output at every time-step, which is being fed to a new encoder. This guy will/should encode a sequence of predictions produced by the decoder into a single vector (if it works as a standard encoder).
Though a standard encoder produces a single output, so connecting it to a time-distributed layer without a repeatVector should break your code (which I assume it doesn't, so something else must be going on)
If you are connecting an encoder to a timedistributed dense layer, it is not really an 'encoder' as it must be producing an output at every time-step to feed to the dense layer.
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = True))
model.add(Dense(hidden_variables_encoding))
I am not sure if this is an 'encoder' per se, as the last return_sequences = True.
I think the idea is to get a single vector encoding of the sequence
Hey @gautamb85, I think this is a good discussion, though it would require a lot of typing to communicate back and forth. Do you want to Skype chat for a bit if you're free? I think talking this out would be easier, and we can post back here when we have a conclusion. My username is leavesbreathe.
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
My error. You want a final vector at the end of the encoder, so setting return_sequences=False makes complete sense.
Fariz, if you want to join the skype chat, feel free to!
@gautamb85 I haven't tried Cho's rescoring thing. Can you please explain it to me: what is your input, what is the output, what are you trying to optimize, etc.?
@LeavesBreathe You two skype and later comment your conclusions here for us. I have some work pending, #964, #928, and documentation of #893. So pretty busy:)
Sounds good -- I'll skype with @gautamb85 (if he's cool with that), and we'll get back as to what we agree is the most optimal model. I'll test that model tonight if I can get everything setup.
@farizrahman4u
Let's say I have an English-French translation pair. You would encode the English sentence (input 1).
For the decoder, in Cho's paper (appendix), the GRU non-linearity gets 3 inputs:
- the context/encoding produced by the encoder (and its associated weight matrix)
- The french sentence (which at test time is replaced by the prediction)
- The hidden activation from the previous time-step.
1. Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written.
You can refer to https://github.com/dagbldr/dagbldr; he has a conditional_GRU layer vs a standard GRU_layer.
2. Though a workaround might be to just use the encoding to initialize the decoder recurrence (like in Sutskever's paper)
3. If one wants to use the model to generate translations, then the prediction is needed at test time. If the model is to be used for scoring, we just need to evaluate the log-likelihood / negative cost
- the optimization criteria etc. are exactly the same.
- I had an idea of how to do this using a graph() model, but the problem was getting an additional input to a GRU/LSTM layer. Plus I couldn't initialize the decoder recurrence as I didn't have a stateful RNN. So I can try that out.
Does a graph() model make sense for what I described? I'll post some pseudo code in a little while.
@LeavesBreathe Sure, Skype sounds good. Today is kinda busy, but I might be available later at night (if you are a late sleeper). Tomorrow evening should be cool. I'm in Montreal, hoping you're in an easy-to-coordinate time zone :)
I'm in Cincinnati, so we have the same time zone. I have some things I've got to do tonight, but I'm free now until 5, or 9pm to 11pm tonight. Whatever works best for you. Just add me on Skype and we can figure out a time. Sometimes I step away from my desk for a while, but if we schedule a time, I'll be sure to be online then.
@gautamb85
Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written
This is already done using the LSTMDecoder2 layer in seq2seq. The output from encoder is repeatedly input to the decoder at every time step.
@farizrahman4u Oh sweet! I didn't look at decoder2 yet.
So I would do a graph() model sort of like this:
input-1 -> encoder
input-2 -> LSTMDecoder2
Does a graph() architecture make sense? I would need it to feed in the second sequence, no?
I'll try it out and report back, but yes, it should solve the problem.
Q. I noticed that you have a dense layer between stacked LSTM in the encoder, and also between the encoder and decoder. Why do you do it this way, is it shape related?
Yes, the Dense in between the encoder and decoder is to make the shapes compatible. There is no Dense in between the LSTM stack layers, btw (it's only there in @LeavesBreathe's comment). There are RepeatVectors in between the stack layers, again for shape compatibility.
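The shape bookkeeping is just this (a sketch against a recent-ish Keras; the hidden sizes and sequence length are made up, and argument spellings may differ in older versions):
from keras.models import Sequential
from keras.layers.core import Dense, RepeatVector
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(LSTM(256, input_shape=(30, 128), return_sequences=False))  # (batch, 30, 128) -> (batch, 256)
model.add(Dense(256))                                                # stays 2D: (batch, 256)
model.add(RepeatVector(30))                                          # back to 3D: (batch, 30, 256)
model.add(LSTM(256, return_sequences=True))                          # (batch, 30, 256)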
@gautamb85 This could be done fairly easily. All you have to do is add an additional input to LSTMDecoder2, so that at a given time step it will have 3 inputs:
_Note: French word embedding size should be same as dimension of context vector from the encoder._
Now lets see some pseudo code:
french = Sequential()
french.add(Embedding(...))
english = Sequential()#this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., return_sequence=False))
#Cheat keras..make it think it's a single input
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
#Decoder
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
#optionally
english.broadcast_state(decoder)
model.compile(...)
I will post code for LSTMDecoder3 shortly!
Done!
import theano
import theano.tensor as T
from seq2seq.lstm_decoder import LSTMDecoder2

class LSTMDecoder3(LSTMDecoder2):
    def _step(self, si, sf, sc, so,
              x_tm1,
              h_tm1, c_tm1, v,
              u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
        #Inputs = output from previous time step, vector from encoder, french sentence
        xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
        xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
        xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
        xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
        i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
        c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
        h_t = o_t * self.activation(c_t)
        x_t = T.dot(h_t, w_x) + b_x
        return x_t, h_t, c_t

    def get_output(self, train=False):
        ip = self.get_input(train)
        v = ip[0]   #English context vector from encoder
        S = ip[1:]  #French sentence
        si = T.dot(S, self.S_i)
        sf = T.dot(S, self.S_f)
        sc = T.dot(S, self.S_c)
        so = T.dot(S, self.S_o)
        [outputs, hidden_states, cell_states], updates = theano.scan(
            self._step,
            sequences=[si, sf, sc, so],
            outputs_info=[v, self.h, self.c],
            non_sequences=[v, self.U_i, self.U_f, self.U_o, self.U_c,
                           self.W_i, self.W_f, self.W_c, self.W_o,
                           self.W_x, self.V_i, self.V_f, self.V_c,
                           self.V_o, self.b_i, self.b_f, self.b_c,
                           self.b_o, self.b_x],
            truncate_gradient=self.truncate_gradient)
        if self.state_input is None and self.remember_state:
            self.updates = ((self.h, hidden_states[-1]), (self.c, cell_states[-1]))
        for o in self.state_outputs:
            o.updates = ((o.h, hidden_states[-1]), (o.c, cell_states[-1]))
        return outputs

    def set_params(self):
        super(LSTMDecoder3, self).set_params()
        dim = self.input_dim
        hdim = self.hidden_dim
        self.S_i = self.init((dim, hdim))
        self.S_f = self.init((dim, hdim))
        self.S_c = self.init((dim, hdim))
        self.S_o = self.init((dim, hdim))
        self.params += [self.S_i, self.S_c, self.S_f, self.S_o]

    def build(self):
        self.set_params()
        self._build()
Might have typos/indentation issues because I am typing this on my phone and can not test it right now.
I will test it out.
A couple of questions:
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
Q. So you are concatenating the encoder output and the embedding for the french word into a single vector. So decoder3 takes this guy (concatenated vector) as the new additional input ?
It is also getting the encoding explicitly at every time-step (as was the case for lstmdecoder2)?
Q. I don't get why you need to concatenate them.
Can't it be -
french = Sequential()
french.add(Embedding(...))
english = Sequential()#this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., output_dim=xdim, return_sequences=False))
(maybe a reshape is needed over here)
Q. Can't I get the output from DeepLSTM, like: context = output from Deep_LSTM,
and then do model.add(context) instead of model.add(merge)?
Unless it's not easy to get the layer output:
decoder = LSTMDecoder3()
model = Sequential()
model.add(context)
model.add(LSTMDecoder3(....))
PS. You typed this on your phone?
Mind = Blown :)
@gautamb85
PS. You typed this on your phone?
Most of it is copy-pasted from LSTMDecoder2.
. So you are concatenating the encoder output and the embedding for the french word into a single vector.
No. I am concatenating the encoder output and the French SENTENCE (NOT WORD) into a single matrix(not vector)
If the French sentence was [word1, word2, word3, word4], after merging it with the context vector it would look like : [context, word1, word2, word3, word4]
So decoder3 takes this guy (concatenated vector) as the new additional input ?
No. This guy is THE input, not an additional one. Technically, the number of inputs for LSTMDecoder2 and LSTMDecoder3 is the same (which is 2). But logically, LSTMDecoder3 has one extra input (the merged input could be seen as 2 inputs).
It is also getting the encoding explicitly at every time-step (as was the case for lstmdecoder2)?
YES
Q. I don't get why you need to concatenate them
- We are packing the inputs for the decoder into a single tensor. The decoder then separates them out.
- If your sentence pair is ["how are you", "comment allez-vous"], your merged guy will look like this: [f("how are you"), e("comment"), e("allez"), e("vous")]. Here f("how are you") is your context vector that you get from the encoder.
Let's analyze the input to the decoder and its output for 4 time steps:
Time1: x1 = LSTM(context, word1, context)
Time2: x2 = LSTM(context, word2, x1)
Time3: x3 = LSTM(context, word3, x2)
Time4: x4 = LSTM(context, word4, x3)
Hope this helps.
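Shape-wise, the merge is nothing more than prepending the context vector to the embedded French sentence (numpy sketch with made-up sizes):
import numpy as np

embedding_dim = 64
context = np.random.randn(embedding_dim)        # f("how are you"), from the encoder
french = np.random.randn(3, embedding_dim)      # e("comment"), e("allez"), e("vous")

merged = np.vstack([context[None, :], french])  # (4, embedding_dim): [context, word1, word2, word3]
# Inside LSTMDecoder3, row 0 is peeled off as the context vector (v) and the
# remaining rows are treated as the French sentence (S).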
In your code, you are not using your french model at all !! Am I missing something?
My bad with the disconnected french model. (Also typed on my phone. Lol)
I see. Yeah, that's why you concatenate them, since the context is getting replaced after the first timestep.
I thought it was like this: you concatenate so that the context is there at every step. Is that correct?
Thanks again.
I will update you when I get something going.
@gautamb85 I think we are talking about slightly different models. Can you give me an example of your x_train, y_train etc?
@farizrahman4u I will have to get back to you on that in a little while.
From your code of LSTMdecoder3:
xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
where v is the context vector from the encoder and v_i is the weight matrix (for input gate only) and x_tm1 is the prediction from the previous time-step.
Could I replace the prediction x_tm1 with the actual french word, or alternatively add a term T.dot(x_t, W), where x_t is the current french word? Which I guess is why you need to concatenate the context and the french sentence.
But I think what you are suggesting with the architecture (the concatenation) should achieve the same thing.
I need more clarity on what we are trying to do here. Let's start with what your training data will look like. I will then pick the best model for you.
I eventually want to use the model with speech.
So my training data would be pairs of recordings that are padded to the same length. It would be a 3D matrix (N_samp, maxlen, feat_dim) representing a mini-batch of N_samp examples (corresponding to seq-1), and a similar matrix corresponding to seq-2.
Now this model can be trained by teacher forcing, i.e. instead of feeding the prediction (as you do in LSTMDecoder2, I believe) you can feed in the true label from the previous time-step.
This is how Cho et al. trained their models, both for generating translations and for scoring pairs of translations. In the generation case, you need to replace the true label with the prediction, as you don't have the true label.
However, in the scoring case, you can use the model as is (if it is trained by teacher forcing), since we have access to both seq-1 and seq-2.
This paper is pretty interesting: http://arxiv.org/pdf/1511.01432.pdf
I gotta read it some more to fully understand it, but looks like there's even more we can implement.
Fariz, I'm having some difficulty with getting your classes to work. I'm gonna try a few variations first, but if I can't get any of them working, I'll report back here tomorrow afternoon or so.
@farizrahman4u
Thank you for your decoder code. Is your decoder code an attention model?
@tttwwy No. It's just a stateful LSTM with readout and hidden state broadcasting.
@LeavesBreathe Please open an issue in seq2seq for any problem you are facing. Try recloning. I just made an update.
@farizrahman4u I will be sure to open an issue on your seq2seq. Give me at least a full day, as I'm testing a lot of variations/debugging before I come to you with the final problem.
@tttwwy I agree with you that attention is very important. I'm working on some code that I hope to implement in two to three weeks to address this issue. Maybe add on to Fariz's seq to seq model so we all have one working model.
@gautamb85 did you want to still chat tonight? I think it would be good to share each other's ideas with each other. Doesn't have to be long. I can't pm you over github, so if you can just add me on skype and we can figure out a time. I'm free all day today.
Hey, sorry for the late reply. Are you free at 10-10:30 Eastern time?
@gautamb85 @LeavesBreathe @farizrahman4u have you seen this: http://www.tensorflow.org/tutorials/seq2seq/index.md? Google just open sourced a deep learning toolkit with a graphical interface. It includes a sequence to sequence model. I am unreasonably excited. It has a Python interface and an attentional model, something I've really wanted and needed for my research.
@simonhughes22 since you are interested in models besides Keras, have you seen blocks-examples? They have a machine translation model with attention working out of the box for en-cs. https://github.com/mila-udem/blocks-examples
Slightly off topic:
@fchollet tweeted that Keras will seamlessly support both Theano and TensorFlow. Does this mean that Keras models could run on Android? Because TensorFlow has an Android example. In the meantime, is there any way to get a Keras model to work on Android as of now? Has anyone tried it (like turning off all the C++ stuff and running Theano in pure Python mode)?
@gautamb85, sorry but we had a power outage yesterday -- internet is still out, but hopefully we can chat tonight. Do you mind adding me on Skype (username is leavesbreathe) so that we don't need to take up space on this thread to schedule talking?
Hey guys, so I pretty much spent the entire day reading up on TensorFlow. I think the bottom line is that it has more capabilities (an attention mechanism), but it is much messier than Keras. So basically, I've decided to try out TensorFlow, but I still want to use Keras (as I like Keras's community and logic flow more).
With all of that being said, I think it would be interesting to compare results using Keras versus TensorFlow. I hope to have TF up in the next week or so to see what type of results I'm getting.
@EderSantana @simonhughes22 @melonista @LeavesBreathe
there is an attention-based NMT model which may be of some help to you:
https://github.com/kyunghyuncho/dl4mt-material/tree/master/session3
@LeavesBreathe TensorFlow code is messy compared to Keras. It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features. We even have a working implementation of a Neural Turing Machine! (#990). We are the first to open source it.
It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features.
I totally agree with you. Keras is much cleaner and easier to contribute to (I don't even think TF is allowing PRs).
However, I want to at least try it for a little bit, because there may be a few things I learn from TF that we can implement in Keras. For example, they provide an attention mechanism here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/seq2seq.py#L453-520
Btw, I was looking at the Neural Turing Machine that Eder wrote earlier and it looks so cool.
I agree with @farizrahman4u!!! Keras is much easier to program, because we are focused on offering higher-level APIs.
To be more precise, I think we are the first to open source an NTM with RNN controllers. Others had implementations with feedforward controllers (which are less powerful).
For example, here is a simple LSTM classifying MNIST (running row by row) in TensorFlow: https://github.com/EderSantana/TwistedFate/blob/master/mnist_lstm.py It is fast to start running, but we have to hard-code all the dimensions (fixed batch size, fixed sequence length, etc.).
@LeavesBreathe Sorry for not getting back to you. I don't have a Skype account; I will set it up and add you over the weekend.
@farizrahman4u I had a question about your code, specifically relating to prediction feedback:
def _step(self, si, sf, sc, so,
          x_tm1,
          h_tm1, c_tm1, v,
          u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
    #Inputs = output from previous time step, vector from encoder, french sentence
    xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
    xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
    xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
    xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
    i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
    f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
    c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
    o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
    h_t = o_t * self.activation(c_t)
    x_t = T.dot(h_t, w_x) + b_x
    return x_t, h_t, c_t
Q. In that code snippet, x_t is the prediction (which is getting fed back via scan) and it is initialized as v (the context produced by the encoder), correct?
Q. I am confused because, if this were regression, then x_t would represent the actual prediction of the model. However, for classification, this x_t would get fed to a softmax function, and then we would sample/argmax to get the actual prediction.
Is it equivalent to feed back just x_t (without doing softmax etc.), and does it work the same way at test time? I mean, at test time the x_t (before softmax) is fed back as the 'prediction'; however, the actual (visible) prediction is made by feeding these hidden predictions to a softmax layer after the theano scan is done.
Q. I assume the get_outputs function that (returns the outputs) feeds them to a dense (softmax) layer that makes the prediction ?
Ps. I know a lot of that may not sound clear, and I am happy to clarify.
@gautamb85 no problem no rush -- I don't know if its necessary that we talk, but if you want to chat, I think it would be good! add me whenever you want!
@gautamb85 x_tm1 is the output from the previous timestep (with initial value v); it need not be the actual prediction of the model at that timestep (which is y_tm1, because that is difficult to access). Still, x_tm1 is a good representation of y_tm1. The above layer is a general layer, like the default LSTM in Keras. Whether it should do regression or classification is up to you: you simply stack activation layers over it. That being said, try doing a sigmoid/tanh over x_t and see if you find anything interesting.
Hey guys, I'm back from exploring TensorFlow, and I'm fired up to keep working on @farizrahman4u's seq2seq. I'm having a few issues, Fariz, which I will post to your seq2seq channel. However, here are a few takeaways I got from TensorFlow:
@LeavesBreathe I am sort of offline right now. But as my seq2seq repo has gotten more attention than I anticipated, I will be spending more time on it once I am done with my upcoming exams :)
Best of luck with the exams -- I must say that your seq2seq repo has gotten much attention from my skype contacts -- they keep asking me about it and are constantly comparing it to tensorflow's seq2seq. I think you and Tensorflow have the best working seq2seq models right now.
@LeavesBreathe
. I think you and Tensorflow have the best working seq2seq models right now.
Thats a huge compliment. Thanks!
Regarding the attention mechanism, I will be converting the following project to Keras:
https://github.com/npow/RNN-EM
The API will be similar to that of an LSTM, so just replacing all LSTMs with the RNN_EM class would give you a seq2seq model with an attention mechanism.
That's gonna be killer. Beyond that, the only major feature that I see that tensorflow has is a sampled_softmax, but I'm trying to work on a hierarchical softmax right now. It will definitely take me a while as it has been attempted already in Keras.
@LeavesBreathe excited for that: would be a very useful addition.
Hey Guys just as another update, I dabbled more with TensorFlow, and I am getting some pretty good results with it. The lowest perplexity I've gotten has been 31. Though with Fariz's Seq2Seq, I got a perplexity of 45 (and that's with no attention) so there is plenty of optimization still left to do!
I'm also working on a generic attention decoder, but I'm waiting for multiple-input support on Graph rather than using a concat; I feel that it's cleaner. I'm organizing a hackathon on attention models this Saturday, so I'll keep you posted. If you guys have any suggestions, that would be appreciated.
Good to hear. My attention has shifted more to TensorFlow lately, and I'm trying to implement curriculum learning right now. It's really hard for me to determine whether TensorFlow or Keras is a better platform; I think they each have their strengths.
Fariz's platform is really good as well. Right now, because of attention, I am still getting better results in TensorFlow.
Also, Fariz. I recently had a chat with kkastner and he mentioned a possible improvement with averaging the hidden states in the decoder. I'm going to test that over the next two weeks (each run takes like 6 days). If I get improved results, I'll report back here on this thread as this is something pretty easy to implement.
@farizrahman4u @LeavesBreathe thanks for a very illuminating discussion here. Would you guys talk a bit about the incoming data into the seq2seq or deepLSTMs that you're using? I'm a bit unclear right now about how to model my data in order to do the predictions so far.
@viksit there are a lot of ways you can input your data into seq2seq. The most basic is one-hotting, where each word is assigned its own index (a one-hot vector over the vocabulary). Then you can add what's called an embedding layer, which compresses that sparse vector into a dense, lower-dimensional representation. If you scroll up this thread, @simonhughes22 helped me understand this.
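A tiny example of what I mean (made-up vocabulary size and dimensions; the Embedding call mirrors the snippets earlier in this thread):
from keras.models import Sequential
from keras.layers.embeddings import Embedding

vocab_size, embedding_dim = 10000, 64
model = Sequential()
# input: (batch, timesteps) integer word indices
# output: (batch, timesteps, embedding_dim) dense vectors
model.add(Embedding(vocab_size, embedding_dim))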
Ah, I'm already using custom-trained word vectors. Some thoughts: I'm using a toy dataset here. I did see @simonhughes22's comment above, and rather than "book-ending" my sentences, I'm simply using end-of-sentence markers.
What is your name?
My name is X.
How old are you?
I am Y years old.
I then transform this data into,
X_train,y_train
What is your name?<EOS>,My name is X.<EOS>
How old are you?<EOS>,I am Y years old.<EOS>
An end-to-end example would actually be quite helpful to infer some of the details.
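Here is roughly how I am turning those pairs into arrays at the moment (a sketch; the vocabulary handling is deliberately minimal and <EOS> just gets its own index):
import numpy as np

pairs = [("What is your name ? <EOS>", "My name is X . <EOS>"),
         ("How old are you ? <EOS>",   "I am Y years old . <EOS>")]

# build a shared word -> index map (0 reserved for padding)
vocab = {w for q, a in pairs for w in (q + " " + a).split()}
word2idx = {w: i + 1 for i, w in enumerate(sorted(vocab))}

def encode(sentence, maxlen):
    ids = [word2idx[w] for w in sentence.split()]
    return ids + [0] * (maxlen - len(ids))   # pad with zeros on the right

maxlen = max(len(s.split()) for q, a in pairs for s in (q, a))
X_train = np.array([encode(q, maxlen) for q, a in pairs])
y_train = np.array([encode(a, maxlen) for q, a in pairs])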
@viksit, lately I have been working with TensorFlow. Not saying that Fariz's platform is bad or anything, but TF has just given me a better understanding of what's going on.
As such, I don't mask anything. I simply pad with 0's at the end for both input and output.
Each word is a timestep, so I'm not quite sure what you're asking.
When you say build... do you mean compile? About 10 mins. To train? About 5 days, depending on data size and compute power. For me, with two 980 Tis, it takes about 5 days for a network with 1 million samples.
I don't have an end to end example, but you might want to check out this: http://www.tensorflow.org/tutorials/seq2seq/index.html
Hi everyone,
I have made the following updates to my seq2seq implementation:
- SimpleSeq2seq model, which uses only pure Keras, to serve as a performance baseline against which other models can be compared.
- Recurrent API.
Thanks a lot Fariz, I'm really swamped right now, but this is nice to have for sure. Really appreciate the help.
Cool!
I have been following this thread. I am not exactly doing text generation or machine translation. @farizrahman4u @LeavesBreathe @simonhughes22 your insights were really useful. @LeavesBreathe I would like to know if TensorFlow gave you good results. I am going to try out @farizrahman4u's model as well, so I just wanted to know if TensorFlow worked better. I would also like a suggestion on whether something like a CNN for feature extraction followed by an LSTM encoder-decoder would work. I am working at the character level, not the word level. Thanks a lot in advance.
@pralav I will say TensorFlow is really good, but it is incredibly slow and hogs a ton of memory as of right now. They have plans to improve, but be prepared to spend a lot of money on graphics cards if you plan on using TensorFlow.
Integrating CNNs with LSTMs is something that could potentially lead to interesting results. You have to be careful how you structure it, though.
@LeavesBreathe Thanks a lot. I will hopefully report back once I experiment with more models. Thank you.
@LeavesBreathe So what do you suggest using instead of TensorFlow for NLP tasks, to help minimize the memory hogging?
Well, you can wait until TensorFlow improves its memory-hogging problems, buy more GPUs, or simply switch to Theano, which is much faster and doesn't hog the memory nearly as much as of right now =/
Hi @farizrahman4u ,
In this post https://github.com/nicolas-ivanov/debug_seq2seq it is mentioned that your seq2seq implementation is not performing well, and perhaps not generating sequences properly. Have you looked into those examples? Were your recent commits, especially the attention modules, meant to fix some of the issues mentioned there? Or have there been any other tests like this which show good results with your implementation, perhaps a benchmark against TensorFlow?
Thanks
Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right?
Isn't it possible to just skip the input_length parameter and give just the input_dim? Or input_shape=(None, input_dim)?
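For example, something like this (a sketch; whether it builds and trains cleanly depends on the Keras version, and each batch still needs a consistent length within itself):
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import TimeDistributedDense

input_dim, vocab_size = 128, 5000
model = Sequential()
# the timesteps dimension is left as None, so different batches can have different lengths
model.add(LSTM(64, input_shape=(None, input_dim), return_sequences=True))
model.add(TimeDistributedDense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')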
@NickShahML: don't people now use an encoder-decoder architecture for sequence-to-sequence learning?