Assume we are trying to learn a sequence-to-sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no option to pass a mask to the objective function. Won't this bias the cost function?
We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values?
You could use a mask to hide your padded values from the network, and then discard the masked values in your sequence output. Currently masking is only supported via an initial Embedding layer, though. See: http://keras.io/layers/recurrent/
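For reference, a minimal sketch of that Embedding-based masking, assuming the 0.x-era Keras API used elsewhere in this thread and that max_features, embedding_size and hidden_size are already defined:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU

model = Sequential()
# mask_zero=True makes downstream layers skip timesteps whose input id is 0,
# so 0 has to be reserved for padding and never used as a real word id
model.add(Embedding(max_features, embedding_size, mask_zero=True))
model.add(GRU(embedding_size, hidden_size, return_sequences=True))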
I'm a little new to recurrent networks. When Eder talked about the sequence-to-sequence map, it only reminded me of the char-level LSTM (http://karpathy.github.io/2015/05/21/rnn-effectiveness/). In this case, even if we can discard the masked values in the sequence output, the padded values still have an effect on the parameters of the model itself. So is it enough to just discard the masked values? Again, as Eder has asked, won't this bias the cost function?
Maybe this issue is of your interest #382
This worked for me. Padding the inputs and then the outputs, and adding special sequence start and sequence stop symbols to book-end each sequence, then the following model structure:
embedding_size = 64
hidden_size = 512
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size))
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))
model.compile(loss='mse', optimizer='adam')
If you have a sequence stop symbol, it should learn when to stop outputting non-zero values, and will output zeros thereafter. It may not be ideal, but it works within the current framework. I also tried replicating the output to the maxlen width (including the stop symbol) during training, and then just took the first valid sequence at test time.
btw, JZS1 is an RNN like a GRU or an LSTM, each of which could be used here instead in both the encoder and decoder.
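A rough sketch of that preprocessing (the reserved ids, MAX_LEN, and the input_sequences / output_sequences lists are illustrative assumptions): reserve 0 for padding, book-end with start/stop markers, then pad to a fixed length with Keras's pad_sequences helper.
from keras.preprocessing.sequence import pad_sequences

PAD, START, STOP = 0, 1, 2      # reserved symbols; real word ids start at 3
MAX_LEN = 20

def book_end(seq):
    # wrap a sequence of word ids with the start/stop markers
    return [START] + list(seq) + [STOP]

X = pad_sequences([book_end(s) for s in input_sequences], maxlen=MAX_LEN)
Y = pad_sequences([book_end(s) for s in output_sequences], maxlen=MAX_LEN)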
Sounds like a good idea, but note that you are forcing your model to learn something other than your original problem.
I think we can solve this problem with a masking layer after each regular layer, plus allowing the loss to be a custom function. That way, instead of averaging the cost with .mean(), we divide it by the number of non-zero elements in each time series. This indiscriminate averaging is where a lot of the bias comes from as well.
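To illustrate the idea (plain NumPy only, not the actual Keras objective API): average the per-timestep error over the non-padded positions of each sequence instead of over everything.
import numpy as np

def masked_mse(y_true, y_pred):
    # mask is 1 where the target timestep contains any non-zero value
    mask = (np.abs(y_true).sum(axis=-1) > 0).astype('float32')        # (samples, timesteps)
    per_step = ((y_pred - y_true) ** 2).mean(axis=-1)                 # (samples, timesteps)
    # divide each sequence's summed error by its number of non-padded timesteps
    per_seq = (per_step * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)
    return per_seq.mean()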
@fchollet is there any chance that we can get "loss" to check if its input is callable? I really didn't want to have to add new stuff to my code repo every time I needed something custom made. I can write a PR for that if there is a general interest. Let me know.
PR #446 is relevant here. Now all we need is for the cost functions to use that mask in their calculations.
@simonhughes22 Hello, I have been working with your code snippet (from a previous discussion, with example/toy data). While the code works, I am confused if it is doing what it is supposed to. I am new to Keras and deep learning, so please do bear with me.
As far as I understand, the idea is to have an 'encoder' process sequence X, and after the last time step of the sequence is processed, a 'decoder' starts to predict the new sequence Y.
In the code you provided, how does the model know that sequence X is complete and now it should start predicting Y?
@gautamb85 the inputs are padded; the encoder RNN simply goes along the entire input, updating its hidden state accordingly until it hits the end of the array, and then outputs a vector. That is then fed into the decoder. Btw, they just added masking to the loss function (see the recent commit history), so I'd make sure you are masking the loss function (you'll have to dig through the code or documentation to figure that out).
@simonhughes22 Thanks for the reply. I had a couple more questions.
I would greatly appreciate clarification regarding these points.
@simonhughes22 @gautamb85 Did you guys try out the cost function masking proposed by #451?
I just mentioned this over in #451, but can't you use the sample_weight parameter to fit() and pass in 0 weight to the meaningless outputs?
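A hedged illustration of that suggestion: build per-timestep weights that are zero at padded positions and pass them to fit(). Whether per-timestep weights require sample_weight_mode='temporal' in compile() depends on the Keras version, so treat this as a sketch rather than a recipe (Y_ids is assumed to be the zero-padded integer target matrix, Y_onehot its one-hot version).
import numpy as np

weights = (Y_ids != 0).astype('float32')       # (samples, timesteps): 0.0 at padding, 1.0 elsewhere

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')   # needed for per-timestep weights in later Keras versions
model.fit(X, Y_onehot, sample_weight=weights)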
Hey,
I'm new to keras and I have a simple question:
Why do you use the mse objective in model.compile(loss='mse', optimizer='adam')?
Wouldn't it be more appropriate to use categorical_crossentropy, since you are using softmax in model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))?
@benjaminklein for a while there was an issue in theano + keras with binary cross entropy and this kind of model, so I used MSE instead. Now I use binary cross entropy, as I have a multi-class, multi-label classification problem (each word can have 0 to many classes). If you have a more vanilla multi-class problem then yes, categorical_crossentropy should work better. However, I haven't tested that on a TimeDistributedDense output, so you'd need to verify it is expecting a single category per output token, not across all tokens: the output is 3D, not 2D, so you have len(tokens) categorical cross entropy calculations. Binary cross entropy just works per label; it doesn't compute a distribution over labels, so for that the output shape doesn't matter.
@simonhughes22 Thank you! Also could we change your code to use Graph instead of Sequential such that we'll have two inputs. One for the first sequence and another one for the second sequence?
The motivation is that in many papers about encoder-decoders, the decoding phase uses both the last hidden layer from the first sequence and the previous word of the second sequence. In your code we are only using the last hidden layer from the previous sequence.
@EderSantana @simonhughes22 Have not yet had a chance to try it. Will let you know as soon as I do.
@fchollet I have been reading some keras posts about 'stateful' RNNs. If I understand correctly, the hidden state of the recurrent layer is reset for every time_step of a sequence. (this appears to be the case based on outputs_info for various classes in recurrent.py)
@gautamb85 I thought the issue is that it is reset for every row of input, meaning you can't test it easily by feeding its output predictions back in as inputs (you can, but you are re-feeding the entire input sequence plus the latest prediction for every subsequent time step, which is a little slower than if it were stateful). If the RNN reset itself between timesteps it wouldn't work; AFAIK it maintains state across a row (a sequence of timesteps) but resets itself once the row is processed. You can make it take its output as input as I described, it's just slower than models that allow you to remember the state following a prediction.
Note for the example above, I am reading in one sequence, converting to a hidden state, and then predicting a whole second sequence, so you don't have the issue I mention here. However, you may get better results by training a model to predict the next word instead of the next sentence, and feeding each predicted word in as input to predict the next word to generate a sentence.
@benjaminklein as it's predicting a full sequence as output, it is remembering the previous word as it predicts the output sequence, by retaining its hidden state across the input sequence. What is repeated is the encoded representation of the entire previous sentence, but for each word it is predicting, it is also feeding in the hidden state from the previous word, as that's how RNNs work. What you are describing is a slightly different type of model where you are predicting the next word and not the next sentence, and then adding that to the existing input and making a new prediction. You can do this with keras too very easily: just remove the last 3 layers from the model above and train it to predict the next word (or character). You'll have to write a bit of code to feed the output back in as input though.
@simonhughes22 Thank you for the clarification. That was my impression. (How could it even work if it was being reset between time steps?) I am still a little confused about testing it in generative mode.
You touched on exactly what I am trying to do, and I am hoping you could help me on some of my concepts.
Taking the machine translation problem in Sutskever's paper as an example, first an English sentence (sequence 1) is converted to a hidden state, then the 'decoder' starts to predict each word of the French sentence (sequence 2). Thus the translated sentence is generated word by word. Is this correct?
In this case, the 'decoder' is essentially a conditional language model (word level), conditioned on the English sentence, i.e. sequence 1. Thus the target for a given timestep is the input for the next one.
For training such a model, after reading in sequence 1, the decoder is provided with the 'true' French word at each timestep (not the prediction). At test time (as you mentioned), in order to feed the prediction back in to predict the next French word, the entire sequence (English + French?) would need to be read in order to predict the next word.
If all that is correct, would I need a graph structure? A pseudo code / flowchart description of the network architecture would be much appreciated.
Thanks in advance!
@gautamb85 No, you can use the model I listed above with the English sentence as input and the entire French sentence as output. The RNN will maintain state across each timestep as it predicts the output sentence, no extra work required on your behalf. You will however need to one-hot encode and zero pad the output sequence (the French sentence) and have it do a softmax over all possible words for the output at each time step. The ys then are 3D: each row is a matrix whose height is the number of French words and whose width is the number of time steps.
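A NumPy sketch of building those 3D targets, using the (samples, timesteps, vocab) ordering that later posts in this thread note Keras expects (Y_ids is assumed to be a zero-padded (samples, timesteps) array of word ids, with 0 reserved for padding):
import numpy as np

def to_one_hot_3d(Y_ids, vocab_size):
    n_samples, n_steps = Y_ids.shape
    Y = np.zeros((n_samples, n_steps, vocab_size), dtype='float32')
    for i in range(n_samples):
        for t in range(n_steps):
            # padded timesteps become a one-hot on index 0, the reserved padding symbol
            Y[i, t, Y_ids[i, t]] = 1.0
    return Y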
When I mentioned feeding the output as input, that's only if you want to train a language model to predict a word at a time, and then use that to generate text as a generative model. However, as in the example above, you can have it generate whole sequences for you from each input sequence.
@simonhughes22 I have a model training (graphemes to phonemes). I am writing a beam search to see if it's actually learning something useful.
In Sutskever's paper, the decoder is described as a conditional language model (conditioned on the previous sentence). In the model you proposed, what is the input to the decoder RNN at each time step?
If I wanted to design a model where, at every time step of the decoder, the input is the target of the previous timestep (and the output sequence is generated word by word), what modifications to the model would I need to make?
@gautamb85 I think you may be misunderstanding how that model is built. Each time step is a word (although it could be a character, a phoneme or whatever). However, each input row is an entire sequence, zero-padded to the left to the length of the longest sequence, and each output row is also a sequence, zero padded.
If you want a more traditional RNN like model, look at the Passage repo, however, that has less functionality and can't be used for tagging models (unless you predict word by word as described). When I say tagging models, I mean a one to one mapping of word or character to a tag.
@simonhughes22 When running your code I'm getting:
/usr/local/lib/python2.7/dist-packages/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
rval = __import__(module_name, {}, {}, [module_name])
Any ideas?
Thank you!
It's a warning.... I've had that before, but I haven't noticed any issues. Theano is very hard to debug, so I am not able to figure out if it is serious, but every model that gave me that warning was still able to learn effectively on the data. Maybe @fchollet can shed more light on it.
@simonhughes22 Firstly, thank you again for all your help so far. I was confused about the model as I thought that training was being done by teacher forcing, i.e. the actual targets were being fed to the decoder at each time step. I guess the model could be trained that way, but it would not generalize as well.
I am trying to replicate the approach (standard sequence to sequence task) from here : http://arxiv.org/abs/1506.00196
I took your model and replaced the last layer:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size, mask_zero=True))
model.add(GRU(embedding_size, hidden_size))
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(GRU(hidden_size, hidden_size, return_sequences=True))
model.add(Dense(hidden_size, phn_output))
model.add(Activation('softmax'))
To get the prediction at each time step.
I was able to train the model. However, after a certain number of training iterations the loss started oscillating (not sure how to interpret / prevent that), even though for the first few iterations the accuracy and loss on the validation set were behaving properly.
I think it may have to do with the size of my hidden layers etc. I also only took 30000 sequences for training.
I am now trying to figure out how to do a beam search to find the best phonetic transcription of a test grapheme sequence.
I am a little confused on how to go about setting this up. Any advice would be much appreciated.
@gautamb85 if you want to combine this approach with a re-ranking solution (for instance using beam search), I would have the model just output the probabilities over the phonemes for every time step. You don't need to feed the predictions in at each step: for each input, the model will output a probability distribution over every phoneme for every time step. You can then use that table of probabilities to do something like beam search (although I'd recommend also trying a dynamic programming approach - see how CRF models work; you can fit this model into that framework). However, as the model is keeping track of its previous prediction across each time step, this should not be necessary. Note that for each input, the model will iterate over each timestep and output a prediction, giving you a list of predictions per timestep without you needing to feed them back in. However, doing so may or may not improve matters.
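A rough beam-search sketch over that per-timestep probability table (probs is assumed to be the (timesteps, n_symbols) softmax output for one input sequence). Note that with purely independent per-step probabilities this reduces to a per-step argmax; beam search only pays off once you add a sequence-level term (e.g. a phoneme language model or transition scores), which is where the CRF / dynamic programming comparison above comes in.
import numpy as np

def beam_search(probs, beam_width=5):
    beams = [([], 0.0)]                        # (partial symbol sequence, log-probability)
    for t in range(probs.shape[0]):
        candidates = []
        for seq, score in beams:
            # only expand with the top beam_width symbols at this timestep
            for sym in np.argsort(probs[t])[::-1][:beam_width]:
                candidates.append((seq + [int(sym)], score + np.log(probs[t, sym] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                         # best-scoring symbol sequence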
In terms of the oscillating errors, that normally means the learning rate is too high and the model is having trouble converging. I'd advise using something like adam or adagrad to optimize, as these approaches are very good at setting and adjusting the learning rate for you, and I've never had an unstable model with them. That said, I've hit points where the test set performance oscillates, and that often means the accuracy may not improve much further. At that point the training accuracy is normally still improving, but the model is starting to overfit.
The oscillating may be due to dropout also. I'd advise not using dropout until you've got a model that does very well on the training data, then experimenting with it to reduce over-fitting. My dataset is pretty noisy, which I think is regularizing the model to some degree, so dropout seemed to hurt more than help, but normally it is advantageous. First you want to get your model to overfit very well on the training data; that confirms the model can learn effectively on the data. Then start to address over-fitting.
Hi Simon,
You seem to have had success with a sequence to sequence NLP task. I'm struggling and wondered if you could help.
I have a data set of sentences, which are sequences of Socher's pre-trained word vectors. These sequences map 1-to-1 to another series of vectors for each sentence, which I have calculated to describe some aspects of the sentences. These vectors are such that their values add to one. For instance, an example mapping would be,
wordvectors["headache"] (50 dim vector) -->> 20 dim vector, [0.0, ..., 0.0, 0.25, 0.25, 0.25, 0.25, 0.0... 0,0]
I am training on a small subset of my data (10,000 points) to get things working, but I cannot overfit even with this small data. In fact, I cannot even get accuracy above 0.2. For instance, running a training epoch on a version of the model you posted above (with the embedding layer removed, as my inputs are already vector embeddings),
embedding_size = 50
hidden_size = 512
output_size = 20
maxlen = 60
model = Sequential()
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="softmax"))
model.compile(loss='mse', optimizer='adam')
I have padded my inputs and outputs with zeros to fit the maxlen dimension, but I have not added these "stop symbols" you wrote about. What kind of vector would I be adding as a stop symbol? Are they essential to train correctly?
Training with the model above does yield a set of probabilities for each word, so that they sum to one, but the Keras accuracy measure will not go past 0.15. I don't think it is able to fit the data I have.
What do you think is my bottleneck for training towards these vectors? Any ideas?
Thanks!
@cjmcmurtrie Sorry for the late reply, but I've gotten a lot of requests on this posting, and I've also been out of the country. Note that while I've gotten it to learn something, I haven't tried using it to solve any problems, and I had to run it for a long time to get it to learn something useful. So it may not be the best approach for your particular problem (which I don't know enough about to suggest alternatives, although you could try the skip-thought sentence vectors from a pretty recent paper: https://github.com/ryankiros/skip-thoughts).
Regarding the above - the stop symbol in my approach was just a special word, which would then get mapped to its own embedding via the embedding layer, and the network would learn that this meant to stop outputting symbols. Sending in a zero-length vector might be sufficient for this, or some randomly initialized vector. Why are you using Socher's vectors? Those are likely fine-tuned for sentiment analysis (or are you using the GloVe vectors?). You are probably going to get the best performance by using a graph model and combining fixed vectors from the GloVe or word2vec vectors (the latter I've had some success with using RNNs in keras) with an embedding layer that's able to learn its own vectors. I haven't had a chance to try that yet, but in theory that should work best from what I've read. In keras that would be achieved by having one input layer read hard-coded, fixed, pre-trained vectors, and merging that (concat) with an embedding layer where it can learn its own vectors.
Also, if you are learning a 1-1 mapping, then I would use a different model. The model above is meant for cases where you have different length inputs and outputs. If you have a tagging problem, where the length of the inputs matches the outputs, you can use a much simpler model:
model = Sequential()
model.add(JZS1(embedding_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="sigmoid"))
model.compile(loss='mse', optimizer='adam')
You might have to check that some of the sizes are correct, I just wrote that. If you are mapping to a second sequence of vectors that should work fine. Note that the output is 3D (#rows, vector_len, max sequence length), and you'll want to think about an appropriate loss function. If it's mapping to another set of vectors, mse should be fine. However, if you have some target symbols, you would be better off mapping to those and using a softmax with categorical cross entropy. This model is much simpler and should work better (and I have gotten a similar model to do quite well on a supervised task).
Get that working then experiment with the Y-shaped model where you merge pre-trained fixed vectors with ones the model can learn from scratch.
Firstly, this thread has been a major help for me. Thank you everyone! Shout out to @simonhughes22
One thing I have read extensively about (at least for NLP) is that for your input, you want to reverse the order of your sequence. You can read more here: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
For example, if you are coding in "the dog barked loudly" as an input vector, it is better to write in "loudly barked dog the". This may help learning. I think the idea is that the next sentence that you predict has more to do with the END of the input sentence rather than the BEGINNING of the input sentence.
Forgive me as I am a beginner, but I almost always use 200- to 300-dimensional vectors for Word2Vec. Fifty or 20 dimensions for a word seems incredibly low to me. Thoughts on this? I am aware of the curse of dimensionality.
My goal is to predict the sentence given the first sentence:
Input: Give First Sentence as Sequence --> Output: Yield next sentence as sequence
Here are a few more questions:
Question 1: We do not need to worry about the output mask on the compile layer, correct? Following this thread: https://github.com/fchollet/keras/pull/451, it seems that fchollet made the output layer automatically masked (if we initially mask the embedding layer). Therefore, if we pad the y output with zeros, we should be good, correct?
Question 2: I really struggle with formatting the y_train data. Right now, I cluster my words and assign unique ids to each word within each cluster. Therefore, each word has two numbers associated with it. (I clustered the words by applying Word2Vec plus k-means -- more info here: https://redd.it/3psqil.)
I know it's been mentioned that y_train is formatted in 3 dimensions, but how is this possible? Currently this is what I input as my 2D X_train into the embedding layer in Keras. I'm hoping to do something similar for the y_train as well:
Dimension 1: number of samples
Dimension 2: number of timesteps
12 3 4 5 6 3 0 0 0 0 0
6 5 4 23 3 5 1 4 2 0 0
Dimension 3: one-hot encode each of the integers? I just thought there would be a more efficient way to do this. Or another thought is that you guys are using the third dimension to vectorize each word? This would explain why you're not doing 200 or 300 word dimensions but rather 20 or 50? But then the RNN would have to predict vectors? This is what confuses me.
Question 3: If we are indeed masking the zeros, then why do we need to tell the sequence where it stops and starts? I don't mind appending a "sentence start" and a "sentence end" number to each sequence. I just don't understand _why_ we should do this if we are indeed masking?
For reference, this is my current model:
max_features = (len(chars)) #this number (always an integer) is either a cluster number or a word id number, both of them have a maximum of ~400
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=512, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_variables))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(max_features, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
@LeavesBreathe I've actually found smaller vectors to work better (64 and 128 - note the binary sizes; I think this helps performance by assisting theano with copying chunks of data to and from the GPU), likely as that's fewer parameters to learn, and too many can make learning hard for a neural network. Cross validation would help you determine a good size; try varying it in magnitude (32, 64, 128, 256) rather than on a linear scale, as the relationship between these values and performance tends to be more exponential than linear in nature.
Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this. Another common strategy is to mirror the inputs, so you duplicate the input in reverse too, allowing the network to get the best of both worlds. There's a name for that approach, but it escapes me right now. Again, be wary of the zero padding: you'd want it on the outside of the mirrored input, not the middle, I believe, otherwise the RNN hidden state may get reset once it hits the zeros, depending on what it learns to do in this situation. That may not happen, you'd have to see.
Qu 1. - I haven't played with this, but AFAIK if you zero pad your y's you should be good. You seem to be writing a lot of custom code. Keras has utilities for zero padding, as well as for determining ids for words, so I'd rely on those rather than hand rolling it all, as they've likely been well tested by either unit tests or the community, and are known to work with the package.
Qu 2. The number and format of the Y's for a number of these networks is, I think, one of the biggest sources of confusion when doing deep learning with one of these libraries, although I think that's more to do with the complexity of the problem; keras makes it about as easy as it could be. The output is 3D - (number of samples, size of each vector, length of the sequence). This may seem confusing, but think about it like this. Using the time distributed dense above, we are making a prediction for each time step. For each time step you are predicting a word, so that's a one-hot vector of length == |Vocab|. But we have 'max sequence length' number of time steps. So the dimensions have to be the number of rows, the number of word labels to choose from at each time step, and finally the number of time steps (max sequence length). Anything less would not be possible, as for each row you are predicting a word for each time step (which is the length of the output sentence in this case). HTH. Put another way, say you have a vocab of 10k words and the max sentence length is 100. For each row, you need to make 100 predictions (some zero if less than the max sequence length), and for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k x 100, so overall it's 3D.
Qu 3. - You don't absolutely need the start and stop symbols. However, when you are doing sequential prediction, each prediction for each timestep is conditioned on the previous inputs and outputs. When you are at the start of the sequence, you have no previous inputs to condition on. The distribution of start symbols is not random, however: certain words or labels tend to be more likely at the start of a sentence, such as the word 'The' or 'However', and so do their labels, such as POS tags if that's what you are trying to predict. It's unlikely a sentence will start with a noun, such as 'antelope'. The same is also true for the last word in a sentence or label in a sequence. Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences. It's late for me, so I hope I am making sense. The other reason is that with this sort of sequential prediction, as you have a fixed-size sentence with zero-padding, you need to help the model know when to start outputting zeros. Without outputting the stop character I was seeing it have problems learning this: it would just keep repeating the same word sequence, or always output zeros. LSTMs and variants are stateful, and so this sort of signal can help them learn to switch states.
Looking at your model structure, I'd recommend keeping it simple and having one LSTM in the decoding layer. Although I've heard of people having success with stacking RNNs, I haven't personally found that to do better; it usually under- or over-fits for me when I try it. Technically you also don't need the Activation layer, as you can specify the activation function in the dense layer, but use whatever is easier to read and understand.
I tend to find better performance from the simpler GRU and JZS1 RNNs than from LSTMs. These models are a little simpler than the LSTM.
First of all, a huge thanks for all the details. This response had a lot of useful insights.
I've actually found smaller vectors to work better (64 and 128 - note the binary sizes, I think this helps performance by assisting theano with copying chunks of data to and from GPU)
That's a useful tidbit.
Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this
Which side would you normally pad the zeros? From my experience, I also pad the zeros on the _right_ side (as does keras's pad sequence function). When you reverse the input, are you suggesting you pad on the left side? What would be the advantage of that if the network expects to always receive input in reverse order?
Let me clarify that when I say reverse the input, you reverse _all_ of the inputs you give it. You never give the network the input in its original order. Therefore, the network always expects to receive the input in reverse order. I kinda think of it as you reading a book in reverse, and learning to read in reverse. You're always fed books that have words in reverse order.
Interesting that there's a mirroring strategy, but I feel that it would confuse the network. I'll go digging for some papers on it and understand it better.
Regarding Qu1:
I'm actually not writing any custom code for Keras. I do my own significant pre-processing to cluster the words and assign each word a unique integer id within the cluster. This allows me to get an 80k vocab down to ~400 different integers, where it takes two integers to represent each word (a cluster id and a word id). I chain the cluster id and word id together. So to represent 15 words, it takes a string of 30 numbers.
But I do heed your suggestion! Keras code is usually much more durable than custom code. I go with Keras code when possible.
Regarding Qu2:
This was my biggest question, so I appreciate all of your details.
for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k X 100, so it's 3D.
Good to know it has to be 3D; in retrospect, I feel a little foolish for asking that. So to clarify, let's suppose you have a vocab of 10k words. Are you saying that you must one-hot this 10k vocab? That seems super inefficient to me.
I guess what I'm proposing is that instead of one-hotting, you give an integer and then apply some sort of embedding (much like you did in your embedding layer for x_train). I'm cool with doing one-hot, but it just seems inefficient computationally and RAM-wise.
The output is 3d - (number of samples, size of each vector, length of the sequence).
Not to be a detail jerk, but from Keras's docs, I've seen the usual _order_ to be:
(x: number of samples, y: number of timesteps, z: vector size of softmax output)
So in our case:
(x: number of samples, y: max number of words per sentence, z: one-hot of the 10k words)
Regarding Qu 3:
Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences.
This makes a lot more sense. It gives the network a clear understanding of when there is a start and a stop. I will definitely be adding these in.
This is late for me, so I hope I am making sense.
You're making a lot more sense than textbooks.
Although i've heard of people having success with stacking RNN's, I haven't personally found that to do better, it usually under or overfits for me when I try that.
I'll definitely start with just one RNN for my decoding layer. One thing I've learned from my experience is that stacking RNNs does perform significantly better, but _only_ when you have big data. Many of my experiments have shown that to me. Many Google papers tend to use these big nets as well.
I plan on training on at least 30 million samples (sentences). If I did it on 3 million or less, I could see a single layer performing better. If you're doing better with just one LSTM, that's a red flag to me that you don't have enough data, IMHO.
I do have one more question from what you've written:
Qu 4: Why are you vectorizing words for your y_train if you're one hotting them in the end?
I apologize if this is an obvious answer, but as I understand it, you're one-hotting each of your words. If you have a 10k vocab, why don't you assign an integer per word, and then one hot each of the words respectively? How are you incorporating these 32 or 64 length word vectors?
I'm sorry if I really missed the boat on this one. I use word vectors to cluster words into groups as I mentioned above. But in the end, I assign each word an integer (word id). From that integer, I can then one hot. Wouldn't you all be doing the same? I'm talking strictly about the y_train data.
Apologies for the essay response, but this is such a good conversation. If you're interested in Skype chatting anytime, my Skype name is the same as my username here. Hopefully I can help you at least a little bit in return for the help you have given me.
I think you'd want it left padded. Which is what I thought keras does, but I don't have time to check. The important point is to pad from the same side regardless of whether the input is reversed or not. I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative. That way you have a baseline you can compare against to ensure you are improving things iteratively.
The reason I think you want to left pad is how this works. You have an encoding layer and a decoding layer. The encoding layer processes the input left to right to produce a vector representation of the entire sentence. That vector is then replicated (just due to how keras is built) so that the input is replicated for every predicted output label. Then the decoder runs an RNN over that repeated encoding and its internal state (which has a feedback loop) to produce the output. So if your input is right padded, the right-most tokens are zeros, which normally causes the network to reset its state, so your encoding is not great. I could be wrong on this, but that is my understanding. It's been a few months, so I could be missing something. Let me know if keras right pads, as I am not overriding this.
Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters. That way you can have a 300-element encoding without the need for a one-hot representation. That's how most people handle this problem these days: they use a language model to learn word embeddings. You don't even need to do that yourself unless you have very domain-specific data; you can use the ones Google pre-trained on a massive corpus. That actually worked better for me than learning my own embeddings, although the optimal strategy is a combination of pre-trained vectors and tuning with supervised embeddings.
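A hedged sketch of that setup with gensim (the model path, maxlen and the tokenized_sentences list are illustrative assumptions): look the pre-trained vectors up yourself and build the 3D (samples, timesteps, vector_size) input you would feed the RNN directly, with no Embedding layer.
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec.load('my_domain_word2vec.model')    # or a model trained on your own corpus
dim = w2v.vector_size
maxlen = 60

def sentence_to_matrix(tokens):
    mat = np.zeros((maxlen, dim), dtype='float32')          # zero rows act as left padding
    vecs = [w2v[w] for w in tokens[-maxlen:] if w in w2v]   # skip out-of-vocabulary words
    if vecs:
        mat[-len(vecs):] = np.array(vecs)
    return mat

X = np.array([sentence_to_matrix(s) for s in tokenized_sentences])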
Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that. The word2vec embeddings and similar are taken from a neural network like model that is trained on a one-hot encoding, the embeddings are the weights from each of the inputs to the hidden layer, as I understand it. There are many variants on this, but that's the basic idea. You can emulate that using a dense vector before you one hot the output. That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
Regarding the ordering- I think keras' changed somewhat from when I last used this (before you specified named dimensions). Make sure it matches whatever the docs say and you'll be good.
That's cool about stacking. I've gotten really good performance from small data on this, which is very unusual. The more model params, the more powerful the model, and so by adding more layers you are giving it more degrees of freedom. If you have big data then you can take advantage of a much larger model. I'd start small and simple though, get reasonable performance and then start making it more complex. Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.
I think I discussed the last question above. To be honest, for performance I'd have it predict the pre-trained word2vec vectors as outputs. Again, in my domain of smallish data, my vocab size is actually small enough that doing a one-hot is not an issue for me. The best thing from the literature is probably a hierarchical softmax, but you'd have to hand roll that.
Gensim's word2vec runs on theano now I think (at least I've been getting theano errors from it, so I am assuming so). You may be able to take their hierarchical softmax and plug it into keras. If you get that working, please submit a pull request and give back to the community.
I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative.
Words of wisdom. I'll do regular input first.
Let me know if keras right pads as I am not overriding this.
Keras does right pad. I can understand, as it's been a while for you. I've run it several times, and it always pads on the right. To clarify, it places the zeros on the right side. Again, I appreciate the detailed explanation.
Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters.
In the end, I think you're right. The reason why I did the clustering is that my words are very specific to my domain. Pretrained vectors aren't very good at high-level biology and astronomy terms (my interest).
By clustering, I essentially train the net to know that certain terms are related. The word id within each cluster is ordered by word frequency. So the most frequent words in that specific cluster are used more often. Thus, if the net has to guess, it will guess a word id of 0, or 1, and it will choose a word that is more used.
When strictly talking about the x_train input, I do not do any one-hotting. I simply submit each word with its cluster id first, followed by its word id. Thus my 2D x input looks like this:
[34, 2, 45, 0, 23, 1, 34, 4, 45, 3]
These ten numbers represent 5 words. In this way, the maximum integer used is ~400 (400 clusters with at most 400 words per cluster). Therefore, when I do a softmax, I only have to do it over 400 options. This was the whole motivation behind clustering words. This 2D x input is fed into the embed layer.
The meaning of the 2nd and 3rd dimensions only matter if you are using certain loss functions (like categorical cross-entropy) that do a soft-max like operation over all classes,
Yes, I plan on using categorical cross-entropy which is why I asked. Good to know!
Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.
Ahh I see. To add to the stacking idea, have you tried using a series of dense layers afterwards? They sometimes improve performance without the need for more data.
You've probably already considered this, but have you tried looking into unsupervised learning so that you can acquire far larger data sets? No idea what you're doing, but from my experience: always go with unsupervised if you can get at least 100x more data.
Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that.
That's what I figured. The line you said about the dense layer is really good to know.
That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
I'll stick with the word clustering strategy for now so that the softmax is down to 400 choices. I read some threads on hierarchical softmax in keras, but it seems a bit painful to implement.
If you get that working, please submit a pull request and give back to the community.
I definitely want to be as helpful as I can to you guys. It will take me at least a week or two to fully implement these ideas. If I find anything interesting, I'll report my findings on this thread so that hopefully you guys can benefit from it. I'll also be experimenting with using different learning rates and adding dense layers at the end of the decoder. If I do any keras modifying, I'll be sure to submit a pull request!
Instead of the clustering I'd try training word2vec or GloVe (from Stanford) on your biology/astronomy dataset, if it's large enough. It doesn't work well on small data; on my small PhD dataset it didn't perform well. But for work I have a much larger dataset, and the vectors learned there were very good, and that's a very domain-specific dataset - I wasn't able to use the pre-trained vectors for that either. Then you can use those vectors, if they're any good, as your target outputs. You can ask word2vec for the top 10 matching terms; running that for some important keywords in your domain will tell you if it's worked well. Make sure you train for more than one iteration though - one is the default and it sucked for me - but at 10-20 iterations I was getting good results.
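A sketch of that with gensim (parameter names may vary by gensim version; tokenized_sentences is assumed to be a list of token lists from your corpus): train on your own domain text, run several passes, then sanity-check the nearest neighbours of a few key domain terms. gensim.models.Phrases can be run beforehand to merge common bigrams/trigrams into single tokens, per the phrase-extraction point below.
from gensim.models import Word2Vec

model = Word2Vec(tokenized_sentences,
                 size=128,        # embedding dimensionality
                 window=5,
                 min_count=5,
                 iter=15)         # several passes over the corpus, per the advice above
print(model.most_similar('pulsar', topn=10))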
I corrected my comment about the meaning of the 2nd and 3rd dimensions; it does matter in this case, I believe, as we're using an RNN to generate the output sequentially, so the dimension that refers to time is important. Apologies.
In terms of unsupervised learning, yeah, I think it's very useful. Using the pre-trained Google word2vec vectors has helped somewhat, and that is unsupervised, at least in the manner I'm using it (it wasn't trained for the purposes I am using it for). It's supervised in the sense that it's training a language model, but if you are using it for some other task, then I'd argue that in that context it's not. That's the easiest thing I can do right now in this regard, and the top 10 similar terms when I query that model are very good, even for my specific domain (science essays). You can think of it as a form of soft clustering, where each element in the 300-dimensional vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic. It's also important in word2vec to extract common phrases and treat those as words; I've found that helps tremendously. I've used a variant of association rules (applying the downward closure principle) to build a common phrase extractor to detect commonly occurring phrases in my domain.
Instead of the clustering i'd try trainining word2vec or GLOVE (from Stanford) on your biology astronomy dataset, if it's large enough.
Yes, I completely agree with you that the word vectors do indeed beat the clustering. I think the main reason I didn't do the word vectors initially is that I simply don't know how to incorporate them into the input for my x_train.
At this point, I just feel so bad because you have already helped me a ton, and I haven't given much to you. So if you're busy, no need to reply to what I'm saying below:
I guess what confused me about each word is that there are 300 numbers (for a 300 length vector), so how do you input 300 numbers on a 2d input for your x_train? I assumed you were using an embedding layer. But even if you do use an embedding layer, the vectors are not integers. Keras requests integers to be used in the embedding layer: http://keras.io/layers/embeddings/#embedding.
The clustering I was doing was nice because it gave me two integers per word, so I could embed them really easily (and understand what I'm doing! But the clustering strategy feels so amateur.)
Don't get me wrong, I know how to use word2vec and glove to generate word vectors for each word. I just didn't know how to format the inputs into the x_train or for that matter, the y_train either. Let me try to do some reading to figure it out.
But for work I have a much larger dataset and the vectors learned there were very good,
Yes, I have experienced the same when I use word2vec on my training data. When I ask for the top 10 matching terms, they are incredibly close. For example:
["neutron_star", "pulsar", "neutrons", "high_density", "high_rotation", "aftermath"]
Make sure you train more than one iteration though, that's the default and that sucked for me, but at 10-20 iterations, I was getting good results.
You are referring to the Word2vec training correct? Not the actual seq to seq model? If so, I didn't even think of this and I _should_ do this regardless!
You can think of it as a form of soft clustering, where each element in the 300 dimension vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic.
Yes, this is what lead me to the idea of doing a k-means cluster in the first place. But like I said, inputting direct word vectors into the model does beat the clustering idea.
It's also important in word2vec to extract common phrases and treat those as words, i've found that helps tremendously.
Yes, I do bi-gram, tri-gram, and quad-gram, and stop there. I figure penta-gram is just overkill, but heck I might try it sometime. But I have had major success with using the phrase extractor as well. Always good to hear that someone else is doing the same thing as you.
As a side note, something that helped cut down on extraneous words was pyenchant https://pythonhosted.org/pyenchant/api/enchant.html. Basic idea is to make your input text a list of words, and fix spelling errors (or recorrect words that shouldn't belong). First tokenize all the words of your input into a list using nltk. I don't know if this will be of any use to you, but if can help you, I'll feel better!
Here's my code. I apologize as it's really messy right now:
words_not_vectorized = set()
all_words_untouched = set(tokenized_words_untouched)
print 'applying frequency distribution to original text'
word_freq = nltk.FreqDist(tokenized_words_untouched)
for eachword7 in all_words_untouched:
    if word_freq[eachword7] < 2: # we choose 2 because a word is rarely misspelled twice in the same way
        words_not_vectorized.add(eachword7)
words_that_are_common = all_words_untouched - words_not_vectorized
print 'creating personal spelling dictionary'
with open('listofspelledwords.txt', 'w+') as listofspelledwords:
    for eachword12 in words_that_are_common: # add to dictionary for spelling corrector
        listofspelledwords.write(eachword12 + '\n')
del words_that_are_common
d = enchant.DictWithPWL("en_US","listofspelledwords.txt")
spelled_tokenized_words_untouched =[]
number_of_corrected_spelling_errors = 0
start_time = time.time()
print 'correcting spelling errors -- this will take a while'
for eachword13 in tokenized_words_untouched:
    if d.check(eachword13): # word spelled correctly
        spelled_tokenized_words_untouched.append(eachword13)
    else: # word not spelled correctly
        try:
            spelled_tokenized_words_untouched.append(d.suggest(eachword13)[0])
            # print 'changed '+eachword13+' to '+(d.suggest(eachword13)[0])
            number_of_corrected_spelling_errors = number_of_corrected_spelling_errors + 1
        except IndexError:
            spelled_tokenized_words_untouched.append(eachword13)
print 'the time for spell checking is below: '
print("--- %s seconds ---" % (time.time() - start_time))
print 'number of corrected words from spell check is: '+str(number_of_corrected_spelling_errors)
del number_of_corrected_spelling_errors
With the words that are left over, I find their respective hypernyms and replace them with those hypernyms. Note that I'm doing all of this _before_ I apply word2vec. This helps word2vec results tremendously because it increases the frequency of the words:
'''-------------------------------HYPERNYM CLASSIFICATION SCHEME-------------------------------------'''
total_words = set(spelled_tokenized_words_untouched)
number_of_unclassified_words = 0
number_of_hypernyms_found = 0
# my_regex = r"\b" + re.escape(eachword4) + r"\b"
# text = re.sub(my_regex, hypernym_of_word, text)
hypernyms_replaced_one_text =spelled_tokenized_words_untouched
start_time = time.time()
print 'you are now finding and replacing uncommon words with hypernyms'
for eachword4 in words_not_vectorized:
    use_hypernym = 0
    use_synonym = 0
    try:
        synonym_set_of_word = (Word(eachword4)).synsets[0]
        hypernym_set_of_word = synonym_set_of_word.hypernyms()[0]
        hypernym_of_word = hypernym_set_of_word.name().partition('.')[0]
        for n, eachword10 in enumerate(spelled_tokenized_words_untouched):
            if eachword10 == eachword4:
                hypernyms_replaced_one_text[n] = hypernym_of_word
                number_of_hypernyms_found = number_of_hypernyms_found + 1
    except IndexError:
        number_of_unclassified_words = number_of_unclassified_words + 1
print 'you completed the hypernym process in time below'
print("--- %s seconds ---" % (time.time() - start_time))
print 'below is the number of words originally not vectorized:'
print len(words_not_vectorized)
print 'below is the number of words with no hypernyms found'
print number_of_unclassified_words
print 'Below is the number of different words within original text'
print len(total_words)
print 'total number words in the original text below'
print len(spelled_tokenized_words_untouched)
del spelled_tokenized_words_untouched
print 'total number words in the hypernymed text below'
print len(hypernyms_replaced_one_text)
# print 'total number of words in hypernym filtered list'
# print len(replaced_uncommon_with_hypernyms_one_list)
print 'total number of hypernyms found and replaced:'
print number_of_hypernyms_found
print 'the number of different words BEFORE ANY PREPROCESSING WAS DONE including punctuation is '+str(len(all_words_untouched))
I'll keep this short. To use pre-trained word embeddings, just lop off the embedding layer. The input to the LSTM is 3D if I recall correctly. Different example (and not seq to seq) but here's a conv net that is using the pre-trained embeddings that I have used:
print('Build model...')
# input: 2D tensor of integer indices of characters (eg. 1-57).
# input tensor has shape (samples, maxlen)
nb_feature_maps = 32
n_ngram = 5 # 5 is good (0.7338 on Causer) - 64 sized embedding, 32 feature maps, relu in conv layer
embedding_size = emb_shape[0]
model = Sequential()
model.add(Convolution2D(nb_feature_maps, 1, n_ngram, embedding_size))
model.add(Activation("relu"))
model.add(MaxPooling2D(poolsize=(maxlen - n_ngram + 1, 1)))
model.add(Flatten())
model.add(Dense(nb_feature_maps, 1))
model.add(Activation("sigmoid"))
# NOTE: add in repeat layer and decoder here
You can adapt that to be the decoder part, or you can keep the RNN structure too, as I mentioned remove the embedding layer and load in your own embeddings in place of it.
Another option is doing convolutions over embeddings (allowing bi-gram, tri-gram, etc to be extracted as convolutions).
The iterations comment was for Word2Vec, correct.
Awesome, thanks again @simonhughes22, I really appreciate your help, and I'll definitely do some digging around and experimentation! Removing the embedding layer was the part I was missing. It will take me about 3 weeks to really test the ideas you suggested. If I find something interesting or helpful, I'll post it back here. Thanks again man!
One other thing: you could train a character RNN. That way you have only a small number of inputs and outputs. Like this: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. Google "character RNN" for more papers. LeCun's team did some work on this with conv nets if I recall, and Karpathy (I think) with RNNs.
Thanks for the pointer, but that's actually where I started! I started by predicting chars after inputting 30 chars. I didn't try seq to seq on it. But I do appreciate the suggestion! I think I'm gonna stick with words and try to make as much progress as I can. Currently building the seq to seq model. Hopefully training will start tomorrow.
Just as a side note, the predicting chars works well when you throw about 300mb to 3gb of text at it, and ramp your lstm layers to 8 or 10. If you give it 100 chars, and ask it to predict the next one, it will predict phrases of sentences pretty well. It takes forever to train though, so I suggest changing the learning rate on adam to 0.02 and decreasing it when the loss goes crazy.
Just as an update for anyone reading this thread -- I retested the padding_sequences, and it pads on the _left_ side. So I was completely wrong, and Simon's memory is better than mine!
:). @LeavesBreathe I only remember that because it's important to how the LSTM RNN works. As you run the input left to right, it outputs a vector based on that whole sequence, but that vector is a reflection of its internal state, which is more sensitive to the more recent values (which is why you often want to output full sequences, as those aren't). Once it hits the zeros it learns to reset its state, so if the padding were at the end, it wouldn't work very well, as those are the last inputs processed, wiping its state. Which is why I said it's very important to consider, especially when reversing the inputs. If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side. It 'may' not learn to reset its state, but once it hits a zero it just needs to learn to predict more of them, so it doesn't need to remember anything else about the rest of the sentence at that point, and so in practice it will use that to trigger the forget gate. Hope that makes sense.
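For anyone wanting to check or control the padding side themselves, pad_sequences takes a padding argument ('pre' puts the zeros on the left and is the default in current Keras versions):
from keras.preprocessing.sequence import pad_sequences

print(pad_sequences([[7, 8, 9]], maxlen=5))                   # zeros on the left ('pre' is the default)
print(pad_sequences([[7, 8, 9]], maxlen=5, padding='post'))   # zeros on the right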
If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side.
Thank you, this makes much more sense. I didn't realize it reads them left to right, and that if the zeros were on the right side, it would wipe the state. Makes complete sense.
Sometime, I want to buy you a latte (grande)
@LeavesBreathe I realize you could be anywhere in the world so this is a long shot. But if you happen to be anywhere near the Chicago area, or San Jose (I visit there a few times a year), I'd take you up on that.
hahaha I'm actually in Cincinnati (looks like 4.5hr drive for me), so this could be a possibility =p If you want to add me on skype, my username is "leavesbreathe"
Hi guys,
I went through the whole conversation but I still have a simple question about the Embedding Layer.
Assume I have a vocabulary of size nvocab, and all my sequences have length nseq. Assume I have only one sample for simplicity. I want to find an embedding for each word with nembed dimensions.
1) Should the input to the embedding layer, xtrain, be a sequence of integers or one-hot encodings? Currently, my xtrain has length nseq, with each element an integer that can take a value up to nvocab.
2) If xtrain is a list of integers (rather than one-hot encodings), what does the weight matrix in the embedding layer look like? Let us look at one word in the sequence. What I understand so far is that the embedding layer internally converts this word (a single integer) to a one-hot encoded vector of size nvocab (let's call this vector Vonehot). Then, the network learns an embedding weight matrix, Wemb, of size [nembed x nvocab], and computes an embedding vector Vembed = Wemb x Vonehot. Vembed is a vector of length nembed. Is that true?
Thanks a lot.
@kg07
1). If you use an embedding layer, feed it a list of integers. Without it you'd need a one-hot encoding.
2). This is correct. An embedding layer is just a linear NNet layer mapping a one-hot encoding to a hidden layer. The 'embedding' is simply the weights associated with each input node. As there is a separate input node for every word, you get a separate embedding for each word. This is represented by a matrix of size (nembed x nvocab), although the 2 dimensions may be switched depending on implementation details. At least that's my understanding. Of course the literature never explains it that simply, which is a shame, but that's what I've inferred after much digging. I'd love for someone to correct me if that is wrong.
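To make that concrete, here's a tiny numpy sketch (not Keras internals, just the arithmetic) showing that multiplying a one-hot vector by the weight matrix is the same as picking out one column of it, using the nvocab/nembed names from the question above:

```python
import numpy as np

nvocab, nembed = 10, 4                   # toy sizes
W_emb = np.random.randn(nembed, nvocab)  # embedding weights: one column per word

word_id = 7                              # the integer you feed the Embedding layer
v_onehot = np.zeros(nvocab)
v_onehot[word_id] = 1.0

v_embed = W_emb.dot(v_onehot)            # the nembed-dimensional embedding
assert np.allclose(v_embed, W_emb[:, word_id])  # identical to a plain column lookup
```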
@kg07
Give the embedding layer your integers -- the embedding layer _only_ accepts integers -- and your input should be 2D: (nb_samples, sequence_length).
sorry just saw simon's comments -- nvm mine
Thanks a lot guys!
Note that often they use a dictionary for performance purposes when implementing those embedding layers. But that is equivalent to what I described AFAIK. For any pedantic types :)
Hey @simonhughes22 , thanks for the pointers again. I've made some good headway so far.
I wanted to revisit one issue we discussed earlier: What is the best way to format the Y_train so that we can predict words? Which of these ideas do you think is the best?
I've read in several places that doing a softmax over 2k terms is just a very bad idea. You face the curse of dimensionality, meaning it gets exponentially harder to predict words. So if you have a vocab of 100k words you would have to do a 100k softmax. This seems like the option of last resort.
I've read in some papers doing 100k softmaxes but only with 8 Titans or so. Here's an example: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.
I think this is the best idea, but the hardest to implement because of the heavy writing. Some work has been done here: https://github.com/fchollet/keras/issues/438.
I feel that I'm not experienced enough to fully pull this off yet. But maybe in the future I will attempt to do this and submit a PR.
Another idea is to abandon the categorical softmax altogether and simply predict vectors. Obviously the neural net is not going to predict the exact vectors, so you take the vector it does produce for each word and find the closest matching word. I don't know if you can do this in word2vec, but I imagine you could. So for each word, you would predict, let's say, a 32-dimensional vector that describes the word.
I think this is complicated to implement. For each sequence of, let's say, 20 words, you're asking the neural net to produce 32 x 20 = 640 numbers. This seems like a nightmare to me. I guess you would use a linear/tanh activation, mse objective, and RMSprop optimizer?
Not to bring back bad ideas, but I do think the clustering discussed earlier would work well here. The reason being that you do a softmax, but it is only over ~400 terms for 80k words. I've run this to predict individual words (not sequences of words), and it always gets the cluster id and word id right after epoch 1 (a sketch of the mapping is below).
Advantage: You only have to softmax over 400 terms. You get the added bonus that the word id will be near 0 or 1 (since all of the words in a cluster are ordered by frequency).
Disadvantage: You have to predict two separate integers per word. You also have to one-hot encode, but it's only over 400 numbers.
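Here's roughly how I build that (cluster_id, word_id) mapping -- just a sketch, assuming `vocab` is a frequency-sorted word list and ~400 clusters:

```python
import math

vocab = ["the", "of", "and"]  # assumed: ~80k words sorted by frequency, most common first
n_clusters = 400
cluster_size = int(math.ceil(len(vocab) / float(n_clusters)))

word_to_pair = {}
for rank, word in enumerate(vocab):
    cluster_id = rank // cluster_size   # target of the ~400-way softmax
    word_id = rank % cluster_size       # second small target, frequency-ordered within the cluster
    word_to_pair[word] = (cluster_id, word_id)
```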
I don't mean to bother too much, but I just wanted to hear your thoughts on these options. It will take me at least a few days to implement/test each idea, so I would rather start with the best one and see what happens. Thanks!
Howdy,
I've been doing something like your idea 3 using pre-trained vectors as both inputs and outputs:
word_vector_size = 300 # this is the dimensionality of the word vectors I already have
dense_size = 512 # (not actually used in this snippet)
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (1, max_len, word_vector_size) # max_len = padded input sequence length
model.add(M)
model.add(GRU(word_vector_size, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(word_vector_size, activation="linear"))
#optimizer = rmsprop(lr=0.001, clipnorm=10) # another option
optimizer = adam(lr=0.001, clipnorm=10) # works for me
model.compile(optimizer=optimizer, loss='mse')
(Sorry for the ugly Masking hack - there is currently some bug such that just doing model.add(Masking) doesn't work without an embedding layer at the moment.)
The inputs have shape: (n_samples, max_len, 300) and the outputs are (n_samples, 300). The vectors are dependency-based pre-trained word vectors from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
I don't have concrete results yet, but it does learn, and is a much smaller output space than the one-hot idea (your idea 1). Before this I tried that with 10k word classes and it was also learning, but VERY slowly (as I only have a laptop GPU - GTX 870m); I found you need to use really heavy gradient clipping with rmsprop (clipnorm=0.1), or no learning would take place. Also, the output space was much larger -> 10k by the number in the mini-batch, whereas with regression your output space is only 300 by the number in the mini-batch.
My goals are maybe different from yours - I want smart encodings of sentences such that nearest neighbors give sensible results, and better than TFIDF BoW nearest neighbors. It seems like something in this general RNN direction should work, but I probably need a bigger machine to do the training =)
@sergeyf Thanks for the tips!
(Sorry for the ugly Masking hack - there is currently some bug such that just doing model.add(Masking) doesn't work without an embedding layer at the moment.)
Thanks for the mask hack -- I was trying to figure this out this morning!
(as I only have a laptop GPU - GTX 870m)
I gotta tell you man, getting a modern Maxwell GPU is so worth it. I can't even imagine trying to do this on a laptop. You can try to get decent GPUs on eBay. I regularly see 980 Tis going for $600 and Titan Xs going for $850. Just make sure they weren't used for bitcoin mining. This post helped me a lot: http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/
Also, the output space was much larger -> 10k by the number in the mini-batch, whereas with regression your output space is only 300 by the number in the mini-batch.
Right. This is the whole idea with regression. You only have 300 numbers per word (or however many you choose to use with word2vec).
Currently my goal is to take a sentence, and predict the next sentence that makes somewhat logical sense. I think this is similar to what you're doing? It is similar to translation but the "translation" is the next sentence that should come.
Also, how are you converting your 300-number outputs to a word (when you do model.predict)? Is there a function in word2vec that does that? (I did a little searching and couldn't find one.)
Happy to help!
I am not converting the 300-number output to a word. I just leave it as is. Once the training is done, I feed entire sentences into the network, but take the representation that comes out of the RNN (before the Dense layer) and use that as my representation of the sentence. Then I do NN (nearest neighbors) in the sentence space. The idea being that I just fed a sequence of words into an RNN, so its final representation should be a sentence. Does that make sense? This is why I was pointing out that we may have different goals.
@sergeyf apologies for my misunderstanding. I understand what you're getting at. I guess I'm still kinda stuck, though, as to which of the four ideas I mentioned above would be best for next-sentence prediction. =/
I definitely think what you're doing is smart, and could potentially work really well!
No worries!
I am not sure why you wouldn't just predict the next sentence as represented by word2vec vectors?
So input is "Horses run." as represented by [x_horses, x_run] and output is "They run quickly." as represented by [x_they, x_run, x_quickly]. Why ever convert things into categories instead of just leaving them always as vector embeddings?
@LeavesBreathe instead of predicting one-hot vectors, replace those one-hots with the pre-trained vectors, word2vec or those dependency embeddings. Either way your output will be a matrix; you just drop one of the 2 dimensions from the size of your vocabulary to the size of the embedding. If that makes sense.
@sergeyf @simonhughes22 Alright, I just feel stupid. Both of you are telling me the answer, and I can't understand it.
I completely understand x_train. The 3d matrix of (nb_samples, timesteps, word_vectors). And I can do the same with y_train as well. But when I do model.predict, won't the model have to predict word vectors for each word?
Suppose you use 32 scalars per word (when you set the word2vec size = 32). This means that for each word in the sentence, you must predict 32 scalars, correct? Not only that, you can't use a softmax. And what are the odds that the network is going to predict the exact 32 scalars that correspond to a word? This is why I'm so lost.
IIRC you stack the one hot or the embeddings vertically, and then you have one column per output step up to max sequences. I may have the rows and cols reversed, but that's the idea
make sure, if predicting embeddings, you use RMSE error. I don't think any of the other error metrics are correct for that. At prediction time (once trained) do a cosine sim search on most similar vectors
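A minimal sketch of that lookup step, assuming `embeddings` is a (vocab_size x dim) matrix of the same pre-trained vectors used for the targets and `vocab` is the matching word list:

```python
import numpy as np

def nearest_word(predicted_vec, embeddings, vocab):
    # cosine similarity between the predicted vector and every word embedding
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(predicted_vec)
    sims = embeddings.dot(predicted_vec) / np.maximum(norms, 1e-8)
    return vocab[int(np.argmax(sims))]

# decoded = [nearest_word(v, embeddings, vocab) for v in model.predict(x)[0]]
```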
At prediction time (once trained) do a cosine sim search on most similar vectors
Thank you! This is what I thought you had to do. So the final network should look like this correct?
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (1, maxlen, word2vec_dimension)
model.add(M)
model.add(JZS1(hidden_variables, input_shape=(maxlen, word2vec_dimension), return_sequences=False)) # note for the input shape that you did not put the number of samples
model.add(Dropout(dropout))
model.add(Dense(hidden_variables)) # consider adding another dense here
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
for z in range(0, number_of_decoding_layers):
    model.add(JZS1(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(max_features, activation="linear")) # consider adding another timedistributeddense here
model.compile(loss='mean_squared_error', optimizer='rmsprop')
Particularly:
Loss = mse (or are you saying I should do rmse here instead?)
Optimizer = rmsprop or adam
Activation = Linear (or is there a better option?)
I think so
I've also had it work with binary cross-entropy, as that doesn't technically do a softmax, but it's not really meant to be used like that. fchollet recommended I do RMSE, but you could try bce if desired. Outputs need to be in the range 0-1 to use bce I think, so your word vectors would need to be in that range.
Does RMSE vs MSE make much of a difference? It would change the steepness of curvature near the minima and/or saddle points, but not sure which one is preferred. Probably depends on the problem (as with everything)...
@sergeyf that's probably an empirical question. RMSE is a popular error metric because the errors in the training data are assumed to be normally distributed (per the CLT), and so RMSE is the 'best' metric to minimize under those assumptions when you have real numbers rather than ordinals - at least in theory.
Outputs need to be in the range 0-1 to use bce I think, so your word vectors would need to be in that range.
I think word2vec does this, so I'll try that! Interesting that you can use bce -- I would not have thought of that, but hopefully it will perform better than mse. If I end up trying RMSE, I'll submit a pull request for it.
There's a discussion about this happening on r/machinelearning: https://www.reddit.com/r/MachineLearning/comments/3qyn0m/sequence_to_sequence_mapping_via_lstm/
Their claim is that this doesn't work as well as classification!
@sergeyf this link from that discussion would back that up: https://github.com/yandex/faster-rnnlm. In summary, hierarchical softmax is used for speed, but you are sacrificing some accuracy for that efficiency gain. Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems. NCE seems to be the way to go for these models, but I am unsure how you'd do that in a sequence learning model.
You 'could' try learning to predict a bag of words (BOW) representation instead of a sequence, i.e. a single vector the same length as your vocabulary, with a binary indicator for each word. Then train a second model, a simple language model, to translate this into the most probable word sequence. But you've thrown away any word order in passing the predicted BOW between the two models, so this probably wouldn't work as well, particularly as you can often re-order the words in a sentence and completely alter the meaning. But it very much depends on the problem you are solving. If it's not a linguistics problem but some other sequence, the ordering may be very easy to determine from a BOW-type output.
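If anyone wants to try that, the BOW target is just a multi-hot vector per sample; a rough sketch (`word2idx` is assumed to be your own vocabulary index):

```python
import numpy as np

def to_bow(sentences, word2idx):
    # binary bag-of-words targets: one row per sentence, one column per vocab word
    Y = np.zeros((len(sentences), len(word2idx)), dtype='float32')
    for i, sent in enumerate(sentences):
        for w in sent.split():
            if w in word2idx:
                Y[i, word2idx[w]] = 1.0
    return Y

# pair this with a sigmoid output layer and binary cross-entropy, since each word
# is an independent yes/no rather than one class out of the whole vocabulary
```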
@simonhughes22 thanks for that link - looks really interesting. My particular goal is to do NN queries for various sentences. The sentences tend to be on the shorter side, so it may indeed be not a big deal to lose the ordering. I've tried some models that seem to suggest that, but nothing conclusive yet.
@sergeyf @simonhughes22, this reddit discussion is really good. I'll post the findings I've had so far over there.
Also, if you're interested, I submitted a PR for root mean squared error, so it's in Keras now if you want to use it.
Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems.
I completely agree. Doing a softmax over 100k words seems like the wrong direction, even though that Google seq to seq paper did it (it needed 8 Titans though). This is why I suggested the clustering: so that you could use a softmax (and therefore categorical_crossentropy) and not have to resort to mse or rmse. The reddit discussion seems to criticize mse pretty hard.
In the meantime, I've been directly inserting vectors and then using cos distance to predict the next sentence. Haven't had much luck yet, though.
I'll comment back here when I have more info
@sergeyf this might be of use to you, although given how fast the field moves it's relatively old: http://www.utstat.toronto.edu/~rsalakhu/papers/topics.pdf - Hinton and Salakhutdinov paper
Thanks @simonhughes22 - I've seen a number of similar papers that all seem to use additional noise and stacked denoising autoencoders. It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)
I also found two non neural-network approaches.
One that makes use of pretrained word vectors and then does a vectorized word mover's distance between sentences - http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf Python code for the first link is here: https://github.com/mkusner/wmd
And another that marginalizes out the noise that is added for the denoising autoencoder, yielding a closed-form solution: http://arxiv.org/pdf/1301.6770v1.pdf
It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)
I asked about this about a month ago and @EderSantana said it would be easy to implement, but I'm not sure if it would be worth it right now. RNN's might still be better than DBNs?
@sergeyf thanks, cool, i'll check it out
I'm not an expert, but seems like they are fundamentally different enough to both be worthwhile?
RBM's are pretty simple to code up. Not all that different to an autoencoder though, which you could do in keras as is.
Hey guys, just as an update. I used about 250MB of text to train. Basically, I'm getting results where it repeats the same word 4 or 5 times, then moves on to the next word. This was inputting words as 128-dimensional vectors and then doing cosine distance on the output vectors.
This was with rmse, adam, and linear activation. The good news is that it nails "sentence start" and "sentence end" every time. Found the best results with 2 LSTM encoders (hidden = 128) and 3 JZS1 decoder layers (hidden = 256). Also, I tried 2 time-dist-dense layers and it made it slightly better.
From the reddit discussion yesterday, I'm going to try inputs as vectors, and outputs as clusters + word id. It will probably take me a week to set this up properly and test. Will report results back here when I get them!
@LeavesBreathe you normally need to train it for a long time to get past that. I was never able to get great results, as it was more of an intellectual exercise, but it did seem to improve over time, and I didn't leave it running for a long period of time.
Good to know. I was training each model configuration for 50 epochs, adjusting the learning rate when loss was rising.
What I want to do is compare:
50 epochs of using cos distance with linear + rmse
to
50 epochs of using clustering with softmax + categorical_cross
And see which one performs better. Then do 200 epochs on the one that performs better =). With the cos distance, it takes me about 16 hrs for 50 epochs.
Crikey. Are you using the GPU?
Yeah, a 980 Ti... why, do you think that's too slow? Most of my time is wasted just loading matrices... I'm gonna upgrade from 16GB to 24GB of RAM in a few days... eventually, I'm thinking of going to 64GB of RAM (but that's like 4 or 5 months out).
I'm not a hardware guy, but that sounds fast. 16GB of RAM is good for me. I am assuming you just have a lot of data. If you haven't already, check that Theano is utilizing the GPU; I had to jump through a few hoops to ensure that.
Wait, I'm still confused -- are you saying training 50 epochs in 16 hrs is too slow or too fast?
If you haven't already, check that Theano is utilizing the GPU, I had to jump through a few hoops to ensure that.
It took me a solid day when I was starting out to make sure the GPU was being used, but I assure you that it is (it gets pretty hot, 60C).
Keep in mind I'm using 250MB of text, which translates to approximately 2 million samples. I train with a batch size of 4096... but like I said, most of the time is spent loading matrices (which is why I want more RAM).
Got it. Yeah you just have a lot of data. Mine is much faster, 50 epochs doesn't take me too long.
@simonhughes22
I've been following this thread and trying the network structure you shared on some conversational text data, but couldn't get it to learn anything useful. It looks like the only thing it learns is outputting sentence start and sentence end. The only difference is I'm using categorical_crossentropy for the loss function. I've added special tokens for sentence start/end, left padding, and masking. My sequence_len = 100, nb_words = 10000.
BTW, for sentence prediction, is a greedy approach the right thing to do (argmax on the final one-hot output layer for each step in the sequence), or should I sample a word according to the predicted probability?
Could you shed some insight on what could go wrong?
@oleole I can only speak to what I've tried. I have much shorter sequence lengths (inputs and outputs) and a pretty small vocabulary. @LeavesBreathe - also relevant to your question: when I trained it, it would start predicting the start and stop tokens, as those are the easiest to learn and most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting. I never got great results, as I've mentioned a number of times, but it did start to predict sequences. So you may just need to let it train for a really long time. In the academic papers on this subject, the training times are pretty long from what I've read; it's a really hard problem. Also, the longer the sequence, the harder it will be to learn relationships over the length of that sequence.
Is it possible to shorten the sequences somehow (e.g. predict the first 5 or 10 words in the sequence) or just predict the next word only? Predicting the next word would undoubtedly be easier, and that would then be a language model. You can even feed the prediction in at the end of the existing input and iterate that way to create the full output sequence, although you'll have to write a little bit of extra code to do that (a rough sketch is below). I can't remember if I mentioned that here or in another issue. I'd try that first for people who are having problems learning anything. I'd also start with a smallish dataset so you can iterate faster and test out the model structure before letting rip on the full dataset. The other thing to try is to predict the output word vectors as opposed to the one-hot encodings. In that case you would use RMSE or bce as your error metric (please see the discussion above).
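The 'little bit of extra code' might look something like this sketch: `model` is assumed to output a softmax over the vocabulary for the next word, and `start_id`/`end_id` are your own special tokens (pad or truncate `x` however your model expects):

```python
import numpy as np

def generate_sequence(model, seed_ids, start_id, end_id, max_len=50):
    # greedy generation: predict the next word, append it, and feed the longer
    # sequence back in, stopping at the end token or at max_len words
    sequence = list(seed_ids) + [start_id]
    for _ in range(max_len):
        x = np.array([sequence])        # shape (1, current_length)
        probs = model.predict(x)[0]     # softmax over the vocabulary
        next_id = int(np.argmax(probs))
        sequence.append(next_id)
        if next_id == end_id:
            break
    return sequence
```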
@simonhughes22 Thanks a lot for your long reply. I do feel it's learning slowly, but probably need much more time and data to move forward.
I agree learning the full sequence is a very challenging problem. From the "A Neural Conversational Model" paper, it's actually doing what you suggested, greedily learning the next word. I'll definitely give it a try.
The other idea of making the word vector the target is also very interesting, but it's kind of discouraging hearing the results from other people on reddit. A couple of things I'm still not quite clear on:
@sergeyf For your experiment using pre-trained vectors as both inputs and outputs. What are the targets you are trying to predict? It looks like a word vector. Is it the next word given input sequence?
Yes, exactly. I am predicting the next word vector from a sequence of word vectors. It didn't go very well!
Have you tried skip-thoughts vector https://github.com/ryankiros/skip-thoughts?
But there is no vector representation for the predicted sequence (decoder), so how do you get it to work @LeavesBreathe?
@oleole, I haven't really gotten much to work yet either. I too get sentence start and end tokens. However, my biggest problem (with doing y targets as word vectors) is that you get repeated words over and over again.
I don't fully understand your question. Your encoding layer will create a vector rep of your input sequence. The RepeatVector layer repeats that vector rep for all timesteps of your y output. I personally like to use two LSTM layers for encoding because I feel it captures more salient features. But this might be because I have a huge dataset (about 2 million sentences).
Apologies if I didn't answer your question.
The skip-thoughts paper is also of interest, and something I want to look into eventually. Right now, as mentioned above, I'm working on the clustering output. It's taking longer than expected to get the matrices right.
@LeavesBreathe - also relevant to your question: when I trained it, it would start predicting the start and stop tokens, as those are the easiest to learn and most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting.
This is really interesting to me @simonhughes22. I feel this is a strong sign that you need more training data (I know you have a limited set). Usually, if more epochs keep improving results, that tells you that more data would improve the model even more.
Imagine you doubled your data. Then your model would get to the same level (loss) in _half_ the epochs, assuming your data is perfect. Still really good to know that patience is key in this.
@oleole skip-thoughts looks promising, I've been meaning to try it out. The output varies as you are predicting a different output sequence for each input sequence, so I don't get the 'the output is fixed' comment. The start and end vectors can just be all 0's (or all 1's).
Hey guys,
So as an update, I've tried going from word-vector inputs (16-dimensional vectors) to cluster outputs. I've only run it for 3 days, so there are plenty of different hyperparameters left to try. Getting to 50 epochs takes about 2 days, so it's a slow process. However, I can tell you this:
Softmax + cross_entropy is much much better than linear regression/cos distance
This might look lame, but here's some sample output. Notice the word diversity!:
_posthumous respectively acute support bleeding association enough pregabalin gluten but hot into : in some of cocaine provides regular medical herrerasaurus review control the widespread coat upper_airway conclusively after glucose such_as linking the number and have cns may nucleus inflammation include respectively psa , patient trauma than occurs vascular evidence medical pseudounipolar dichotomy_between_
But I'm wondering about something a little more novel: changing the design of sequence to sequence.
Before, we were using a repeat vector for each timestep like in the model shown:
model = Sequential()
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables_encoding))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
But now, let's consider that we do _not_ use the repeat vector. Rather, we simply use a TimeDistributedDense layer instead to get the right number of timesteps for our target:
model = Sequential()
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=True))
model.add(TimeDistributedDense(y_maxlen)) #consider adding another timedistributeddense here
for z in range(0, number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Before trying this for 4 or 5 days, I wanted to get your take on it. Thanks a lot!
I think you would have to have return_sequences=True for the encoding LSTM for this to work (second line).
So, if I understand, the major difference would be that instead of feeding the decoder the final encoding at its every time step, you would be feeding intermediate encodings at every time step. It's hard to say whether it would be better or worse. Presumably the final encoding with return_sequences=False has more info in it than the intermediate ones, and the decoder having access to the final encoding _should_ be better by intuition, but who knows what the truth is. RNNs surprise me often. =)
I think you would have to have return_sequences=True for the encoding LSTM for this to work (second line).
My mistake, you're absolutely right. Thanks for saving me the compile error.
Presumably the final encoding with return_sequences=False has more info in it than the intermediate ones, and the decoder
Why is this necessarily true? For your encoding layer, couldn't you stack more LSTM/TimeDistributedDense layers (in the 'encoding' portion) to make it just as sophisticated? Forgive my lack of understanding if this is an obvious answer.
I just don't like the RepeatVector part of the original model. It seems to me that you're repeating the same vector for each timestep. But wouldn't it be useful if you were giving a _different_ vector for each timestep?
After all, you expect to see a different word at each timestep, so wouldn't it make more sense to give a different vector input for each timestep?
First, let me say that I don't know what actually works or doesn't - the following is my rationale for what I believe should work better.
Let's say you have a sequence of words to encode: 'the cat is dancing'
In the first encoder-decoder type, you get an encoding(the cat is dancing) that is repeated to every step of the first decode layer.
In the second type that you are proposing, you get the following encodings:
encoding(the)
encoding(the cat)
encoding(the cat is)
encoding(the cat is dancing)
But! Presumably encoding(the cat is dancing) has strictly more encoded info than encoding(the cat is) (or the others). So we are providing as much info as possible to the decoding layer's every step. In a way, this lets the decoding layer know about the whole sequence at every step, not just what has been seen up until now. It should be an advantage. Not sure if it actually is one...
@sergeyf what you're saying makes sense. I like that you're providing as much information as possible. I feel kind of foolish for proposing this in the first place.
If my GPU is ever free, I'll just try this idea just for fun.
In the meantime, I'm going to try reversing the input and see if I get better results (like the Google seq to seq paper did). I'm also going to try adding more TimeDistributedDense and LSTM/JZS1 layers. If I get anything better, I'll let you guys know.
Please don't feel foolish! Sometimes things that sound reasonable are wrong. And we have no idea if they are unless we just try random stuff :) That's my favorite way to learn.
Hey guys, I just want to draw some attention to thread https://github.com/fchollet/keras/issues/957. The bottom line is that our outputs are _not_ masked, meaning the cost function is biased.
I think this explains why the results from linear regression tended to be "stretched", i.e. they slowly drifted from word to word. I'm gonna talk to Eder, but his explanations were quite clear to me.
@LeavesBreathe that sounds like a really interesting idea. When I want to test something that takes a long time to run, I run it on a small subset of the data. If that works well, or better than some other approach I am comparing it with on that same small subset, then I try it on the full dataset or a larger subset. I'd advise that here. Smaller data should be easier for it to learn from too, although what it learns won't generalize nearly as well. Hope that helps.
RNN encoder-decoders do take a really long time to converge to good solutions... everybody seems to be reporting that. Another thing: in the final decoder layer you may want an RNN with a readout (the final generated word is sent back to the inner RNN). I wrote a GRU with readout here: https://github.com/EderSantana/seya/blob/master/examples/imdb_readout.py#L53-L67
First of all, I can't believe I'm only finding out about Seya now. It's so exciting to me that you have stateful GRUs and bidirectional RNNs. This is just amazing.
I need to read up a little on readouts to understand exactly what the advantage of this is, but it sounds very exciting. Thanks a lot man.
@simonhughes22 I agree with you. Start small and grow bigger when testing anything major. Though I must say that the Sutskever RNN has much more promise.
C'mon, you didn't know about Seya :D ??? That is the place where I'm cooking things up before I push them to Keras. Some advanced examples that would crowd this repo are also there, like Spatial Transformer Networks and DRAW. If more people use them and make suggestions, we could move them up here to main Keras.
To understand what I mean by readout, see this figure by Cho et al.
See the difference between the encoder and the decoder??? The generated symbol is sent back to the RNN in the decoder.
Ahhhh, I gotcha. So basically, if I understand it correctly, let's say your decoding layer produces sentences. For y1 it produces: "the"
For y2 it produces: "cow"
Since it is a readout GRU, for y3 it sees that you have written "the" and "cow", so it is more likely to pick "jumped" as y3?
Of course, y1, y2, and y3 are each a distribution of probabilities (assuming you're using a softmax), and it would see those probabilities. I usually apply a temperature after the probabilities are produced (so I'm not always picking the highest-probability choice).
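For anyone curious, the temperature trick is just a re-scaling of the softmax output before sampling; a quick numpy sketch:

```python
import numpy as np

def sample_with_temperature(probs, temperature=0.7):
    # temperature < 1 sharpens the distribution, > 1 flattens it;
    # plain argmax is the limit as temperature goes to 0
    logits = np.log(np.asarray(probs) + 1e-8) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return int(np.argmax(np.random.multinomial(1, scaled)))
```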
Anyways, I look forward to the tutorial you mentioned in the other thread. There's just so much to try now!
I was able to have a chat with K. Cho. I believe that the readout is used at test time; the model is trained by 'teacher forcing', i.e. the decoder is fed the true label (from the previous time step) as input, which is replaced by the readout at test time.
Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state -- for example, if you wanted to use the model to score a pair of sequences (and not generate the target sequence).
Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state.
I'm trying to generate text, which is why this is so critical to me. I believe @simonhughes22 is as well.
@gautamb85 I thought I was cheating when I teacher forced during training xD good to know!!!
But note that not all the information needed is present in the hidden state, especially if you are using a deep readout. Several works report feeding back the readout.
Heh. We have had this debate before (on this thread somewhere I think).
The way I think about it is: the input to your decoder RNN (if you don't feed in the prediction) is going to be the summary vector produced by the encoder, at every timestep.
Now, I know for a fact that this does work (at least on simple tasks like numbers -> number strings). However, intuitively, if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.
I am using the approach as a generative model to score a pair of audio sequences, so I can't be confident about the text generation problem.
@simonhughes22 can the sequence-to-sequence model you proposed in keras be used to score two sequences?
I don't think so, because there is no way to input the second sequence. I guess it would need a Graph() model. It was actually easier to just write my own code (not easy, but easier ;) )
@EderSantana nope! You're good :)
Are you doing this with Keras? Would it need a Graph() model?
I wrote my own, but I only have SGD and momentum going, and it would be nice to have fancier optimization.
Thanks for the tip on the readout.
@gautamb85 - use a Y-shaped graph model, or concatenate the two sequences. I suspect that'll be really tough to learn, although hopefully the changes @EderSantana is suggesting will work better. I don't have time to check (busy with work, PhD, and Kaggle :) ).
Yes, I did use a Graph, inputting the "input" and the "teacher" (the input delayed one step), then I pass both to the decoder GRU with merge_mode=concat.
@EderSantana seya looks awesome. Please add to keras!
Off topic, but is anyone coming to NIPS? I say coming since I live in Montreal :)
However, intuitively if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.
This is exactly what I thought when @EderSantana explained the GRU readout. Knowing where it is in the readout would, I think, be incredibly powerful. I'll try to integrate this readout GRU and report back if I get better val loss with text generation. It will take me some time to integrate it correctly.
I'll also graph my results publicly if it helps anyone. I've started graphing here: https://plot.ly/~oxygen123/folder/home
@gautamb85 I've been sitting out for much of this chat, but I'll be at NIPS, we should organize a Keras meetup! We're doing a lot of sequence to sequence stuff as well, would be good to compare notes.
I will also be at NIPS!
@wxs It would really be good to compare notes! I know that another Keras contributor lives in Montreal; I will try to get in touch with him. Maybe let's do a post on the google-group for a meetup?
Ps. Get ready for some serious cold :)
I will be at NIPS as well. I would be interested in the meetup.
Best regards, Mariano
@LeavesBreathe You might want to check out this ipython notebook
https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
He has set up an encoder-decoder and one with an attention mechanism. Note that he is doing it in the same way (not feeding in the prediction), but the attention model can kind of compensate for this.
There is also a toy-problem (text prediction :)) setup.
You would have to learn lasagne (also a really good package), and use more theano.
Wow, that NIPS meetup will be fun. I'll be going home in December, but you guys should write blog posts or something to let us know what happened.
I'll be at NIPS -- I'd love a meetup.
@LeavesBreathe You might want to check out this ipython notebook
https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
This is a really good find. You're right that I would need to learn lasagne, but it may be worth it if it allows more capabilities.
I think for right now, I'm gonna wait for @EderSantana's tutorial and go from there. Keras has been a huge help to me, and I hope I can start to contribute to it. In the meantime, I'm gonna try to start implementing the readout GRU.
Moving NIPS discussion to #962!
I have done a seq2seq implementation, based on http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
https://github.com/farizrahman4u/seq2seq
It has stateful LSTM encoder and decoder, hidden state transfer from encoder to decoder, feedback decoder (output at step t is input at step t+1), depth and all the fancy stuff.
@farizrahman4u I saw your code; it looks pretty interesting. @LeavesBreathe I think you'll want to check that out.
One little thing: I saw that you use two LSTMs, one for the encoder and another for the decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite? Did it give you better results than using a single RNN? Thanks for sharing.
wow @farizrahman4u , I'm overwhelmed. This Keras community is ridiculous. A huge, huge thanks. I have a few questions if you have time:
I see Seq2seq takes input_length and output_length. But if we add a masking layer _before_ the seq2seq layer, can we mask all zeros? This would be for both input _and_ output, the output being really important (so the cost function is not biased):
```
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension, hidden_dim=hidden_variables_encoding,
                  output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (x_maxlen, word2vec_dimension)
model.add(M)
model.add(seq2seq)
model.add(GRU(hidden_variables_decoding, return_sequences=True))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```
One little thing, I saw that you use two LSTMs, an encoder and another one for decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite?
@EderSantana, from my understanding of Google's seq to seq paper, they got the best results using _four_ LSTMs to encode and _four_ LSTMs to decode, giving them a total of 8 LSTMs.
This is why I asked the question above. Ideally, you would want to use a lot of data along with multiple RNNs. The idea is that each one captures more salient features within your data.
Interestingly, one thing that improves my results from previous prediction experiments is to use not just one type of RNN. Instead, for my decoder layers, using something like:
LSTM, JZS1, GRU (in that order)
gives me better results. I always lead with an LSTM because I feel it captures the largest number of features (a rough sketch of what I mean is below). Anyways, my two cents for what it's worth. Like I said earlier, I'll be publicly graphing all my experiments (and labelling them as best as I can) with @EderSantana's and @farizrahman4u's mods on Keras. Tonight, I'm gonna try leading with a bidirectional LSTM and see what happens.
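Here's the rough shape of the decoder stack I mean -- just a sketch with made-up sizes, using the same RepeatVector-style structure as the earlier snippets:

```python
from keras.models import Sequential
from keras.layers.core import Dense, Activation, RepeatVector, TimeDistributedDense
from keras.layers.recurrent import LSTM, GRU, JZS1

x_maxlen, y_maxlen = 30, 30      # example sequence lengths
word2vec_dimension = 128         # example input vector size
y_matrix_axis = 400              # example output size (e.g. the cluster softmax)

model = Sequential()
model.add(LSTM(256, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))
# decoder: mix recurrent layer types, leading with an LSTM
model.add(LSTM(256, return_sequences=True))
model.add(JZS1(256, return_sequences=True))
model.add(GRU(256, return_sequences=True))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```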
Hi @LeavesBreathe. You have to mask only after the seq2seq, not before.
["How are you <EOL> <EOL> <EOL> <EOL> <EOL>",
"I am fine <EOL> <EOL> <EOL> <EOL> <EOL>"]
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Apart from the ready-made Seq2seq model, try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models (e.g. multiple dense layers between encoder and decoder, one-encoder-many-decoders models, etc.). It's fun! @EderSantana In Cho et al, the output from the encoder is fed to the decoder at every time step, and a readout is also present. But in seq2seq, the output from the encoder is fed to the decoder in the first time step only. Also, the hidden state is copied from encoder to decoder. So my model is more similar to seq2seq.
I have added a new decoder: LSTMDecoder2. It is similar to Cho et al: the output from the encoder is fed to the decoder at every time step, along with the output of the previous time step. You may or may not enable hidden state copying when using this decoder. (It should work better when not enabled in the case of a conversational model, as it could remember not only what was said by the human in the previous time steps, but also what it said itself in the previous time steps.)
You can also transfer a trained encoder to a new language pair by copying its weights:
EnglishToFrench = Seq2seq()
EnglishToFrench.compile()
EnglishToFrench.train()
encoder_data = EnglishToFrench.encoder.get_weights()
EnglishToSpanish = Seq2seq()
EnglishToSpanish.encoder.set_weights(encoder_data)
EnglishToSpanish.compile()
EnglishToSpanish.train()
You can also train multiple language pairs simultaneously (Encode in English, decode in other languages):
EnglishEncoder = LSTMEncoder()
FrenchDecoder = LSTMDecoder()
SpanishDecoder = LSTMDecoder()
GermanDecoder = LSTMDecoder()
dense = Dense()
EnglishEncoder.decoders = [FrenchDecoder, SpanishDecoder, GermanDecoder] #Multiple decoders. Wow!
model = Graph()
model.add_input(EnglishEncoder, "english")
model.add_node(dense,"dense", input="english")
model.add_output(FrenchDecoder, "french", input="dense")
model.add_output(SpanishDecoder, "spanish", input="dense")
model.add_output(GermanDecoder, "german", input="dense")
model.compile()
model.train()
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Huh. I always thought you wanted to mask your input, but what you're saying makes sense. I guess the question then is: how do you mask the output given the seq2seq model? Would it be something like this?
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension, hidden_dim=hidden_variables_encoding,
                  output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
model.add(seq2seq)
model.add(Masking(mask_value=0))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models
Yes! I definitely want to do that. From last night, I actually got better results with using strictly bidirectional lstms from seya.
Once I have things working, I'll make a BidirectionalLSTMEncoder and Decoder. I'll either submit a pull request to you or @EderSantana. It will take me at least a few weeks to get there though. I have a lot of matrix setup and testing I need to do.
I have added a new decoder: LSTMDecoder2, it is similar to Cho et al, the output from encoder is fed to the decoder at every time step, along with output of previous time step
This is really cool. I can't wait to try all of this out. I also saw you updated the conversational.py -- Thanks!
Thanks for pointing out that the optimal value is 4 and not 5.
It's not necessarily that the _optimal_ value is 4. I actually went to med school for a while and studied neurology heavily. You'll find that the brain appropriates different numbers of neurons for different tasks (along with different types of neurons), and different types of conversations require different numbers of neurons.
The bottom line is that, for different areas of conversation (or topics), there is a sweet spot in the number of neurons you want, and the brain automatically optimizes to that level. Talking about the weather takes fewer neurons and watts than theoretical physics discussions. This is also why people who speak 4 or 5 languages have more neurons appropriated for language processing and production.
Right now, we do that optimizing manually by trying different numbers of layers/hidden states. So in summary, it's not that 4 is the most optimal for seq to seq; it's that for that dataset, with the task of _translating_ English to French, 4 happens to be optimal. If you tried translating English to Chinese, I bet the optimal depth would be 6 or 7.
Didn't mean to ramble, but all I'm trying to say is: experiment with different depths. Sometimes I get better results with fewer hidden states but _more_ layers. Sorry if I'm beating a dead horse.
@LeavesBreathe Thanks for clearing up the layer depth thing. Regarding masking:
One note: if anybody ever has to mask outputs, use sample_weight, not masking, to choose which values affect the cost function (rough sketch below). In practice it doesn't matter what the network outputs after EOL, so we don't actually need to make it learn to output zeros. But if you are not familiar with sample_weight, do as @farizrahman4u says and just use the model as he suggested.
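A rough sketch of the sample_weight route, with the caveat that the exact argument names (sample_weight_mode='temporal', nb_epoch) are assumptions that depend on your Keras version -- check your compile/fit signatures:

```python
import numpy as np

# y_train: (nb_samples, y_maxlen, vocab_size) one-hot targets; padded timesteps are all zeros
weights = (y_train.sum(axis=-1) != 0).astype('float32')  # (nb_samples, y_maxlen): 0 on padding, 1 elsewhere

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X_train, y_train, batch_size=128, nb_epoch=10,
          sample_weight=weights)
```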
@farizrahman4u Thanks for such a detailed description in regards to masking. I completely understand everything you're saying in regards to input. Fortunately, I do not have any 'out-of-vocab' words, so I won't be masking any input.
Thank you also for clarifying that EOL is a word. And is the first part to be predicted correctly. Makes complete sense.
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
I feel like it's an additional thing your network has to learn. If you're predicting sentences, your network has to learn that there is a length of 50 for each predicted sentence. Suppose the predicted sentence has 30 words. Then the network has to learn to predict EOLs for the remaining 20 timesteps.
I'm wondering if this comes at a cost to the network. It might not, and I might just be worrying about nothing.
At this point, I feel bad asking more questions, so if you don't have time, don't bother addressing below:
There is another important aspect that I'm still cloudy on: transferring hidden states
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
Suppose for the decoder layer we stack 4 LSTMs (depth = [4,4]). Does the hidden state transfer from layer to layer?
If so, what is the difference between transferring the hidden states, and just regularly stacking 4 LSTM Keras layers? Why is transferring the hidden state advantageous?
Thanks a lot again.
Is there any attention-based model example with Keras?
Thank you very much.
@LeavesBreathe
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
Not much.
Then the network has to learn to predict EOL's for the rest of the 20 timesteps.
This is not as tough as it sounds. Your network DOES NOT learn like this:
After 1 EOL, I should output EOL
After 2 EOLs, I should output EOL
After 3 EOLs, I should output EOL
......
After 19 EOLs I should output EOL
Instead, it just learns:
If my previous output is EOL:
    output EOL
else if I do not have anything more to say:
    output EOL
This rule is very simple to learn when compared to the complex stuff your seq2seq model learns, like translation, conversation, etc. So don't worry.
Thanks @farizrahman4u for the clarification. The if statement makes much more sense. I'll get working on this and report back here if I find anything interesting that may help you guys.
I was having some trouble with masking my data a while back, and I was hoping someone could clarify a few things for me before I try a large experiment.
Q1. As my data is a 3D tensor, I pad the zeros at the BOTTOM of each individual feature matrix. Is this ok? (It's a matrix of zeros.)
I know the padding function in keras pads to the right, but in my case it has to be either above or below.
Otherwise I have a masking layer after my input layer with mask value set to (0.)
Thanks
@LeavesBreathe
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
I have added a new function to StatefulRNN, called broadcast_state. So you can send your hidden state from any StatefulRNN to another. For example, here is a Seq2seq model with depth 2. The hidden state of encoder1 is propagated throughout the model.
encoder1 = LSTMEncoder(.........return_sequences=True)
encoder2 = LSTMEncoder(..........)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder3 = LSTMEncoder(.....return_sequences=True)
encoder4 = LSTMEncoder(.....return_sequences=True)
#Connect hidden layers
encoder1.broadcast_state(encoder2)
encoder2.broadcast_state(decoder)
decoder.broadcast_state(encoder3)
encoder3.broadcast_state(encoder4)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(encoder2)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder3)
seq2seq.add(encoder4)
I will be updating Seq2seq and Conversational shortly, stay tuned!
Wow Fariz... seriously man, you're awesome. This broadcast_state will be incredibly useful, and I definitely hope the main Keras gets it eventually.
I am a little confused as to the seq2seq model you built in your snippet of code.
You go from encoder -> dense -> decoder.
How are you going from the 2D output of the dense layer to the 3D input required by the decoder? I would have thought you need a RepeatVector for it, but it is better to transfer hidden states instead.
Also: small technicality. For your encoder1 and decoder, you should have return_sequences = True, correct?
In the weeks to come, I really hope to give back to you (and everyone else) by testing a ton of seq2seq models. Hopefully I can give you some insight as I imagine you are doing seq to seq as well.
@gautamb85, I'm not familiar with mfcc, but couldn't you simply do a variation of what @farizrahman4u suggested to me? Couldn't you, instead of masking, place some sort of 'SILENT' token where the zeros are? In this way, you don't have to worry about affecting the seq2seq model or its output data.
@LeavesBreathe hmm, worth a shot, but it's trickier. I can't add a symbol; it has to be a floating point number (I hate real-valued data, why didn't I do NLP, lol). I'll keep you posted if something works.
@farizrahman4u First off.. stellar job on the seq2seq model!! :)
Q. Can I use the model to SCORE a pair of sequences?
In Cho's original paper, they use the model to re-score english-french translation pairs.
So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x)
Doing something like this would not give me the correct data likelihood, will it? (As I want the log likelihood and not the negative log likelihood, I would multiply by -1.)
objective_score = -1*model.evaluate(X_test, Y_test, batch_size=32)
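Alternatively, I guess I could pull the per-timestep probabilities out of predict and sum the logs myself. A rough sketch of what I have in mind (assuming a softmax over the vocabulary at each timestep, and Y_test one-hot with all-zero rows for padding):
import numpy as np

def log_likelihood(model, X_test, Y_test):
    # model.predict is assumed to return (N, timesteps, vocab) softmax outputs
    probs = model.predict(X_test)
    token_probs = (probs * Y_test).sum(axis=-1)      # probability of the true token, (N, timesteps)
    mask = Y_test.sum(axis=-1) > 0                   # ignore padded timesteps
    token_probs = np.where(mask, token_probs, 1.0)   # log(1) contributes 0
    return np.log(token_probs + 1e-12).sum(axis=-1)  # (N,) per-pair log p(y|x)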
I can't add a symbol, it has to be a floating point number
It doesn't necessarily have to be a symbol; it can be a real number, just keep it consistent. Suppose you use "3" as your silent token. Assuming you're doing speech to text, your model will learn that 3 is associated with the silent token output.
As an aside, I would choose a real number that is far away from the other numbers that represent your data. That way, it is treated as a completely separate entity from the rest of your data. As an FYI, this is how the brain does it in your auditory cortex: it represents silence as a certain value, and you consciously recognize that value as silence. This is why conditions like chronic tinnitus can't be cured: the brain never hears the "silence" value and continues to output the "ringing-in-my-ears" value.
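Something like this is what I have in mind (a sketch; the value 3.0 is arbitrary, just keep it consistent and well outside your feature range):
import numpy as np

SILENCE = 3.0   # arbitrary "silence" value, far from the real data range

def pad_with_silence(x, maxlen):
    # pad a (timesteps, feat_dim) feature matrix with constant silence frames
    out = np.full((maxlen, x.shape[1]), SILENCE, dtype=x.dtype)
    out[:x.shape[0]] = x
    return out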
@LeavesBreathe
How are you going from the 2D output of the dense layer to the 3d input required by the decoder?
The decoder's input is 2D (Even though it inherits from LSTM). The output from each time step then becomes the input for the next time step.
Also: Small technicality. For your encoder1 and decoder , you should have return_sequences = True correct?
A decoder always has return_sequences = True by default. And yes, return_sequences = True for encoder1.
@LeavesBreathe I have added a new class called DeepLSTM. It has built in hidden state propagation.
Example:
deep = DeepLSTM(input_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
Notice the inner_return_sequences argument is False, which means the inner LSTMs will behave like Cho's encoder-decoder (a RepeatVector in between non-sequence-returning RNNs).
On a side note, you can also use broadcast_state to send the hidden state from one LSTM to multiple LSTMs (simply pass them as a list).
@farizrahman4u If you get a chance to answer my question, I would be most grateful :)
(pasted from above)
Q. Can I use the model to SCORE a pair of sequences?
In Cho's original paper, they use the model to re-score english-french translation pairs.
So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x)
Doing something like this would not give me the correct data likelihood, will it? (As I want the log likelihood and not the negative log likelihood, I would multiply by -1.)
@gautamb85, don't mean to post over yours, but I do want to get back to Fariz.
@farizrahman4u ...I'm just running out of ways to compliment and thank you.
The output from each time step then becomes the input for the next time step.
Clever. Really clever.
It's great that you can list LSTMs with broadcast_state -- this is gold.
So the idea with DeepLSTM is that it saves you lines of code, right? You _could_ technically build the deep LSTM with the code snippet you gave above using broadcast_state, correct?
To incorporate the DeepLSTM in the seq2seq model, would it be something like this?
encoder1 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
#Connect hidden layers
encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder2)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')
@LeavesBreathe @EderSantana @farizrahman4u @gautamb85 this is really interesting. I've had some work to do so missed the party somewhat. Can everyone post the model and library that they end up with as being the optimal approach for their dataset when all's said and done, and hopefully issue some pull requests to keras so we can get all this in one place? Great work guys
Can everyone post the model that they end up with as being the optimal approach when all's said and done
Glad you're back. I will post my best models along with the training graph as they come in! All my graphs are here: https://plot.ly/~oxygen123/folder/home
@LeavesBreathe Yes, saving lines of code is the idea. And it does all the RepeatVector stuff automatically. Also, for encoder1, return_sequences=False. The depth of encoder2 should be 3, so that for decoding you have 1 decoder + 3 encoders = 4 layers deep.
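Putting those corrections together, your snippet would look roughly like this (same classes and placeholder arguments as in your version; just a sketch, not tested):
encoder1 = DeepLSTM(input_dim=100, output_dim=100, depth=4, return_sequences=False,
                    inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, output_dim=100, depth=3, return_sequences=True,
                    inner_return_sequences=False, remember_state=False, batch_size=32)
#Connect hidden layers
encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder2)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')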
@gautamb85 I saw your comment just now. I will come up with a detailed answer + code in a few hours.
@farizrahman4u You Sir, are a deep learning angel ! :)
I did have a question related to a previous post. I think I read/saw that you are using 2 LSTMs in the encoder.
The idea behind this is a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.
So if the lower RNN is fed words, its last time step would represent a sentence. Consequently, the last timestep of the upper LSTM would encode the whole document.
Of course, as you mentioned, many seq2seq architectures can be experimented with.
PS. It's really late and I might be imagining all this. If that is the case, please ignore.
@farizrahman4u I don't know if this helps, but I can confirm (for sure) at least in Cho et al.'s approach, that the decoder is trained by teacher forcing (though I guess you don't have to). That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction.
If we want to score a pair of sequences, we still have the true labels at test time, so those can be used. I am not sure if you need the readout or if it helps in this case, since you have the true label.
@farizrahman4u You Sir, are a deep learning angel ! :)
Amen
That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction.
Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled.
I guess what I'm asking is: how else would you do this? You have to do teacher forcing?
The idea behind this is a hierarchical RNN? I saw a paper recently, so the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.
I believe the whole idea is this:
You start off with a basic encoder LSTM and decoder LSTM:
words --> Embedding --> Encoder LSTM --> Dense --> Decoder LSTM --> TimeDistributedDense--> Softmax
However, to make this neural net capture more salient features, we add _another_ encoding level after the decoder:
words --> Embedding --> Encoder1 LSTM --> Dense --> Decoder LSTM --> Encoder2 LSTM --> TimeDistributedDense--> Softmax
Lastly, we want to ensure that our encoder1 and our encoder2 are big enough to capture all levels of abstraction, so we add _multiple_ layers of LSTMs _within_ encoder1 and encoder2:
words --> Embedding --> Encoder1 LSTM (4 LSTMs) --> Dense --> Decoder LSTM (1 LSTM) --> Encoder2 LSTM (3 LSTMs) --> TimeDistributedDense--> Softmax
As a side note, there are Dense + RepeatVector layers in between _each_ of the LSTMs within the encoder1 and encoder2 levels.
All hidden states are transferred (propagated) from each LSTM to the next. To do this, you need Fariz's broadcast_state, so you can't do this with Keras alone. I did not include broadcast_state below because it would take up too much space. All laid out, it looks like this:
model = Sequential()
#Encoder 1 layer
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
#Decoder Layer
model.add(LSTMDecoder2(.........))
#Encoder 2 Layer -- notice the change from x_sent_len to y_sent_len
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = True))
#softmax
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Anyone feel free to correct me if I'm wrong. I know this might be overkill, but I figure being as clear as possible is best so we're all on the same page.
@LeavesBreathe
Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled.
I guess what I'm asking is: how else would you do this? You have to do teacher forcing?
The idea is that a RNN is being used as a generative model of your data. So I want to find the log likelihood (negative cross entropy) of seq-2 given seq-1 or p(seq-2 | seq-1). So its like I have the correct label (say it was produced by some other system) and I want to score how good it is.
Yes, you would teacher force, like a language model. Say you have a sequence and a probability model P(seq) (given by the RNN); if you wanted to find how 'likely' the sequence is given the model, you would feed in the current time-step and ask it to predict the next one. You wouldn't (perhaps additionally) feed the prediction to the next step, because you might end up finding the likelihood of a different sequence. (I'm not sure about that last sentence, lol.)
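In terms of data prep, teacher forcing is just a shift of the target sequence (a toy numpy sketch; the token ids are made up, with 0 = padding, 1 = start symbol, 2 = end symbol):
import numpy as np

y = np.array([[1, 45, 12, 99, 2, 0, 0]])   # one target sequence: <start> w w w <end> <pad> <pad>

decoder_input = y[:, :-1]    # what the decoder is fed at each step (the TRUE previous token)
decoder_target = y[:, 1:]    # what it is asked to predict (the next token)
# For generation at test time you would replace decoder_input[t] with the model's
# own prediction from step t-1; for scoring a given pair you keep the true tokens.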
I need to look at the code again, but this model seems a little strange. Not to mention, I might be wrong.
So the decoder produces an output at every time-step, which is being fed to a new encoder. This guy will/should encode a sequence of predictions produced by the decoder into a single vector (if it works as a standard encoder).
Though a standard encoder produces a single output, so connecting it to a time-distributed layer without a repeatVector should break your code (which I assume it doesn't, so something else must be going on)
If you are connecting an encoder to a timedistributed dense layer, it is not really an 'encoder' as it must be producing an output at every time-step to feed to the dense layer.
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences = True))
model.add(Dense(hidden_variables_encoding))
I am not sure if this is an 'encoder' per se, as the last return_sequences = True.
I think the idea is to get a single vector encoding of the sequence
Hey @gautamb85, I think this is a good discussion, though it would require a lot of typing to communicate back and forth. Do you want to Skype chat for a bit if you're free? I think talking this out would be easier, and we can post back here when we have a conclusion. My username is leavesbreathe.
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
My error. You want a final vector at the end of the encoder, so setting return_sequences=False makes complete sense.
Fariz, if you want to join the skype chat, feel free to!
@gautamb85 I haven't tried Cho's rescoring thing. Can you please explain it to me: what is your input, what is the output, what are you trying to optimize, etc.?
@LeavesBreathe You two skype and later comment your conclusions here for us. I have some work pending, #964, #928, and documentation of #893. So pretty busy:)
Sounds good -- I'll skype with @gautamb85 (if he's cool with that), and we'll get back as to what we agree is the most optimal model. I'll test that model tonight if I can get everything setup.
@farizrahman4u
Let's say I have an English-French translation pair. You would encode the English sentence (input 1).
For the decoder, in Cho's paper (appendix), the GRU non-linearity gets 3 inputs:
- the context/encoding produced by the encoder (and its associated weight matrix)
- The french sentence (which at test time is replaced by the prediction)
- The hidden activation from the previous time-step.
1. Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written.
You can refer to https://github.com/dagbldr/dagbldr; he has a conditional_GRU layer vs a standard GRU_layer.
2. Though a workaround might be to just use the encoding to initialize the decoder recurrence (like in Sutskever's paper)
3. If one wants to use the model to generate translations, then the prediction is needed at test time. If the model is to be used for scoring, we just need to evaluate the log-likelihood / negative cost
- the optimization criteria etc. are exactly the same.
- I had an idea of how to do this using a graph() model, but the problem was getting an additional input to a GRU/LSTM layer. Plus I couldn't initialize the decoder recurrence as I didn't have a stateful RNN. So I can try that out.
Does a graph() model make sense for what I described? I'll post some pseudo code in a little while.
@LeavesBreathe Sure, Skype sounds good. Today is kinda busy, but I might be available later at night (if you are a late sleeper). Tomorrow evening should be cool. I'm in Montreal, hoping you're in an easy-to-coordinate time zone :)
I'm in Cincinnati, so we have the same time zone. I have some things I've got to do tonight, but I'm free now until 5, or 9pm to 11pm tonight. Whatever works best for you. Just add me on Skype and we can figure out a time. Sometimes I step away from my desk for a while, but if we schedule a time, I'll be sure to be online then.
@gautamb85
Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written
This is already done using the LSTMDecoder2 layer in seq2seq. The output from encoder is repeatedly input to the decoder at every time step.
@farizrahman4u Oh sweet! I didn't look at decoder2 yet.
So I would do a graph() model sort of like this:
input-1 -> encoder
input-2 -> LSTMDecoder2
Does a graph() architecture make sense? I would need it to feed in the second sequence, no?
I'll try it out and report back, but yes, it should solve the problem.
Q. I noticed that you have a dense layer between stacked LSTM in the encoder, and also between the encoder and decoder. Why do you do it this way, is it shape related?
Yes, the Dense in between the encoder and decoder is to make the shapes compatible. There is no Dense in between the LSTM stack layers, btw (it's only there in @LeavesBreathe's comment). There are RepeatVectors in between the stack layers, again for shape compatibility.
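The shape bookkeeping is just this (a sketch against a recent-ish Keras; the hidden sizes and sequence length are made up, and argument spellings may differ in older versions):
from keras.models import Sequential
from keras.layers.core import Dense, RepeatVector
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(LSTM(256, input_shape=(30, 128), return_sequences=False))  # (batch, 30, 128) -> (batch, 256)
model.add(Dense(256))                                                # stays 2D: (batch, 256)
model.add(RepeatVector(30))                                          # back to 3D: (batch, 30, 256)
model.add(LSTM(256, return_sequences=True))                          # (batch, 30, 256)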
@gautamb85 This could be done fairly easily. All you have to do is add an additional input to LSTMDecoder2, so that at a given time step it will have 3 inputs:
_Note: French word embedding size should be same as dimension of context vector from the encoder._
Now lets see some pseudo code:
french = Sequential()
french.add(Embedding(...))
english = Sequential()#this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., return_sequence=False))
#Cheat keras..make it think it's a single input
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
#Decoder
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
#optionally
english.broadcast_state(decoder)
model.compile(...)
I will post code for LSTMDecoder3 shortly!
Done!
import theano
import theano.tensor as T
from seq2seq.lstm_decoder import LSTMDecoder2

class LSTMDecoder3(LSTMDecoder2):
    def _step(self, si, sf, sc, so,
              x_tm1,
              h_tm1, c_tm1, v,
              u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
        #Inputs = output from previous time step, vector from encoder, french sentence
        xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
        xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
        xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
        xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
        i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
        c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
        h_t = o_t * self.activation(c_t)
        x_t = T.dot(h_t, w_x) + b_x
        return x_t, h_t, c_t

    def get_output(self, train=False):
        ip = self.get_input(train)
        v = ip[0]   #English context vector from encoder
        S = ip[1:]  #French sentence
        si = T.dot(S, self.S_i)
        sf = T.dot(S, self.S_f)
        sc = T.dot(S, self.S_c)
        so = T.dot(S, self.S_o)
        [outputs, hidden_states, cell_states], updates = theano.scan(
            self._step,
            sequences=[si, sf, sc, so],
            outputs_info=[v, self.h, self.c],
            non_sequences=[v, self.U_i, self.U_f, self.U_o, self.U_c,
                           self.W_i, self.W_f, self.W_c, self.W_o,
                           self.W_x, self.V_i, self.V_f, self.V_c,
                           self.V_o, self.b_i, self.b_f, self.b_c,
                           self.b_o, self.b_x],
            truncate_gradient=self.truncate_gradient)
        if self.state_input is None and self.remember_state:
            self.updates = ((self.h, hidden_states[-1]), (self.c, cell_states[-1]))
        for o in self.state_outputs:
            o.updates = ((o.h, hidden_states[-1]), (o.c, cell_states[-1]))
        return outputs

    def set_params(self):
        super(LSTMDecoder3, self).set_params()
        dim = self.input_dim
        hdim = self.hidden_dim
        self.S_i = self.init((dim, hdim))
        self.S_f = self.init((dim, hdim))
        self.S_c = self.init((dim, hdim))
        self.S_o = self.init((dim, hdim))
        self.params += [self.S_i, self.S_c, self.S_f, self.S_o]

    def build(self):
        self.set_params()
        self._build()
Might have typos/indentation issues because I am typing this on my phone and can not test it right now.
I will test it out.
A couple of questions:
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
Q. So you are concatenating the encoder output and the embedding for the french word into a single vector. So decoder3 takes this guy (concatenated vector) as the new additional input ?
It is also getting the encoding explicitly at every time-step (as was the case for lstmdecoder2)?
Q. I don't get why you need to concatenate them.
Can't it be -
french = Sequential()
french.add(Embedding(...))
english = Sequential()#this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., output_dim=xdim, return_sequences=False))
(maybe a reshape is needed over here)
Q. Can't I get the output from DeepLSTM, like: context = output from Deep_LSTM,
and then do model.add(context) instead of model.add(merge)?
Unless it's not easy to get the layer output:
decoder = LSTMDecoder3()
model = Sequential()
model.add(context)
model.add(LSTMDecoder3(....))
PS. You typed this on your phone?
Mind = Blown :)
@gautamb85
PS. You typed this on your phone?
Most of it is copy-pasted from LSTMDecoder2.
. So you are concatenating the encoder output and the embedding for the french word into a single vector.
No. I am concatenating the encoder output and the French SENTENCE (NOT WORD) into a single matrix(not vector)
If the French sentence was [word1, word2, word3, word4], after merging it with the context vector it would look like : [context, word1, word2, word3, word4]
So decoder3 takes this guy (concatenated vector) as the new additional input ?
No. This guy is THE input, not an additional one. Technically, the number of inputs for LSTMDecoder2 and LSTMDecoder3 is the same (which is 2). But logically, LSTMDecoder3 has one extra input (the merged input could be seen as 2 inputs).
It is also getting the encoding explicitly at every time-step (as was the case for lstmdecoder2)?
YES
Q. I don't get why you need to concatenate them
- We are packing the inputs for the decoder into a single tensor. The decoder then separates them out.
- If your sentence pair is ["how are you", "comment allez-vous"], your merged guy will look like this: [f("how are you"), e("comment"), e("allez"), e("vous")]. Here f("how are you") is your context vector that you get from the encoder.
Let's analyze the input to the decoder and its output for 4 time steps:
Time1: x1 = LSTM(context, word1, context)
Time2: x2 = LSTM(context, word2, x1)
Time3: x3 = LSTM(context, word3, x2)
Time4: x4 = LSTM(context, word4, x3)
Hope this helps.
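Shape-wise, the merge is nothing more than prepending the context vector to the embedded French sentence (numpy sketch with made-up sizes):
import numpy as np

embedding_dim = 64
context = np.random.randn(embedding_dim)        # f("how are you"), from the encoder
french = np.random.randn(3, embedding_dim)      # e("comment"), e("allez"), e("vous")

merged = np.vstack([context[None, :], french])  # (4, embedding_dim): [context, word1, word2, word3]
# Inside LSTMDecoder3, row 0 is peeled off as the context vector (v) and the
# remaining rows are treated as the French sentence (S).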
In your code, you are not using your french model at all !! Am I missing something?
My bad with the disconnected french model. (Also typed on my phone. Lol)
I see. Yeah, that's why you concatenate them, since the context is getting replaced after the first timestep.
I thought it was like this: you concatenate so that the context is there at every step. Is that correct?
Thanks again.
I will update you when I get something going.
@gautamb85 I think we are talking about slightly different models. Can you give me an example of your x_train, y_train etc?
@farizrahman4u I will have to get back to you on that in a little while.
From your code of LSTMdecoder3:
xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
where v is the context vector from the encoder and v_i is the weight matrix (for input gate only) and x_tm1 is the prediction from the previous time-step.
Could I replace the prediction x_tm1 with the actual french word, or alternatively add a term T.dot(x_t, W), where x_t is the current french word? Which I guess is why you need to concatenate the context and the french sentence.
But I think what you are suggesting with the architecture (the concatenation) should achieve the same thing.
I need more clarity on what we are trying to do here. Let's start with what your training data will look like. I will then pick the best model for you.
I eventually want to use the model with speech.
So my training data would be pairs of recordings that are padded to the same length. It would be a 3D matrix (N_samp, maxlen, feat_dim) representing a mini-batch of N_samp examples (corresponding to seq-1), and a similar matrix corresponding to seq-2.
Now this model can be trained by teacher forcing, i.e. instead of feeding the prediction (as you do in LSTMDecoder2, I believe) you can feed in the true label from the previous time-step.
This is how Cho et al. trained their models, both for generating translations and for scoring pairs of translations. In the generation case, you need to replace the true label with the prediction, as you don't have the true label.
However, in the scoring case, you can use the model as is (if it is trained by teacher forcing), since we have access to both seq-1 and seq-2.
This paper is pretty interesting: http://arxiv.org/pdf/1511.01432.pdf
I gotta read it some more to fully understand it, but looks like there's even more we can implement.
Fariz, I'm having some difficulty with getting your classes to work. I'm gonna try a few variations first, but if I can't get any of them working, I'll report back here tomorrow afternoon or so.
@farizrahman4u
Thank you for your decoder code. Is your decoder code an attention model?
@tttwwy No. It's just a stateful LSTM with readout and hidden state broadcasting.
@LeavesBreathe Please open an issue in seq2seq for any problem you are facing. Try recloning. I just made an update.
@farizrahman4u I will be sure to open an issue on your seq2seq. Give me at least a full day, as I'm testing a lot of variations/debugging before I come to you with the final problem.
@tttwwy I agree with you that attention is very important. I'm working on some code that I hope to implement in two to three weeks to address this issue. Maybe add on to Fariz's seq to seq model so we all have one working model.
@gautamb85 did you want to still chat tonight? I think it would be good to share each other's ideas with each other. Doesn't have to be long. I can't pm you over github, so if you can just add me on skype and we can figure out a time. I'm free all day today.
Hey, sorry for the late reply. Are you free at 10-10:30 Eastern time?
@gautamb85 @LeavesBreathe @farizrahman4u have you seen this: http://www.tensorflow.org/tutorials/seq2seq/index.md? Google just open sourced a deep learning toolkit with a graphical interface. It includes a sequence to sequence model. I am unreasonably excited. It has a Python interface and an attentional model, something I've really wanted and needed for my research.
@simonhughes22 since you are interested in models besides Keras, have you seen blocks-examples? They have a machine translation model with attention working out of the box for en-cs. https://github.com/mila-udem/blocks-examples
Slightly off topic:
@fchollet tweeted that Keras will seamlessly support both Theano and TensorFlow. Does this mean that Keras models could run on Android? Because TensorFlow has an Android example. In the meantime, is there any way to get a Keras model to work on Android as of now? Has anyone tried it (like turning off all the C++ stuff and running Theano in pure Python mode)?
@gautamb85, sorry but we had a power outage yesterday -- internet is still out, but hopefully we can chat tonight. Do you mind adding me on Skype (username is leavesbreathe) so that we don't need to take up space on this thread to schedule talking?
Hey guys, so I pretty much spent the entire day reading up on TensorFlow. I think the bottom line is that it has more capabilities (an attention mechanism), but it is much messier than Keras. So basically, I've decided to try out TensorFlow, but I still want to use Keras (as I like Keras's community and logic flow more).
With all of that being said, I think it would be interesting to compare results using Keras versus TensorFlow. I hope to have TF up in the next week or so to see what type of results I'm getting.
@EderSantana @simonhughes22 @melonista @LeavesBreathe
there is an attention-based NMT model which may be of some help to you:
https://github.com/kyunghyuncho/dl4mt-material/tree/master/session3
@LeavesBreathe TensorFlow code is messy compared to Keras. It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features. We even have a working implementation of a Neural Turing Machine! (#990). We are the first to open source it.
It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features.
I totally agree with you. Keras is much cleaner and easier to contribute to (I don't even think TF is allowing PRs).
However, I want to at least try it for a little bit, because there may be a few things I learn from TF that we can implement in Keras. For example, they provide an attention mechanism here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/seq2seq.py#L453-520
Btw, I was looking at the Neural Turing Machine that Eder wrote earlier and it looks so cool.
I agree with @farizrahman4u!!! Keras is much easier to program, because we are focused on offering higher-level APIs.
To be more precise, I think we are the first to open source an NTM with RNN controllers. Others had implementations with feedforward controllers (which are less powerful).
For example, here is a simple LSTM classifying MNIST (running row by row) in TensorFlow: https://github.com/EderSantana/TwistedFate/blob/master/mnist_lstm.py It is fast to start running, but we have to hard-code all the dimensions (fixed batch size, fixed sequence length, etc.).
@LeavesBreathe Sorry for not getting back to you. I don't have a Skype account; I will set it up and add you over the weekend.
@farizrahman4u I had a question about your code, specifically relating to prediction feedback:
def _step(self, si, sf, sc, so,
          x_tm1,
          h_tm1, c_tm1, v,
          u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
    #Inputs = output from previous time step, vector from encoder, french sentence
    xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
    xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
    xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
    xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
    i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
    f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
    c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
    o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
    h_t = o_t * self.activation(c_t)
    x_t = T.dot(h_t, w_x) + b_x
    return x_t, h_t, c_t
Q. In that code snippet, x_t is the prediction (which is getting fed back via scan) and it is initialized as v (the context produced by the encoder), correct?
Q. I am confused because, if this were regression, then x_t would represent the actual prediction of the model. However, for classification, this x_t would get fed to a softmax function, and then we would sample/argmax to get the actual prediction.
Is it equivalent to feed back just x_t (without doing softmax etc.), and does it work the same way at test time? I mean, at test time the x_t (before softmax) is fed back as the 'prediction'; however, the actual (visible) prediction is made by feeding these hidden predictions to a softmax layer after the theano scan is done.
Q. I assume the get_outputs function that (returns the outputs) feeds them to a dense (softmax) layer that makes the prediction ?
Ps. I know a lot of that may not sound clear, and I am happy to clarify.
@gautamb85 no problem no rush -- I don't know if its necessary that we talk, but if you want to chat, I think it would be good! add me whenever you want!
@gautamb85 x_tm1 is the output from the previous timestep (with initial value v); it need not be the actual prediction of the model at that timestep (which is y_tm1, because that is difficult to access). Still, x_tm1 is a good representation of y_tm1. The above layer is a general layer, like the default LSTM in Keras. Whether it should do regression or classification is up to you: you simply stack activation layers over it. That being said, try doing a sigmoid/tanh over x_t and see if you find anything interesting.
Hey guys, I'm back from exploring TensorFlow, and I'm fired up to keep working on @farizrahman4u's seq2seq. I'm having a few issues, Fariz, which I will post to your seq2seq channel. However, here are a few takeaways I got from TensorFlow:
@LeavesBreathe I am sort of offline right now. But as my seq2seq repo has gotten more attention than I anticipated, I will be spending more time on it once I am done with my upcoming exams :)
Best of luck with the exams -- I must say that your seq2seq repo has gotten much attention from my skype contacts -- they keep asking me about it and are constantly comparing it to tensorflow's seq2seq. I think you and Tensorflow have the best working seq2seq models right now.
@LeavesBreathe
. I think you and Tensorflow have the best working seq2seq models right now.
Thats a huge compliment. Thanks!
Regarding the attention mechanism, I will be converting the following project to Keras:
https://github.com/npow/RNN-EM
The API will be similar to that of an LSTM, so just replacing all LSTMs with the RNN_EM class would give you a seq2seq model with an attention mechanism.
That's gonna be killer. Beyond that, the only major feature that I see that tensorflow has is a sampled_softmax, but I'm trying to work on a hierarchical softmax right now. It will definitely take me a while as it has been attempted already in Keras.
@LeavesBreathe excited for that: would be a very useful addition.
Hey Guys just as another update, I dabbled more with TensorFlow, and I am getting some pretty good results with it. The lowest perplexity I've gotten has been 31. Though with Fariz's Seq2Seq, I got a perplexity of 45 (and that's with no attention) so there is plenty of optimization still left to do!
I'm also working on a generic attention decoder, but I'm waiting for multiple-input support on Graph rather than using a concat; I feel that it's cleaner. I'm organizing a hackathon on attention models this Saturday, so I'll keep you posted. If you guys have any suggestions, that would be appreciated.
Good to hear. My attention has shifted more to TensorFlow lately, and I'm trying to implement curriculum learning right now. It's really hard for me to determine whether TensorFlow or Keras is a better platform; I think they each have their strengths.
Fariz's platform is really good as well. Right now, because of attention, I am still getting better results in TensorFlow.
Also, Fariz. I recently had a chat with kkastner and he mentioned a possible improvement with averaging the hidden states in the decoder. I'm going to test that over the next two weeks (each run takes like 6 days). If I get improved results, I'll report back here on this thread as this is something pretty easy to implement.
@farizrahman4u @LeavesBreathe thanks for a very illuminating discussion here. Would you guys talk a bit about the incoming data into the seq2seq or deepLSTMs that you're using? I'm a bit unclear right now about how to model my data in order to do the predictions so far.
@viksit there are a lot of ways you can input your data into seq2seq. The most basic is one-hotting, where each word is assigned its own index (a one-hot vector over the vocabulary). Then you can add what's called an embedding layer, which compresses that sparse vector into a dense, lower-dimensional representation. If you scroll up this thread, @simonhughes22 helped me understand this.
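A tiny example of what I mean (made-up vocabulary size and dimensions; the Embedding call mirrors the snippets earlier in this thread):
from keras.models import Sequential
from keras.layers.embeddings import Embedding

vocab_size, embedding_dim = 10000, 64
model = Sequential()
# input: (batch, timesteps) integer word indices
# output: (batch, timesteps, embedding_dim) dense vectors
model.add(Embedding(vocab_size, embedding_dim))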
Ah, I'm already using custom-trained word vectors. Some thoughts: I'm using a toy dataset here. I did see @simonhughes22's comment above, and rather than "book-ending" my sentences, I'm simply using end-of-sentence markers.
What is your name?
My name is X.
How old are you?
I am Y years old.
I then transform this data into,
X_train,y_train
What is your name?<EOS>,My name is X.<EOS>
How old are you?<EOS>,I am Y years old.<EOS>
An end-to-end example would actually be quite helpful to infer some of the details.
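Here is roughly how I am turning those pairs into arrays at the moment (a sketch; the vocabulary handling is deliberately minimal and <EOS> just gets its own index):
import numpy as np

pairs = [("What is your name ? <EOS>", "My name is X . <EOS>"),
         ("How old are you ? <EOS>",   "I am Y years old . <EOS>")]

# build a shared word -> index map (0 reserved for padding)
vocab = {w for q, a in pairs for w in (q + " " + a).split()}
word2idx = {w: i + 1 for i, w in enumerate(sorted(vocab))}

def encode(sentence, maxlen):
    ids = [word2idx[w] for w in sentence.split()]
    return ids + [0] * (maxlen - len(ids))   # pad with zeros on the right

maxlen = max(len(s.split()) for q, a in pairs for s in (q, a))
X_train = np.array([encode(q, maxlen) for q, a in pairs])
y_train = np.array([encode(a, maxlen) for q, a in pairs])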
@viksit, lately I have been working with TensorFlow. Not saying that Fariz's platform is bad or anything, but TF has just given me a better understanding of what's going on.
As such, I don't mask anything. I simply pad with 0's at the end for both input and output.
Each word is a timestep, so I'm not quite sure what you're asking.
When you say build... do you mean compile? About 10 mins. To train? About 5 days, depending on data size and compute power. For me, with two 980 Tis, it takes about 5 days for a network with 1 million samples.
I don't have an end to end example, but you might want to check out this: http://www.tensorflow.org/tutorials/seq2seq/index.html
Hi everyone,
I have made the following updates to my seq2seq implementation:
- SimpleSeq2seq model, which uses only pure Keras, to serve as a performance baseline against which other models can be compared.
- Recurrent API.
Thanks a lot Fariz, I'm really swamped right now, but this is nice to have for sure. Really appreciate the help.
Cool!
I have been following this thread. I am not exactly doing text generation or machine translation. @farizrahman4u @LeavesBreathe @simonhughes22 your insights were really useful. @LeavesBreathe I would like to know if TensorFlow gave you good results. I am going to try out @farizrahman4u's model as well, so I just wanted to know if TensorFlow worked better. I would also like a suggestion on whether something like a CNN for feature extraction followed by an LSTM encoder-decoder would work. I am working at the character level, not the word level. Thanks a lot in advance.
@pralav I will say TensorFlow is really good, but it is incredibly slow and hogs a ton of memory as of right now. They have plans to improve, but be prepared to spend a lot of money on graphics cards if you plan on using TensorFlow.
Integrating CNNs with LSTMs is something that could potentially lead to interesting results. You have to be careful how you structure it, though.
@LeavesBreathe Thanks a lot. I will hopefully report back once I experiment with more models. Thank you.
@LeavesBreathe So what do you suggest using instead of TensorFlow for NLP tasks, to help minimize the memory hogging?
Well, you can wait until TensorFlow improves its memory-hogging problems, buy more GPUs, or simply switch to Theano, which is much faster and doesn't hog the memory nearly as much as of right now =/
Hi @farizrahman4u ,
In this post https://github.com/nicolas-ivanov/debug_seq2seq it is mentioned that your seq2seq implementation is not performing well, and perhaps not generating sequences properly. Have you looked into those examples? Were your recent commits, especially the attention modules, meant to fix some of the issues mentioned there? Or have there been any other tests like this which show good results with your implementation, perhaps a benchmark against TensorFlow?
Thanks
Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right?
Isn't it possible to just skip the input_length parameter and give just the input_dim? Or input_shape=(None, input_dim)?
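For example, something like this (a sketch; whether it builds and trains cleanly depends on the Keras version, and each batch still needs a consistent length within itself):
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import TimeDistributedDense

input_dim, vocab_size = 128, 5000
model = Sequential()
# the timesteps dimension is left as None, so different batches can have different lengths
model.add(LSTM(64, input_shape=(None, input_dim), return_sequences=True))
model.add(TimeDistributedDense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')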
@NickShahML: don't people now use an encoder-decoder architecture for sequence-to-sequence learning?