I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?
UPDATED NOV 13, 2017
You have to pass a weight matrix to the Embedding layer. Here is an example.
Let's say index_dict is a dictionary that maps all the words in your vocabulary to indices from 1 to n_symbols (0 is reserved for masking). So, an example index_dict is the following:
{
'yellow': 1,
'four': 2,
'woods': 3,
'ornate': 4,
'woody': 5,
'cyprus': 6,
'marching': 7,
'canes': 8,
'caned': 9,
'hermann': 10,
'lord': 11,
'meadows': 12,
'shaving': 13,
'swivel': 14
...
}
And you also have a dictionary called word_vectors that maps words to vectors like so:
{
'yellow': array([0.1,0.5,...,0.7]),
'four': array([0.2,1.2,...,0.9]),
...
}
The following code should do what you want:
import numpy as np
from keras.layers.embeddings import Embedding

# assemble the embedding_weights in one numpy array
vocab_dim = 300  # dimensionality of your word vectors
n_symbols = len(index_dict) + 1  # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]

# define inputs here
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=True)
embedding_layer.build((None,))  # if you don't do this, the next step won't work
embedding_layer.set_weights([embedding_weights])
embedded = embedding_layer(input_layer)
# ... continue model definition here
Note that this kind of setup will result in your embeddings being trained from their initial point! If you want them fixed, then you have to set trainable=False.
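For completeness, here is a minimal sketch of the fixed-weights variant (same names as above; the only change is the trainable flag):
# Same setup as above, but the layer is frozen so the pretrained vectors
# are left untouched during training.
frozen_embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=False)
frozen_embedding_layer.build((None,))  # build first so there is a weight matrix to overwrite
frozen_embedding_layer.set_weights([embedding_weights])
embedded = frozen_embedding_layer(input_layer)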
There is no need to skip the embedding layer: setting word vectors as the initial weights of the embedding layer is a valid approach. The word vectors will get fine-tuned for the specific NLP task during training.
Has anybody else attempted to embed the word vectors into a model?
I've managed to create the model, however I'm not able to achieve a worthwhile level of accuracy yet. I've used the "20 newsgroups dataset" from scikit-learn to test this model, with my own w2v vectors. The best accuracy I've achieved so far is 28% over 5 epochs, which is not great (the best scikit script gets 85%). I intend to continue experimenting with the network configuration (inner dimensions and epochs initially). I do suspect that the number of dimensions is too high for such a small dataset (1000 samples).
Will update again if the results improve
From what I've seen, training your own vectors on top of a custom dataset has given me much better accuracy within that domain.
That said - any updates on this?
@viksit
There are 3 approaches:
1. Learn the embedding from scratch - simply add an Embedding layer to your model.
2. Fine-tune learned embeddings - this involves setting word2vec / GloVe vectors as your Embedding layer's weights.
3. Use word2vec / GloVe word vectors as inputs to your model, instead of one-hot encoding.
The third one is the best option (assuming the word vectors were obtained from the same domain as the inputs to your models; e.g., if you are doing sentiment analysis on tweets, you should use GloVe vectors trained on tweets).
In the first option, everything has to be learned from scratch. You don't need it unless you have a rare scenario. The second one is good, but your model will be unnecessarily big, with word vectors for words that are not used frequently.
@farizrahman4u agreed on those counts. My domains are a bit more specific and I have had a lot more luck with option (2) than with (1) or (3) so far. An easy way to address the size problem with (2) is to prune the vocabulary down to the top k words.
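A minimal sketch of that pruning idea (word_counts is a hypothetical {word: frequency} dict built from your corpus; index 1 is used here as a catch-all for out-of-vocabulary words):
k = 20000  # keep only the k most frequent words
top_words = sorted(word_counts, key=word_counts.get, reverse=True)[:k]
index_dict = {word: i for i, word in enumerate(top_words, start=2)}  # 0 = mask, 1 = unknown
UNK_INDEX = 1

def to_indices(tokens):
    # map each token to its index, falling back to the unknown index
    return [index_dict.get(token, UNK_INDEX) for token in tokens]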
@farizrahman4u Thanks for sharing the ideas. I have a question on your first approach.
Learning embeddings from scratch: each word in the dictionary is represented as one-hot vector, and then this vector is embedded as a continuous vector after applying embedding layer. Is that right?
@MagicYoung Yes.
Looks cool! Are the results to your liking?
Results after 2 epochs:
Validation Accuracy: 0.8485
Loss: 0.3442
@dandxy89
Wonderful examples for me. I'll try them in my context. Thanks.
And if I want to feed pre-trained word2vec vectors to the LSTM directly, how do I handle different sequence lengths?
My attempt: set maxlen=200 (for every sequence) and word2vec dim=600. If a sequence's length is 100, then the first 100 rows ([0:100)) hold float numbers (every row is a word vector) and rows [100:200) are padded with zeros, i.e. the remaining rows are all zeros. But after doing that, I get a NaN loss, which has confused me for a long time, like issue #1360. What can I do?
Thanks.
@liyi193328 are you using keras' sequence.pad_sequences(myseq, maxlen=maxlen)?
Looks like your padding is to the _right_ of the vectors, whereas it should be to the left.
from keras.preprocessing import sequence
In [69]: sequence.pad_sequences([[1,2], [1,2,3]], maxlen=10)
Out[69]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3]], dtype=int32)
Secondly, you should be using categorical_crossentropy as your loss as opposed to mean_squared_error. See https://github.com/fchollet/keras/issues/321
@viksit Thanks.
I don't use sequence.pad_sequences; I pad the last rows with zeros manually.
The input to pad_sequences is a 2D array, say A, where A[i,j] is the index of a word in the vocabulary and each row of A represents a sentence. But in my case every word is a 600-dim vector, not an index, so I can't use it.
Is my understanding right? How can I pad zeros then?
Thanks.
You need to pad _before_ you convert the words to vectors (presumably you have a step where you have only word indexes).
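A small sketch of that order of operations, reusing embedding_weights from earlier in the thread (index_sequences and maxlen are assumed to exist):
import numpy as np
from keras.preprocessing import sequence

# pad the *index* sequences first; index 0 maps to the all-zeros masking row
padded = sequence.pad_sequences(index_sequences, maxlen=maxlen)  # shape: (n_samples, maxlen)
# then look the padded indices up in the embedding matrix
X = embedding_weights[padded]  # shape: (n_samples, maxlen, vocab_dim)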
@viksit Thanks.
But how do I change the word index to a word vector?
Actually I don't use word indexes; I use word vectors directly.
More specifically, for three sentences:
the index 2D array is [ [1,2,3], [4], [2,3] ] ---padding---> [ [1,2,3], [0,0,4], [0,2,3] ]
each word vector (4-dim) is:
then after padding, the 3D array shape is (3,3,4), like:
Is this specific example right?
Thanks.
That's correct.
@viksit Thanks.
My mistake was initializing the array with np.empty; I should have used np.zeros. Everything goes well now.
@dandxy89 @farizrahman4u @viksit
Do you mean that, when the Embedding layer is initialized with word2vec weights learned from another corpus, the LSTM model converges quickly?
Here is my code; just 2-3 iterations gets a high score.
https://github.com/taozhijiang/chinese_nlp/blob/master/DL_python/dl_segment_v2.py
Thanks for sharing. I benefited a lot.
Following @liyi193328's comment:
If this is my X_train:
[
[ [1,1,1,1], [2,2,2,2], [3,3,3,3] ],
[ [0,0,0,0], [0,0,0,0], [5,5,5,5] ],
[ [0,0,0,0], [2,2,2,2], [3,3,3,3] ]
]
How should I structure my Y_train given that each word (word-level training) will have its own tag (Y, which is also multi-class)? Is it like:
[
[1,2,3],
[0,0,3],
[0,2,1]
] ?
Because I am having an error:
"Exception: All input arrays and the target array must have the same number of samples."
Thank you very much!
@ngopee
For multi-class labels, if the label class is 2 (out of 3 classes total), then it must be transformed into the 1D array [0,0,1].
More specifically, if a sentence has x tokens and every token has a label out of y classes, then that sentence's labels (Y_train) have shape (x, y), a 2D array.
Hope that solves the problem.
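A small sketch of that shape, assuming per-token integer tags for one sentence and a hypothetical y = 4 classes:
from keras.utils.np_utils import to_categorical

y_classes = 4                      # hypothetical number of tag classes
token_labels = [0, 2, 1]           # hypothetical tags for a 3-token sentence
Y_sentence = to_categorical(token_labels, y_classes)
print(Y_sentence.shape)            # (3, 4): one one-hot row per token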
@liyi193328
Thank you very much for your reply!
Yes, I realised that later on, but now I'm having another issue which I'm not sure how to fix:
"if every token has a label and has y classes, then all the labels's(Y_train) shape is (x,y), 2D array". I'm not sure I follow this part. But below is what I have so far.
So, in my case, each token (word vector) has its own tag.
Here is a sample of my input:
X_train =
[
[ 8496 1828 …5447]
[ 9096 8895 …13890]
[ 5775 115 … 15037]
[ 6782 9918 … 5048]
]
Y_train=
[
array([[ 0., 0., 0., 1.], [ 0., 0., 1., 0.], …[ 0., 0., 0., 1.]]),
array([[ 0., 0., 1., 0.], [ 0., 0., 0., 1.],…[ 0., 0., 1., 0.]]),
array([[ 0., 0., 1., 0.], [ 0., 0., 0., 1.], …[ 0., 0., 1., 0.]]),
array([[ 0., 1., 0., 0.], [ 0., 1., 0., 0.], …[ 0., 1., 0., 0.]])
]
I am getting this error:
AssertionError: Theano Assert failed!
Apply node that caused the error: Assert(Elemwise{Composite{(i0 - EQ(i1, i2))}}.0, Elemwise{eq,no_inplace}.0)
Inputs types: [TensorType(int8, matrix), TensorType(int8, scalar)]
Inputs shapes: [(1, 100), ()]
Inputs strides: [(100, 1), ()]
Inputs values: ['not shown', array(0, dtype=int8)]
Here is my code:
vocab_dim = 300
maxlen = 100
batch_size = 1
n_epoch = 2
print('Keras Model...')
model = Sequential() # or Graph or whatever
model.add(Embedding(output_dim=vocab_dim,
                    input_dim=n_symbols + 1,
                    mask_zero=True,
                    weights=[embedding_weights]))
model.add(LSTM(vocab_dim, return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1))
print('Compiling the Model...')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              class_mode='categorical')
print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
          validation_data=(X_test, y_test), show_accuracy=True)
print("Evaluate...")
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size,
                            show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Thank you very much!
@ngopee
model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1)): output_dim == 1?
I think that number should be the number of classes, though I haven't read the whole context.
@liyi193328
Thank you very much for pointing that out. That would have been yet another mistake.
However this did not seem to fix the issue I previously had. Any insight on what I could be doing wrong?
Thanks!
@ngopee
Sorry for the late reply! Your code as written does many-to-one classification, but your actual goal is many-to-many, so the logic needs to change.
@ngopee
What are the dimensions of your output/target?
@liyi193328
Yes, you are right. I thought the TimeDistributedDense layer is what I needed to make the model many-to-many? This is what I understood by reading other Keras issues. Could you please explain what the TimeDistributedDense layer actually does, or point me to any good reading material?
Is it possible for me to have only 1 LSTM layer and no more layer afterwards?
@dandxy89
I made a mistake above, but my output dimension is 3. I have 3 classes.
Thank you very much for your replies! I appreciate it!
@ngopee
https://github.com/fchollet/keras/issues/1029 may help you.
@liyi193328
Thank you very much! I got rid of the embedding layer and I'm feeding my pre-trained word vectors to the LSTM layer. Now it works!
Thanks again for the help!
Which is more efficient? Feeding the pre-trained word vectors directly or using the embedding layer with word vector weights? I would think that the two options are equivalent if you just set layer.trainable = False.
@PiranjaF are you asking in terms of performance of the model?
@viksit I think that the two approaches would be identical in terms of the final solution after X iterations, but have differences in training time. Is that true and which approach would then be the fastest?
So yes, I'm asking in terms of the training performance.
From my experiments, they were definitely not identical. I've seen better quality when taking global word2vec data and tuning it for a domain by using it as the starting weights for an embedding layer, but it was less performant. Letting an embedding layer loose on your own domain data is faster, but quality-wise (at least in my scenario) it suffered more.
But did you also set the weights for the embedding layer using the global word2vec and then turn off the trainable attribute for the embedding layer? That should make them identical I would guess.
@dandxy89: I tried replicating your script. The best I got was Validation Accuracy: 0.7804, Loss: 0.4597, even after 30 epochs.
Any suggestions? I have blatantly copied your code and run it as is. Nowhere close to the results you reported: Validation Accuracy: 0.8485, Loss: 0.3442.
@ngopee @farizrahman4u @viksit @liyi193328 @dandxy89
Can you please share the piece of code where you directly fed the word vectors to the model (without an embedding layer)?
This all looks great. To sum up, could you please recommend a link to the final/best code for a Keras word2vec LSTM, ideally for short texts and sentiment analysis, though really any Keras word2vec LSTM code would help.
Thank you very much in advance...
@anujgupta82 - I have modified my script and pushed it to my repository. The modification I made was to increase the number of iterations when building the word2vec model. Also, I never got around to directly feeding the vectors into the model instead of an embedding matrix. It should be a fairly small change in order to do that, and quite interesting to see the difference.
25000/25000 [==============================] - 1321s - loss: 0.2803 - acc: 0.8849 - val_loss: 0.3220 - val_acc: 0.8609
Evaluate...
25000/25000 [==============================] - 320s
('Test score:', 0.32195980163097382)
('Test accuracy:', 0.86087999999999998)
@Sandy4321 - take a look at this. It may help.
@dandxy89
Ran your revised code, still got much lower numbers:
Epoch 1/2
25000/25000 [==============================] - 224s - loss: 0.5040 - acc: 0.7463 - val_loss: 0.4709 - val_acc: 0.7625
Epoch 2/2
25000/25000 [==============================] - 224s - loss: 0.3768 - acc: 0.8274 - val_loss: 0.4571 - val_acc: 0.7810
Evaluate...
25000/25000 [==============================] - 37s
('Test score:', 0.4571321966743469)
('Test accuracy:', 0.78095999999999999)
Can someone else try his code and report the numbers you are getting?
@fchollet: any reasons for this? I ran the experiment a couple of times to eliminate chance.
@anujgupta82
Feeding the vectors directly:
model = Sequential()
model.add(LSTM(num_hidden_nodes, return_sequences=True, input_shape=(maxlen, vocab_dim)))
model.add(Dropout(0.2))
model.add(LSTM(num_hidden_nodes, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributedDense(output_dim=nb_classes, input_dim=num_hidden_nodes, activation='softmax'))
@anujgupta82 what backend are you using? I used Tensorflow, not that I think that will have a significant impact.
@dandxy89: I am using theano (latest version) and running my code on AWS g2.2xlarge instance.
My virtual environment is as follows:
backports.ssl-match-hostname==3.4.0.2
boto==2.39.0
bson==0.4.1
bz2file==0.98
certifi==2015.9.6.2
Cython==0.23.4
funcsigs==0.4
gensim==0.12.3
graphviz==0.4.10
h5py==2.5.0
httpretty==0.8.10
Keras==0.2.0
matplotlib==1.4.3
mock==1.3.0
nltk==3.1
nose==1.3.7
numpy==1.10.1
pandas==0.17.0
pbr==1.8.1
plotly==1.8.11
pydot==1.0.2
pymongo==3.0.3
pyparsing==1.5.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
requests==2.8.1
scikit-learn==0.16.1
scipy==0.16.0
seaborn==0.6.0
six==1.10.0
sklearn==0.0
smart-open==1.3.2
Theano==0.7.0
tornado==4.2.1
wheel==0.24.0
I would like to know what is causing the difference in results. Can you please add your virtual environment as a requirements.txt in your git repo? It will help me replicate things better. Meanwhile, can you re-run your code with my environment?
@ngopee : I will try your suggestion and get back to you soon
Using sudo pip install keras gensim -U?
In the meantime I will rerun the code on Theano and let you know my result.
@dandxy89 : upgraded the packages and got the results
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 211s - loss: 0.4418 - acc: 0.7908 - val_loss: 0.3645 - val_acc: 0.8428
Epoch 2/5
25000/25000 [==============================] - 211s - loss: 0.2836 - acc: 0.8842 - val_loss: 0.3214 - val_acc: 0.8624
Epoch 3/5
25000/25000 [==============================] - 212s - loss: 0.1818 - acc: 0.9320 - val_loss: 0.3684 - val_acc: 0.8576
Epoch 4/5
25000/25000 [==============================] - 212s - loss: 0.0992 - acc: 0.9650 - val_loss: 0.4339 - val_acc: 0.8570
Epoch 5/5
25000/25000 [==============================] - 211s - loss: 0.0580 - acc: 0.9800 - val_loss: 0.4965 - val_acc: 0.8512
Evaluate...
25000/25000 [==============================] - 57s
('Test score:', 0.49650939433217051)
('Test accuracy:', 0.85124)
Thank you so much for all the help
Anuj, may you share a link to your code, please? It would be great to try.
Great! Looking at the results @anujgupta82 the model is over-fitting after two epochs.
@Sandy4321 the link I recommended above contains all the data and scripts.
Dan, I see, thanks.
By the way, do you know of links to other attempts to use word2vec for sentiment analysis, maybe a pure RNN with an improved optimisation method like Rprop? I only have a CPU laptop, so some fast-running code is needed.
Thanks,
Sandy
I am running the code I linked to above on an old Dell laptop, and it's running fine. If you use the very well prepared documentation on Keras.io and the examples you should easily be able to do what you have described above.
You can bypass the Gensim requirement and get the model to learn its own embedding matrix as it trains. I would recommend taking a look at any of the "IMDB" examples that are available.
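As a rough sketch, learning the embedding from scratch looks something like this (the hyperparameters are placeholders, not recommendations):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

max_features = 20000   # vocabulary size (assumed)
maxlen = 100           # padded sequence length (assumed)

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))  # no pretrained weights; learned during training
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])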
@Sandy4321 : My code is the one provided by @dandxy89
@dandxy89 I used your code to train a "stacked LSTM" and got 0.8710. My attempt is to push LSTM to its limits for IMDB classification.
I replaced model.add(input_dim) with:
model.add(LSTM(1024, return_sequences=True)) # return_sequences=True forces it to return a sequence
model.add(Dropout(0.3))
model.add(LSTM(1024))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
When do you know its a good time to try a deep LSTM ?
Anuj, may you share the code?
@viksit @farizrahman4u any suggestions as to how I can further improve the results (code ^)?
A new dropout scheme for the embedding layer: conventional dropout leads to overfitting.
http://arxiv.org/abs/1512.05287
"A Theoretically Grounded Application of Dropout in Recurrent Neural Networks", Yarin Gal
Anuj, super, thanks a lot. But in
https://github.com/anujgupta82/DeepNets/blob/master/LSTM/IMDB_Embedding_w2v_LSTM_3.ipynb
you _train your own w2v model_ on the dataset vocab. What if one needs to use a ready-made w2v model, for example from gensim?
@Sandy4321 I have done that too: I used the pre-trained Google word2vec model "GoogleNews-vectors-negative300.bin" but got very similar results.
Will share those notebooks too
Thanks for the scripts! It is unclear to me whether unknown words should be removed from the sentence, added as a separate token in the word2vec or "masked" in Keras. Perhaps @dandxy89 or someone else could help explain this?
If a word does not appear, a number of things can be done:
The NLTK or SpaCy packages are very good for the first few points :+1:. So is the NLTK Book
It really depends on what you intend to use your model for... If this is in a production environment then I would suggest simply using an unknown label in your dictionary and keeping the masking for padding.
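One possible sketch of that unknown-label idea, reusing index_dict, word_vectors and vocab_dim from the first comment (the mean-vector initialisation for unknowns is just an assumption, not a recommendation):
import numpy as np

unk_index = len(index_dict) + 1                                 # one extra slot after the known words
embedding_weights = np.zeros((len(index_dict) + 2, vocab_dim))  # row 0 stays reserved for masking
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]
# unknown words all share one row; here initialised to the mean of the known vectors
embedding_weights[unk_index, :] = embedding_weights[1:unk_index].mean(axis=0)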
@dandxy89 can we change the dataset/vocab during further training of a word2vec model ?
So, for example, can I take Google's pretrained model "GoogleNews-vectors-negative300.bin" and further train it on my dataset?
Why I might want to do so: to compensate for the lack of huge data (unlike the GoogleNews corpus) while fine-tuning the model to my dataset.
My understanding is that gensim does not allow this.
@dandxy89 Thanks for the help! I'm curious as to why your script both adds 1 to n_symbols and afterwards adds 1 more when setting input_dim. Don't you only need len(vocab) + 1 (not len(vocab) + 2) to account for the 0th index?
@PiranjaF error on my part - it is intended to be used as an example only...
@anjishnu store the vectors somewhere since the bin file is huge then feed them into models. Read the documentation provided by Gensim and look at the source code for more insight in to how it is working...
@dandxy89 No worries - it's great code. I'm probably overlooking something really simple, but where do you get the IMDB dataset from?
from keras.datasets import imdb
@dandxy89 Love the code example! Can you help a novice out: what do maxlen and input_length refer to, and what is their effect?
@BrianMiner
_input_length_ / _maxlen_: In my example the two are equivalent to one another. Their purpose is to transform each of the examples (sentences) to a fixed length: sentences, irrespective of length, will be extended or reduced to that fixed size. Those that are extended are typically padded with 0s; it can be any value, however in most cases 0 is used as the masking value.
For further information check the documentation and run the code line by line
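A quick illustration of both effects, using the defaults of pad_sequences (which pads and truncates at the front):
from keras.preprocessing import sequence

print(sequence.pad_sequences([[1, 2], [1, 2, 3, 4, 5]], maxlen=3))
# [[0 1 2]    <- shorter sequence extended with the masking value 0
#  [3 4 5]]   <- longer sequence reduced to the last maxlen steps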
@PiranjaF
Ah, so if there are 105 words in a sentence, the last 5 are dropped?
Yes. It's one of the decisions you have to make during the model-building phase, same as whether you want to include stopwords, how to deal with unseen words, pruning the vocabulary, etc.
@farizrahman4u @viksit
"""
There are 3 approaches:
1. Learn the embedding from scratch - simply add an Embedding layer to your model.
2. Fine-tune learned embeddings - this involves setting word2vec / GloVe vectors as your Embedding layer's weights.
3. Use word2vec / GloVe word vectors as inputs to your model, instead of one-hot encoding.
The third one is the best option (assuming the word vectors were obtained from the same domain as the inputs to your models; e.g., if you are doing sentiment analysis on tweets, you should use GloVe vectors trained on tweets).
In the first option, everything has to be learned from scratch. You don't need it unless you have a rare scenario. The second one is good, but your model will be unnecessarily big, with word vectors for words that are not used frequently.
"""
Just add some resource on which option is better.
http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf shows that option 2 is often better than option 3, and both are better than option 1. The paper experiments with different datasets, though it uses a CNN.
Quoting myself:
The second one is good, but your model will be unnecessarily big, with word vectors for words that are not used frequently.
I found this "issue" quite informative :) I am curious if a valid application for word embeddings like this is for search engine relevance, where we have the search term, title and description of a web page and the relevance rating (1-5). Something similar to this: https://karthkk.wordpress.com/2016/03/22/deep-learning-solution-for-netflix-prize/ but with query and title text instead of movie and user ids. I started coding this in Keras and was curious if it made sense?
I am wondering more about the embedding layer in keras. Is there any notion of context words around each word like this: http://deeplearning.net/tutorial/rnnslu.html#word-embeddings
Or is the embedding simply each word to a vector to be trained?
If I want to use the third method, how can I transform my data into the correct format?
My data set is big and my word vectors are 200-D; when I use pad_sequences, it shows a memory error. @farizrahman4u
@BrianMiner the embedding layer in keras, by default, simply transforms integers (one hot representations) into dense vectors of fixed size. For eg, in the docstring,
[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
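A minimal runnable sketch of that lookup (the layer's weights are randomly initialised here, so the exact numbers will differ from the docstring example):
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=100, output_dim=2))   # vocabulary of 100 ids, 2-dim vectors
model.compile('rmsprop', 'mse')
print(model.predict(np.array([[4, 20]])).shape)     # (1, 2, 2): one sequence, two ids, 2 dims each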
@snowxiaoru can you post your code here? It's easier to understand when you have the actual code!
Thanks ! It really helps me!
Tutorials on word2vec and vec2word on RNN?
to anyone who would like some code on converting embeddings to a matrix, I have some here:
https://github.com/braingineer/ikelos/blob/master/ikelos/data/embeddings.py#L66
eventually I'll write up a step, but not any time in May.
Can someone explain why in @sergeyf's snippet there are two additions of the number 1?
First addition:
n_symbols = len(index_dict) + 1 # adding 1 to account for 0th index (for masking)
I fully understand that we have to add a masking row to the beginning of the embedding matrix.
Second addition:
embedding_weights = np.zeros((n_symbols+1,vocab_dim))
Why add 1 again? Also, the variable names are misleading: vocab_dim has nothing to do with the vocabulary; it is the dimension of the embedding vector.
If someone familiar with the Keras implementation can write an organized documentation about the underlying conventions with this pad mask, that would be highly appreciated.
I forget what I did there but see here for a more definitive example: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py
Thank you @sergeyf, the code in the link looks correct, just one addition. Have a nice day...
Skimmed all comments and did not read everything, but IMHO these reasons exist for fixing the embedding:
- You already have your pretrained word vectors, which is what Embedding would give you anyway (except that it also trains).
- A trainable Embedding can be huge, wasting a lot of computational power which one might rather use for training a different part of the net (for example the actual RNN, convolutional layer or whatever). Remember, neural networks are not just about being able to do stuff, but also about being able to do it with the least computational power possible :)
Of course, you can make your own layer which does a matrix lookup for the right vector, but that's basically what Embedding does, just without the training. So I did this:
class FixedLayerMixin:
    # clears trainable_weights after building, so the layer's weights stay fixed
    def build(self, *args, **kwargs):
        super(FixedLayerMixin, self).build(*args, **kwargs)
        self.trainable_weights = []

class FixedEmbedding(FixedLayerMixin, Embedding):
    pass
This mixin allows any layer to work as is, but without training. At least, I hope it does. Any notes of Keras developers on whether what I'm doing is OK are, of course, well appreciated :)
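As a point of comparison, a minimal sketch of the simpler route discussed earlier in the thread, assuming a Keras version where the trainable flag is honoured by Embedding:
from keras.layers import Embedding

# pretrained weights are set once and never updated because the layer is frozen
frozen = Embedding(input_dim=n_symbols, output_dim=vocab_dim,
                   weights=[embedding_weights], trainable=False)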
I still do not understand: can I use an Embedding layer and NOT TUNE IT AT ALL, just as a memory-saving mechanism?
An Embedding Layer has weights, and they need to be tuned: https://github.com/fchollet/keras/blob/master/keras/layers/embeddings.py#L95
There are, however, pretrained word2vec models, which are already trained and hence need not be retrained if they fit your needs.
What should the padding vector be if I am using pretrained word2vec from Google?
Should I use a word like 'stop' and transform it to a vector with the Google word2vec model, or should I just use a vector of zeros?
What happened to the weights argument of the Embedding layer? I am following the tutorial on the blog here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html where we pass in the pre-trained embedding matrix. However, in the latest Keras version, the weights argument has been deleted. The PR that deleted it is here: https://github.com/fchollet/keras/commit/023331ec2a7b0086abfc81eca16c84a1692ee653
@prinsherbert This approach may lead to over-fitting. For example, if you have 100,000 words (a pretty small vocabulary), each with a 100-dim vector, and they are all trainable, you get 10 million free parameters "from nothing".
@MaratZakirov Yes, it is wise to pretrain or just train with a large corpus such as Wikipedia to prevent overfitting. And yes, you need a high document to word ratio.
From what I understand, the Embedding layer in Keras performs a lookup for a word index present in an input sequence, and replaces it with the corresponding vector through an embedding matrix.
However, what I am confused about is what happens when we want to test/apply a model on unknown data. For example, if there is a word in the test document which is not present in the training vocabulary, we could compute the corresponding vector from character n-grams using a pre-trained fastText model. However, this term would not be present in the word_index that was generated while training the model, and a lookup in the embedding matrix would fail.
One possible solution can be to create a word_index from the entire dataset, including the test data, as done for this example: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py
However, I would like to avoid that so that the model is applicable to unknown data.
Any suggestion for workarounds for that?
I think unknown words in general are mapped to random vectors, but this is not what the Embedding-layer does. In text processing people often consider the vocabulary prior knowledge. And if your model is word based, you will unlikely learn about words not seen in the training data anyway.
Tweets are examples of data with many words that you will likely not encounter during training but that do appear in testing, because people make typos. More importantly, they also add (hash)tags, which are typically some concatenation of words. If you search the literature, you'll notice many tweet classifiers use a character-level convolutional layer and then some classifier on top of that (like an LSTM).
So if you want to generalize to new words, consider character level features/classifiers.
It seems like these responses ignore the main issue: was the weights argument removed in Keras 2.0?
@BrianMiner Yes, it seems like embedding weights have indeed been removed. My guess is it can still be used via the option 'embeddings_initializer' and coding a custom initializer which returns embedding weights. @vijay120
@prinsherbert Yes, I am aware of such techniques. I was only wondering of an elegant way to add embeddings of OOVs based on character n-gram embeddings using Fast-text during test to avoid completely ignoring OOV terms in a word based model.
Hello @madhumita-git @prinsherbert
I also have a similar scenario, where I want to use a character-based model for another NLP task.
Basically, my input data is a 3D tensor containing _n_ sentences, each containing _m_ words, each represented as a vector of _o_ characters. So, 1st dimension = batch, 2nd dimension = temporal dimension (max length of sentence), 3rd dimension = max characters per word.
So that my model starts with:
word_input = Input((self.max_length, self.max_word_length))
Now I would like to use characters embeddings on the character level (and on the top of this, using 1D convolution+MaxPooling to obtain a fixed-size vector representation of the word, in a similar way as in this paper: "Learning Character-level Representations for Part-of-Speech Tagging" http://proceedings.mlr.press/v32/santos14.pdf).
Any idea about how I could use an Embedding layer in such a way in Keras?
If you are working on large data, it is recommended to directly use word vectors as input to the LSTM layer rather than having an embedding layer. This avoids matrix multiplication, which takes a lot of time when there are many sequences.
@sergeyf
Hi! Is it just me or is your embedding_weights numpy zeros array a +1 too wide? You've already set the width to n_symbols = len(index_dict) + 1 to account for the 0th index, but in your embedding_weights it's embedding_weights = np.zeros((n_symbols+1,vocab_dim)), which would be the same as the original n_symbols = len(index_dict) + 2. Why the extra +1 to the length?
vocab_dim = 300 # dimensionality of your word vectors
n_symbols = len(index_dict) + 1 # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols+1, vocab_dim))
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]

# assemble the model
model = Sequential() # or Graph or whatever
model.add(Embedding(output_dim=rnn_dim, input_dim=n_symbols + 1, mask_zero=True, weights=[embedding_weights])) # note you have to put embedding weights in a list by convention
model.add(LSTM(dense_dim, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(n_symbols, activation='softmax')) # this is the architecture for predicting the next word, but insert your own here
I think that +2 thing only made sense in an old version of Keras, and I
absolutely can't remember why anymore. But I do remember being annoyed.
I should probably update this comment, huh? It's like documentation by now!
Hello!
Following this:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
I tried to use pretrained word embeddings with my Embedding layer in Keras.
Yet I am getting this error:
ValueError: Layer weight shape (10000, 100) not compatible with provided weight shape (88585, 100)
at this line:
model.add(Embedding(max_features, 100, input_length=max_review_length,mask_zero=True, weights=[embedding_matrix]))
From what I see, Keras 2+ does not support embedding weights (yes?).
I've tried older Keras 1.2 and 1.1.2 versions, but they still gave me the same error.
Anyone can advise whether I am doing something wrong?
Or what would be the proper way to use my own embeddings in Embedding layer?
Thanks!
Providing the code I am using below:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard
from gensim.models import word2vec
import numpy as np
import os
import keras
#Using keras to load the dataset with the top_words
max_features = 10000 #max number of words to include, words are ranked by how often they occur (in training set)
max_review_length = 1600
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print 'loaded dataset...'
#Pad the sequence to the same length
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
index_dict = keras.datasets.imdb.get_word_index()
print 'loading glove...'
embeddings_index = {}
f = open(os.path.join('/home/ejaksla/PycharmProjects/MachineLearningPlayground/BachelorDegree/glove_word2vec/glove.6B/', 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print 'creating embedding matrix...'
embedding_matrix = np.zeros((len(index_dict) + 1, 100))
for word, i in index_dict.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
print('Found %s word vectors.' % len(embeddings_index))
print 'assembling model..'
# Using embedding from Keras
model = Sequential()
model.add(Embedding(max_features, 100, input_length=max_review_length,mask_zero=True, weights=[embedding_matrix]))
Is there any consensus on whether @sergeyf's approach still works? It does indeed appear that the weights argument has been removed, but it's still being used in the examples here..
Have you run this and made sure it fails?
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
Says updated for keras 2
Hi everyone,
Looks like this is still a references for some people. Here is what I do now with Keras 2.0.8:
def set_embedding_layer_weights(embedding_layer, pretrained_embeddings):
    dense_dim = pretrained_embeddings.shape[1]
    weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings))
    embedding_layer.set_weights([weights])

# load up your pretrained_embeddings here
d = pretrained_embeddings.shape[1] # should be np.array
embedding_layer = Embedding(output_dim=d, input_dim=n_vocab, trainable=True)
embedding_layer.build((None,)) # if you don't do this, the next step won't work
set_embedding_layer_weights(embedding_layer, pretrained_embeddings)
Note! This version assumes that the pretrained_embeddings array does not come with a mask first row, and explicitly makes an all-zeros row for it here: weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings)). If you already have a special mask row, then feel free to just do embedding_layer.set_weights([pretrained_embeddings]).
Hope that helps.
Thanks for the reply both!
@AllardJM It doesn't error, no. I've done a little investigation, and with the following code I've come to the conclusion that it is being utilised and, in fact, the original solution (using a weights kwarg) that @sergeyf posted does still work.
import numpy as np
from keras import initializers
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
weights = np.concatenate([
    np.zeros((1, 100)),  # Masking row: all zeros.
    np.ones((1, 100))    # First word: all weights preset to 1.
])
print(weights)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1.]])
layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True,
    weights = [weights],
)
model = Sequential([
    layer,
    LSTM(2, dropout = 0.2, activation = 'tanh'),
    Dense(1, activation = 'sigmoid')
])
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = []
)
print(layer.get_weights())
[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]
The recent solution above also appears to work, but is probably more inefficient as it's initialising the weights and then overwriting them. This can be seen as shown:
layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True
)
layer.build((None,))
print(layer.get_weights())
[array([[ -2.64064074e-02, -4.05902900e-02, -1.71032399e-02,
6.36395207e-03, 4.03554030e-02, -2.91514937e-02,
-3.05371974e-02, 1.60062015e-02, -4.58858572e-02,
-2.71607353e-03, -6.45029533e-04, -3.60430926e-02,
-4.47065122e-02, -4.46958952e-02, 8.49759020e-03,
-2.07597855e-02, -4.63474654e-02, -4.47412431e-02,
.....
layer.set_weights([weights])
print(layer.get_weights())
[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]
Great! This is what I found as well, the example code still works fine to pass in weights to initialize the embedding.
Cheers!
Is there standard code or a function that takes a model built with gensim word2vec and converts it into the dictionary formats (i.e. index_dict and word_vectors in the first comment above)? Otherwise I will write my own code for this, but that seems much less efficient.
Thanks!
@aksg87 You could use the gensim.models.keyedvectors.KeyedVectors.get_keras_embedding method? The KeyedVectors instance is accessible from a Word2Vec instance via the wv attribute, for example:
model = Word2Vec.load(fname)
embedding_layer = model.wv.get_keras_embedding(train_embeddings=True)
Source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L1048
Thank you so much for your reply. I ended up finding some examples and wrote it out; I'll have to try the version you provided.
Also, if you set train_embeddings=True, the weights in the layer will change from the word2vec output. Is this advisable in general?
The code I wrote to do this:
embeddings_index = dict()
f = open('vectors.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

dim_len = len(coefs)
print('Dimension of vector %s.' % dim_len)

embedding_matrix = zeros((vocab_size, dim_len))
for word, i in tqdm(t.word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and np.shape(embedding_vector) != (202,):
        embedding_matrix[i] = embedding_vector
    if np.shape(embedding_vector) == (202,):
        print(i)
        print("embedding_vector", np.shape(embedding_vector))
        print("embedding_matrix", np.shape(embedding_matrix[i]))
Another question I have: my final output is a softmax prediction over several classes (396 to be exact).
The output vector is messy (see below).
Is there a clean way to both 1) convert this into the top 3 predicted labels and 2) write a custom accuracy function which checks how often the softmax predicts within the top 3?
array([ 2.74735111e-22, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 3.84925198e-38, 0.00000000e+00,
1.72161353e-34, 1.86862336e-26, 6.87889553e-07,
1.09056833e-04, 1.17705227e-26, 6.17638065e-08,
6.54662412e-23, 3.28686365e-05, 4.67332768e-08,
0.00000000e+00, 5.22176857e-10, 4.09760102e-38,
0.00000000e+00, 5.86631461e-17, 1.14025260e-08,
4.42352757e-07, 8.37238900e-08, 0.00000000e+00,
1.48040133e-14, 3.42079135e-14, 2.47516301e-20,
...
Also, if you set train_embeddings=True the weights in the layer will change from the word2vect output is this advisable in general?
I don't think there's a 'correct' answer to this - it's up to you and the problem you're modelling. By having a trainable embeddings layer the weights will be tuned for the model's NLP task. This will give you more domain specific weights at the cost of increased training time.
It's quite common to train initial weights on a large corpus (or to use a pre-trained third party model) and then use that to seed your embedding layer. In this case you will likely find benefit if you do train the embeddings layer with the model. However, if you've trained your own Word2Vec model on exactly the domain you're modelling, you may find that the difference in results is negligible and that training the layer is not preferential over a shorter training time.
Is their a clean way to convert this into the top 3 labels predicted
To do this you could use numpy's argpartition
method.
>>> predictions = np.array([0.1, 0.3, 0.2, 0.4, 0.5])
>>> top_three_classes = np.argpartition(predictions, -3)[-3:]
>>> top_three_classes
array([1, 3, 4])
Write a custom accuracy function which checks how often the softmax predicts the top 3?
Yes this should be fairly straightforward utilising the above logic and a custom metric class or function.
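For example, a hedged sketch of such a metric, assuming a Keras version that ships top_k_categorical_accuracy (otherwise the same thing can be written with backend ops):
from keras.metrics import top_k_categorical_accuracy

def top_3_accuracy(y_true, y_pred):
    # fraction of samples whose true class is among the 3 highest-scoring outputs
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

# model.compile(optimizer='adam', loss='categorical_crossentropy',
#               metrics=[top_3_accuracy])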
To calculate accuracy, I created a few functions and used map to apply them to my predictions, which essentially tells me how often my model's 'top 3' prediction contains the true answer. (At the very end I basically counted 'True' vs 'False' to arrive at a percentage. I thought Keras might have a way to override its accuracy function, but didn't see a way.)