When I run the following code, I get a TypeError.
import cPickle as pickle
import re
with open('titles.pkl') as t:
    titles = pickle.load(t)
import gensim
from gensim.models.doc2vec import TaggedDocument, LabeledSentence, Doc2Vec
sentences = list(LabeledSentence(re.sub('[^a-zA-Z]', ' ', key).lower().split(), ("ID_" + str(value),)) for key, value in titles)
model = Doc2Vec(size=200, min_count=1, workers=16)
model.build_vocab(sentences)
model.alpha = 0.025
for epoch in range(20):
    model.train(sentences)
    model.alpha -= 0.001
    model.min_alpha = model.alpha
# this part of the code is in the documents for docvecs
docvec = model.docvecs[99]
sims = model.docvecs.most_similar(docvec)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/gensim_title_recommender/venv/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 428, in most_similar
for doc, weight in positive + negative:
TypeError: 'numpy.float32' object is not iterable
That sample should work, no? titles is an array of arrays where each entry holds a document title and ID as `key, value`. My goal here is to be able to provide a new title and get back one of the original IDs, i.e., use what I understand are the similarity functions in doc2vec. Other calls to the model work:
sentence = u'This is my super cool title'
doc = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
model.most_similar(positive=doc)
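Pulled out as a helper, the cleanup both snippets apply to raw titles is just this (pure stdlib):

```python
import re

def tokenize_title(title):
    # Keep letters only, lowercase, split on whitespace --
    # the same cleanup used on each title at training time.
    return re.sub('[^a-zA-Z]', ' ', title).lower().split()

print(tokenize_title(u'This is my super cool title'))
# ['this', 'is', 'my', 'super', 'cool', 'title']
```

Note that apostrophes become spaces, so possessives split into separate tokens (e.g. "Kiermaier's" becomes 'kiermaier' and 's').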
So I believe I've built the model correctly, and that this is a bug.
In case it's relevant, this is on Ubuntu 14.04.
Requirements.txt:
behave>=1.2.4
boto>=2.32.0
pep8
python-dateutil
gevent
grequests
psycopg2
cython
scipy==0.15.1
gensim
Provisioning script:
#!/bin/bash
add-apt-repository -y ppa:rwky/redis
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" update
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
apt-get install -y redis-server ntp python-software-properties software-properties-common python-virtualenv
apt-get install -y python-dev python-pip git gfortran libopenblas-dev liblapack-dev postgresql postgresql-contrib libpq-dev
pip install virtualenv
Try supplying the 'positive' example vector inside a list – [docvec]. (The code should, but doesn't yet, properly recognize a single ndarray the way it does a single int index or string tag.)
Ah, that got it!
>>> sims = model.docvecs.most_similar([docvec])
>>> sims
[('VID_29538736', 1.0), ('VID_29543021', 0.613202691078186), ('VID_29543020', 0.5868339538574219), ('VID_29538735', 0.5410645008087158), ('VID_29601774', 0.5222368240356445), ('VID_29464301', 0.5215977430343628), ('VID_29491783', 0.5171146392822266), ('VID_29463134', 0.5082449913024902), ('VID_29473381', 0.5064586400985718), ('VID_29616925', 0.5051187872886658)]
I guess those docs are a bit out of date is all.
So how would I query this model for a new document to try to get a similar doc by ID? Ideally, I'd do something like the query I had originally posted:
>>> sentence = u'Students get back in the classroom'
>>> doc = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
>>> model.docvecs.most_similar(positive=doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/gensim_title_recommender/venv/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 435, in most_similar
raise KeyError("doc '%s' not in trained set" % doc)
KeyError: u"doc 'everything' not in trained set"
I tried to follow the example for LSI, but that seems incorrect for doc2vec (i.e., the whole point is not to use the single-word corpus, but to look at documents as a whole). Ideally, I'd calculate a vector for that sentence, and then query for the most similar vector among the docvecs in the model (or get an error, if it has words the model has never seen before). I could do something like infer_vector, or calculate the vectors of individual words and then dot-product them all together, etc., but that seems wrong. The sentence above does give me some pretty interesting results from my model, but ultimately I want one of those VID strings I used in training.
>>> sent = u'Students get back in the classroom'
>>> doc = re.sub('[^a-zA-Z]', ' ', sent).lower().split()
>>> model.most_similar(positive=doc)
[(u'teachers', 0.44251883029937744), (u'class', 0.4367973506450653), (u'dorm', 0.40028077363967896), (u'kids', 0.3963126540184021), (u'apply', 0.39425891637802124), (u'freshmen', 0.3912411034107208), (u'to', 0.38409990072250366), (u'healthy', 0.38326704502105713), (u'helping', 0.3827580213546753), (u'start', 0.37929749488830566)]
Yes, you would use infer_vector() to evolve a reasonable model-compatible vector for the new document's tokens, and then supply that raw vector as the positive example. So roughly like:
new_doc_vec = model.infer_vector(doc)
model.docvecs.most_similar([new_doc_vec])
A good sanity check is to re-infer a vector for a document already in the model. The nearest doc will (usually) be the same document, if the model is well-trained and the documents are meaningfully distinguishable. ('Usually' because there's effectively some randomness in the process.)
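Under the hood, most_similar() is just ranking the stored doc vectors by cosine similarity against the query vector. A minimal pure-Python sketch of that ranking (the names and toy vectors here are illustrative, not gensim's internals):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query_vec, docvecs, topn=3):
    # Rank every stored doc vector by similarity to the query.
    sims = [(tag, cosine(query_vec, vec)) for tag, vec in docvecs.items()]
    return sorted(sims, key=lambda kv: -kv[1])[:topn]

docvecs = {
    'VID_1': [1.0, 0.0],
    'VID_2': [0.6, 0.8],
    'VID_3': [0.0, 1.0],
}
print(most_similar([1.0, 0.1], docvecs))
# [('VID_1', 0.995...), ('VID_2', 0.676...), ('VID_3', 0.099...)]
```

This is why the sanity check works: a well-re-inferred vector points almost exactly the same direction as the bulk-trained vector for the same document, so that document tops the ranking.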
Any words not part of the model's vocabulary – whether because they didn't make a min_count cutoff, or are all-new in a new inferred text – are simply ignored, as if they weren't part of the provided example text.
As you note, the 'paragraph vectors' training methods don't primarily compose a doc vector from word vectors. (Sometimes, such a straightforward sum or average of word vectors is used as a baseline against which to compare doc vectors from other methods.) Rather, the doc vector is induced based on its ability to predict words – sometimes even without any contribution from other word vectors (in plain DBOW mode), or mixed with word vectors that are not a precursor input but co-trained at the same time (the DM modes, or DBOW with interleaved dbow_words=1 training). Definitely try DBOW (dm=0) mode; it can be faster/better for many uses. (Don't take the Doc2Vec class defaults as best-practice recommendations; they're somewhat arbitrary, and in any case different datasets/end-purposes seem to benefit from different modes.)
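For comparison, the crude sum-or-average-of-word-vectors baseline mentioned above is trivial to sketch in pure Python (toy 2-d vectors here; real word vectors would come from a trained model, and this is explicitly *not* what Doc2Vec training does):

```python
def average_vector(words, word_vecs):
    # Average the vectors of in-vocabulary words; out-of-vocabulary
    # words are ignored, mirroring how unknown words are dropped.
    known = [word_vecs[w] for w in words if w in word_vecs]
    if not known:
        raise ValueError('no in-vocabulary words in input')
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

word_vecs = {'students': [1.0, 0.0], 'classroom': [0.0, 1.0]}
print(average_vector(['students', 'get', 'classroom'], word_vecs))
# [0.5, 0.5] -- 'get' is out-of-vocabulary and ignored
```

A doc vector produced by infer_vector() will generally behave differently from this average, because it's optimized to predict the document's words rather than composed from them.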
That's definitely working, thanks!
Would it make sense to mess around with the infer_vector function? I would expect that a training example that was fed into the model would return a high score (i.e., > .9), but I'm not seeing that. For instance, I see:
>>> titles[100]
[u"Brad Boxberger on Kevin Kiermaier's play", 29484932]
>>> sent = u"Brad Boxberger on Kevin Kiermaier's play"
>>> doc = re.sub('[^a-zA-Z]', ' ', sent).lower().split()
>>> inf_vec = model.infer_vector(doc)
>>> model.docvecs.most_similar([inf_vec])
[('VID_29619465', 0.7008367776870728), ('VID_29619425', 0.673180878162384), ('VID_29484932', 0.6729332208633423), ('VID_29583883', 0.6665624380111694), ('VID_29583877', 0.6541317701339722), ('VID_29619468', 0.6540207862854004), ('VID_29583870', 0.6502735614776611), ('VID_29585875', 0.6245695352554321), ('VID_29588914', 0.6218901872634888), ('VID_29578095', 0.6111888885498047)]
>>> for title in titles:
...     if title[1] == 29619465:
...         print title
...
[u'Play of the Week: Play 2', 29619465]
So it's not an entirely unreasonable retrieval (it got that it's a play, and a sports play), but the score is pretty low, and the right doc is the third in the list.
Do you think this is just a matter of tweaking model construction, or could something else be at play here? That's why I'm thinking of looking at infer_vector, especially since I'm not limiting terms during model construction, i.e., min_count=1.
Some thoughts:
- Tuning infer_vector's optional arguments might make the inference more consistent with the bulk-trained vectors (more likely to find a vector closest to the bulk vector for the same input).
- In plain DBOW mode, without the dbow_words=1 parameter, word vectors aren't trained at all (they'll be their randomized initial values at the end of training).

This was more than enough information to get me started, thanks!